NGS scrap: 2018

Nov 27, 2018

HiSeq, MiSeq, NovaSeq: read length, number of clusters per lane, bases per lane

https://medicine.yale.edu/keck/ycga/sequencing/illumina/hiseq.aspx

Sequencer	Read length	# of Clusters per lane (millions)	Bases per lane (Gbp)
HiSeq 2500 Rapid	1x75	150	11.25
HiSeq 2500 Rapid	2x75	150	22.5
HiSeq 2500 Rapid	2x150	150	45
HiSeq 2500 High-output	1x75	200	15
HiSeq 2500 High-output	2x75	200	30
HiSeq 4000	2x100	300	60
HiSeq 4000	2x150	300	90
NovaSeq S2	2x100	1650	330
NovaSeq S2	2x150	1650	500
NovaSeq S4	2x150	2000	600
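
The "Bases per lane" column appears to follow from the other two columns: reads per cluster x read length x clusters. A minimal sketch of that arithmetic, assuming the read length is written as in the table (1x75, 2x150); the function name lane_gbp is made up for this note.

```shell
# lane_gbp READ_LENGTH CLUSTERS_IN_MILLIONS
# Prints the expected yield per lane in Gbp.
lane_gbp() {
    awk -v rl="$1" -v clusters="$2" 'BEGIN {
        split(rl, part, "x")    # part[1] = reads per cluster, part[2] = bases per read
        print part[1] * part[2] * clusters / 1000
    }'
}

lane_gbp 1x75 150      # HiSeq 2500 Rapid -> 11.25
lane_gbp 2x150 300     # HiSeq 4000       -> 90
lane_gbp 2x150 1650    # NovaSeq S2       -> 495
```

The NovaSeq S2 2x150 row works out to 495 Gbp; the table lists 500, presumably rounded.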

https://www.fasteris.com/dna/?q=node/41

Illumina HiSeq Services:

Service Type	Run (Mode)	Number of pass filter clusters ⁽¹⁾	Yield ⁽¹⁾	Sequencing Run Time ⁽²⁾
HiSeq 4000 ⁽³⁾ - 1x lane	1x50 bp	Up to 350 million, average yield 280-300 million (*)	17.5 Gb	1 - 4 days
	1x150 bp		52.5 Gb
	2x75 bp		52.5 Gb
	2x150 bp		105 Gb
HiSeq 2500, v4 ⁽⁴⁾ - 1x lane	2x125 bp	Up to 280 million, average yield 240-260 million (*)	70 Gb	8 days
HiSeq 2500 Rapid Run, v2 - 1x flow cell run	1x50 bp	Up to 300 million, average yield 240-260 million (*)	15 Gb	1 - 3 days
	1x125 bp		37.5 Gb
	2x50 bp		30 Gb
	2x125 bp		75 Gb
	2x250 bp		150 Gb
	2x300 bp		180 Gb
1x test run (done on MiSeq Nano) ⁽⁵⁾	1x50 bp	500'000 to 1 million reads	50 Mb	+ 4 days
1x test run (done on MiSeq Nano) ⁽⁵⁾	1x150 bp	500'000 to 1 million reads	150 Mb	+ 4 days

(1) All the calculations have been made with specifications and typical Fasteris results when using optimized loading conditions. Individual results may vary.
(2) This time is the "pure" running time of the system and includes time for washing. Not included is the time for doing the entry QC, library preparation, collection of other lane projects and/or the time for de-multiplexing of the derived data files.
(3) On HiSeq 4000, we typically can reach 300-320 million passed-filter DNA clusters, up to 350 million or more pass filter DNA clusters per lane.
(4) On HiSeq 2500 we typically can reach 250-280 million passed-filter DNA clusters per lane. For libraries with a test run done, we guarantee a minimum of 250 million pass filter DNA clusters per lane.
(5) A test sequencing run is an extra service step needed to be ordered when min guaranteed data output and min guaranteed data yield is needed, especially for sequencing of customer-provided ready-to-run (RTR) libraries.
(*) Average yields are given by statistic analysis of our runs; not included are biased or non-standard samples. Conditions and Details are subject to changes without notice.

https://www.illumina.com/systems/sequencing-platforms/miseq/specifications.html

Cluster Generation and Sequencing

	MiSeq Reagent Kit v2				MiSeq Reagent Kit v3
Read Length	1 × 36 bp	2 × 25 bp	2 × 150 bp	2 × 250 bp	2 × 75 bp	2 × 300 bp
Total Time*	~4 hrs	~5.5 hrs	~24 hrs	~39 hrs	~21 hrs	~56 hrs
Output	540–610 Mb	750–850 Mb	4.5–5.1 Gb	7.5–8.5 Gb	3.3–3.8 Gb	13.2–15 Gb

	MiSeq Reagent Kit v2 Micro	MiSeq Reagent Kit v2 Nano	MiSeq Reagent Kit v2 Nano
Read Length	2 × 150 bp	2 × 250 bp	2 × 150 bp
Total Time*	~19 hrs	~28 hrs	~17 hrs
Output	1.2 Gb	500 Mb	300 Mb

* Total time includes cluster generation, sequencing, and base calling on a MiSeq System enabled with dual-surface scanning.

Reads Passing Filter**

	MiSeq Reagent Kit v2	MiSeq Reagent Kit v3	MiSeq Reagent Kit v2 Micro	MiSeq Reagent Kit v2 Nano
Single Reads	12-15 million	22–25 million	4 million	1 million
Paired-End Reads	24–30 million	44–50 million	8 million	2 million

** Install specifications based on Illumina PhiX control library at supported cluster densities (865-965 k/mm² clusters passing filter for v2 chemistry and 1200-1400 k/mm² clusters passing filter for v3 chemistry). Actual performance parameters may vary based on sample type, sample quality, and clusters passing filter.

Nov 22, 2018

Mount disks with HFS+ volumes (used in Mac OS) in CentOS

http://opensysblog.directorioc.net/2015/07/centos-mount-disks-with-hfs-volumes.html
https://www.centos.org/forums/viewtopic.php?t=67360

https://askubuntu.com/questions/332315/how-to-read-and-write-hfs-journaled-external-hdd-in-ubuntu-without-access-to-os

$ sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org

$ sudo rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
$ sudo yum install kmod-hfsplus

$ sudo yum install kmod-hfs 
$ sudo yum install hfsplus-tools

After installing, you might need to reboot.

By default, the partition is mounted in read-only mode.
Use the force option to enable read-write mode.

$ sudo mount -t hfsplus -o force -o rw /dev/sdc2 /media/disk1/
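
To confirm whether the volume really ended up read-write, the mount options can be read back from /proc/mounts. A minimal sketch, assuming a Linux system; the function name mount_mode is made up for this note, and the optional second argument exists only so the function can be tried on a saved copy of the mounts table.

```shell
# mount_mode MOUNTPOINT [MOUNTS_FILE]
# Prints "rw", "ro", or "not mounted". MOUNTS_FILE defaults to /proc/mounts.
mount_mode() {
    # A trailing slash is dropped: /proc/mounts lists mount points without one.
    awk -v mp="${1%/}" '
        $2 == mp {                         # field 2 is the mount point
            n = split($4, opt, ",")        # field 4 is the comma-separated option list
            for (i = 1; i <= n; i++)
                if (opt[i] == "rw" || opt[i] == "ro") mode = opt[i]
        }
        END { print (mode == "" ? "not mounted" : mode) }
    ' "${2:-/proc/mounts}"
}
```

Usage: mount_mode /media/disk1/ should print rw after the forced mount above, and ro if the volume was mounted without the force option.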

Nov 18, 2018

awk Command In Text Processing

30 Examples For Awk
https://likegeeks.com/awk-command/

awk and, or, not operator
https://www.poftut.com/awk-logical-operators-not/

awk
awk is mainly used to search for patterns and manipulate text field by field. It recognizes the fields of each line (record) in a file, can match patterns against them, and can act on the matches.

awk [-f program_file] [-F field_separator] ['pattern {action}'] [input_file]

Options

-f program_file
load the awk program (the patterns and actions to run) from a file

-F field_separator
set the field separator

'pattern {action}'
if the pattern matches, the action is run

Example

$> awk -F : '{print $1, $6}' ./text.txt
= in text.txt, use ":" as the field separator and print the 1st and 6th fields
(use single quotes: inside double quotes the shell expands $1 and $6 before awk sees them)

Structure of patterns and actions
The 'pattern {action}' argument can take several forms. The three most common are:

1. BEGIN
- run the action before the first record is read

2. END
- run the action after the last record has been read

3. PATTERN
- evaluated for each input line (record); if the line matches the pattern, the action is run
- a regular expression is written as /regular_expression/
- pattern only: lines that match the pattern are printed
- action only: the action is applied to every line

awk system variables
These are variables that awk maintains internally. Using them makes awk a little more efficient to use.

Variable	Meaning
FILENAME	name of the file currently being processed
FS	field separator; the default is whitespace
RS	record separator; the default is a newline
NF	number of fields in the current record
NR	number of the current record
OFS	FS used for output
ORS	RS used for output
$0	the entire input record
$n	the n-th field of the input record

awk usage examples

awk '{print FILENAME}' test.txt
= prints the file name once for every record in test.txt

awk '{print NR}' test.txt
= prints the record number of each record in test.txt

awk 'BEGIN {FS="\t"} {print $1 , $2}' test.txt
= sets the field separator for test.txt to "\t" and prints the 1st and 2nd fields

awk 'BEGIN {FS="\t"; OFS ="-"} {print $1 , $2} END {print "Total number of records : " NR}' test.txt
= sets the field separator for test.txt to "\t", prints the 1st and 2nd fields with "-" as the output field separator, and, after all records have been processed, prints the total number of records

Source: http://ra2kstar.tistory.com/153 [초보개발자 이야기]
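
The BEGIN / pattern / END structure and the FS, OFS and NR variables can be tried together on a few lines of input. A minimal sketch; the sample records (sample name, reads in millions, platform) and the function names are made up for this note.

```shell
# Hypothetical tab-separated records: sample name, reads (millions), platform
sample_data() {
    printf 'S1\t280\tHiSeq4000\n'
    printf 'S2\t25\tMiSeq\n'
    printf 'S3\t310\tHiSeq4000\n'
}

summarize() {
    awk '
        BEGIN    { FS = "\t"; OFS = "-" }    # runs once, before the first record
        $2 > 100 { print NR, $1, $2 }        # pattern + action: only matching records
        END      { print "records: " NR }    # runs once, after the last record
    '
}

sample_data | summarize
# 1-S1-280
# 3-S3-310
# records: 3
```

Because OFS is "-", the comma-separated items of print are joined with "-"; NR in the END block still holds the number of the last record read.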

and

(CONDITION && CONDITION && ... )

or

(CONDITION || CONDITION || ... )

not

!(CONDITION)
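
A minimal sketch of the three operators used as awk patterns; the input records (name, reads in millions, platform) and the function name reads are made up for this note.

```shell
# Hypothetical space-separated records: name, reads (millions), platform
reads() {
    printf 'S1 280 HiSeq\nS2 25 MiSeq\nS3 310 HiSeq\nS4 2 MiSeq\n'
}

reads | awk '$3 == "HiSeq" && $2 > 300 { print $1 }'    # and: prints S3
reads | awk '$2 < 10 || $2 > 300 { print $1 }'          # or:  prints S3 and S4
reads | awk '!($3 == "HiSeq") { print $1 }'             # not: prints S2 and S4
```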

Nov 12, 2018

Add a New Disk Larger Than 2TB to An Existing Linux

see more details in
https://www.tecmint.com/add-disk-larger-than-2tb-to-an-existing-linux/

# list current partitions

[lee@hwang29 ~]$ sudo fdisk -l

# move to disk
[lee@hwang29 ~]$ sudo fdisk /dev/sda

d : delete a partition # warning: the data on that partition will be lost
w : write table to disk and exit

# write partition with parted
[lee@hwang29 ~]$ sudo parted /dev/sda

(parted) mklabel gpt # set partition table format to GPT (GUID Partition Table)
(parted) mkpart primary 0GB 3000GB # create a primary partition and assign the disk capacity
(parted) quit
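
Why GPT is needed here: an MBR (msdos) partition table stores sector numbers in 32 bits, so the largest addressable size is 2^32 sectors. A quick check of that arithmetic, assuming 512-byte logical sectors (the usual value, but an assumption about this disk); the function name is made up for this note.

```shell
# mbr_limit_bytes [SECTOR_SIZE]
# Largest size an MBR partition table can address, in bytes.
mbr_limit_bytes() {
    echo $(( (1 << 32) * ${1:-512} ))
}

mbr_limit_bytes    # 2199023255552 bytes = 2 TiB, about 2.2 TB
```

A 3000GB partition is above that limit, which is why the label is set to gpt before mkpart.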

# create an ext4 file system on the new partition (/dev/sda1, not the whole disk /dev/sda)
[lee@hwang29 ~]$ sudo mkfs.ext4 /dev/sda1

# mount the partition
[lee@hwang29 ~]$ sudo mount /dev/sda1 /data_1

# add an entry to /etc/fstab for permanent mounting
[lee@hwang29 ~]$ sudo vim /etc/fstab
/dev/sda1 /data_1 ext4 defaults 0 0

Tip for naming the disk
- use /dev/disk/by-id/scsi-SATA_ModelName-part1 instead of /dev/sda1
- if you use multiple disks or partitions and one of them fails, it is easy to identify which disk and partition went wrong
- names like sda and sdb can be reassigned when you change, add, or move a disk, so an entry that relies on them may end up pointing at a different disk
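
A minimal sketch for looking up the persistent name of a device, assuming a Linux system where udev populates /dev/disk/by-id and GNU readlink is available. The function name by_id_names is made up for this note.

```shell
# by_id_names DEVICE [DIR]
# Print every name in DIR (default /dev/disk/by-id) that is a symlink to DEVICE.
by_id_names() {
    target=$(readlink -f "$1")
    for link in "${2:-/dev/disk/by-id}"/*; do
        if [ "$(readlink -f "$link")" = "$target" ]; then
            basename "$link"
        fi
    done
    return 0
}
```

Usage: by_id_names /dev/sda1 prints the by-id name or names of that partition. The /etc/fstab entry can then use /dev/disk/by-id/NAME in place of /dev/sda1, where NAME is whatever the function printed.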
