
May 31, 2019

Apr 8, 2019

Statistical significance of the overlap between two groups of genes : hypergeometric distribution


Statistical significance of the overlap between two groups of genes

http://nemates.org/MA/progs/overlap_stats.html


How do I calculate if the degree of overlap between two lists is significant?

https://stats.stackexchange.com/questions/267/how-do-i-calculate-if-the-degree-of-overlap-between-two-lists-is-significant



If I understand your question correctly, you need to use the hypergeometric distribution. This distribution is usually associated with urn models, i.e., there are n balls in an urn, y of them are painted red, and you draw m balls from the urn. If X is the number of red balls in your sample of m, then X has a hypergeometric distribution.
For your specific example, let $n_A$, $n_B$ and $n_C$ denote the lengths of your three lists and let $n_{AB}$ denote the overlap between A and B. Then

$n_{AB} \sim \mathrm{HG}(n_A, n_C, n_B)$

To calculate a p-value, you could use this R command:
# Some example values
n_A = 100; n_B = 200; n_C = 500; n_A_B = 50
1 - phyper(n_A_B, n_B, n_C - n_B, n_A)
[1] 0.008626697
Word of caution: remember multiple testing, i.e., if you have lots of A and B lists, you will need to adjust your p-values with a correction, for example the FDR or Bonferroni correction.

csgillespie's answer seems correct except for one thing: it gives the probability of seeing strictly more than n_A_B in the overlap, P(x > n_A_B), but I think the OP wants the p-value P(x >= n_A_B). You could get the latter with
n_A = 100; n_B = 200; n_C = 500; n_A_B = 50
phyper(n_A_B - 1, n_A, n_C - n_A, n_B, lower.tail = FALSE)
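
The difference between the two tail definitions can be checked by summing the hypergeometric probabilities directly. The following is a minimal Python sketch and is not part of either answer: the function name hypergeom_tail and its argument names are made up for illustration, and n_C is assumed to be the size of the whole gene universe.

```python
from math import comb


def hypergeom_tail(k, universe, marked, drawn):
    """P(X >= k) when `drawn` items are sampled without replacement from
    `universe` items, of which `marked` are of interest."""
    upper = min(marked, drawn)
    # Count the favourable samples exactly with integers, then divide once
    favourable = sum(
        comb(marked, i) * comb(universe - marked, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(universe, drawn)


# Same example values as the R snippets above
n_A, n_B, n_C, n_A_B = 100, 200, 500, 50

p_at_least = hypergeom_tail(n_A_B, n_C, n_B, n_A)           # P(X >= 50)
p_strictly_more = hypergeom_tail(n_A_B + 1, n_C, n_B, n_A)  # P(X > 50)

print(f"P(X >= {n_A_B}) = {p_at_least:.6g}")
print(f"P(X >  {n_A_B}) = {p_strictly_more:.6g}")
```

The first value includes the observed overlap itself and corresponds to the phyper(n_A_B - 1, ..., lower.tail = FALSE) form; the second corresponds to 1 - phyper(n_A_B, ...). Swapping the roles of n_A and n_B gives the same result, which is why the two R calls can order them differently.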

samtools rmdup vs Picard MarkDuplicates

samtools rmdup PE
http://seqanswers.com/forums/showthread.php?t=5959

If you have one pair of reads where read 1 starts at position 100, and the other end starts at position 200, and a second pair of reads where read 1 starts at position 100, and read 2 starts at position 250, those came from different fragments of DNA. You can tell because the read 2 start is different, even though the read 1 start is the same.

When treating the reads as paired end, none of those reads should be deleted as PCR duplicates.

However, if you run rmdup -S, which treats paired-end reads as single-end reads, the software will not check whether read 2 has a different start coordinate, so one of those read 1 reads will be treated as a duplicate and deleted.
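
The effect of ignoring read 2 can be shown with a toy model. This is only an illustrative Python sketch using the positions from the example above; it ignores chromosome, strand and mapping quality and is not how samtools is actually implemented. The helper name count_duplicates is hypothetical.

```python
# Each pair is (read 1 start, read 2 start), taken from the example above
pairs = [(100, 200), (100, 250)]


def count_duplicates(keys):
    """Count entries whose key was already seen earlier in the list."""
    seen = set()
    duplicates = 0
    for key in keys:
        if key in seen:
            duplicates += 1
        seen.add(key)
    return duplicates


# Paired-end view: both start coordinates form the key
paired_end_dups = count_duplicates(pairs)

# Single-end view: only the read 1 start is compared
single_end_dups = count_duplicates([read1 for read1, _ in pairs])

print(paired_end_dups, single_end_dups)  # prints: 0 1
```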


Question: Samtools Dedup Documentation
 https://www.biostars.org/p/55111/

rmdup for PE reads is pretty straightforward. It looks for identical external coordinates, meaning it only looks at the 5' start coordinates of read pairs in FR orientation. Then it keeps the pair with the highest mapping quality.
For SE reads, I've read that samtools also only looks for identical 5' start coordinates, not both start and end coordinates. I think the idea is that sequencing quality usually falls toward the 3' end, so after mapping, duplicate reads have a higher chance of mapping differently toward the 3' end. It therefore only compares the adapter-trimmed 5' start when looking for duplicates.


Question: Picard MarkDuplicates and SamTools rmdup algorithm documentation
https://www.biostars.org/p/105291/
SamTools rmdup 'only' compares two reads on chrom and pos (which could be wrong if the two reads come from two different libraries) and **removes** reads from the BAM: information is lost.

Picard sets the SAM flag 1024 but does not delete the reads. Two pairs of reads are compared, as far as I know, using the chrom, the pos, the group ID (sample...) plus flowcell, lane, X, Y for optical dups (and the CIGAR string?).
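
Because the reads are only marked, downstream code has to test the flag itself to skip them. A minimal Python sketch of that bit test, assuming the flag has already been parsed as an integer from the second column of a SAM record; the function name is made up for illustration.

```python
DUPLICATE_FLAG = 0x400  # 1024, the SAM "PCR or optical duplicate" bit


def is_marked_duplicate(flag):
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


# 99 = paired (1) + proper pair (2) + mate reverse strand (32) + first in pair (64)
print(is_marked_duplicate(99))         # prints: False
print(is_marked_duplicate(99 + 1024))  # prints: True
```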



Picard MarkDuplicates vs samtools rmdup for variant calling with GATK
https://gatkforums.broadinstitute.org/gatk/discussion/6793/picard-markduplicates-vs-samtools-rmdup-for-variant-calling-with-gatk 
 - this post explains Picard's duplicate marking tools





NovaSeq : 2-color chemistry and Base quality problem

Because NovaSeq uses 2-color chemistry, its base quality can be worse than that of HiSeq, which uses 4-color chemistry.

 

On NovaSeq Base Quality 

http://lh3.github.io/2017/07/24/on-nonvaseq-base-quality

 

Illumina 2 colour chemistry can overcall high confidence G bases

https://sequencing.qcfail.com/articles/illumina-2-colour-chemistry-can-overcall-high-confidence-g-bases/
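
As background for why G in particular gets overcalled: in Illumina's 2-channel chemistry each base is encoded by the presence or absence of signal in two images, and G is the base with no signal in either, so a cluster that has gone dark still looks like G. A minimal Python sketch of that encoding follows; the function name and the boolean inputs are invented for illustration, and real base calling works on intensities rather than booleans.

```python
def call_base(red, green):
    """Map the presence of signal in the two channels to a base."""
    if red and green:
        return "A"
    if red:
        return "C"
    if green:
        return "T"
    return "G"  # no signal in either channel


# A cluster that stops emitting signal reads as a run of G
dark_cycles = [(False, False)] * 5
print("".join(call_base(r, g) for r, g in dark_cycles))  # prints: GGGGG
```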



WGS from NovaSeq compared to HiSeq



https://www.reddit.com/r/bioinformatics/comments/93eqjm/wgs_from_novaseq_compared_to_hiseq/ 

 


A first look at Illumina’s new NextSeq 500 (NextSeq uses 2-color chemistry)


http://seqanswers.com/forums/showthread.php?t=40741 

Nov 27, 2018

HiSeq, MiSeq, NovaSeq : Read length, # of clusters per lane, Bases per lane

https://medicine.yale.edu/keck/ycga/sequencing/illumina/hiseq.aspx

Sequencer               Read length  Clusters per lane (millions)  Bases per lane (Gbp)
HiSeq 2500 Rapid        1x75         150                           11.25
HiSeq 2500 Rapid        2x75         150                           22.5
HiSeq 2500 Rapid        2x150        150                           45
HiSeq 2500 High-output  1x75         200                           15
HiSeq 2500 High-output  2x75         200                           30
HiSeq 4000              2x100        300                           60
HiSeq 4000              2x150        300                           90
NovaSeq S2              2x100        1650                          330
NovaSeq S2              2x150        1650                          500
NovaSeq S4              2x150        2000                          600
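
The last column of the table above follows from the other two: bases per lane = clusters × read length × reads per cluster (1 for single-end, 2 for paired-end). A small Python sketch of that arithmetic, with a function name chosen only for illustration:

```python
def bases_per_lane_gbp(clusters_millions, read_length, reads_per_cluster):
    """Yield in Gbp from clusters (in millions) and read layout."""
    # millions of clusters x bases per cluster gives Mbp; divide by 1000 for Gbp
    return clusters_millions * read_length * reads_per_cluster / 1000


# (sequencer, reads per cluster, read length, clusters per lane in millions)
rows = [
    ("HiSeq 2500 Rapid", 1, 75, 150),
    ("HiSeq 2500 Rapid", 2, 150, 150),
    ("HiSeq 4000", 2, 150, 300),
    ("NovaSeq S2", 2, 150, 1650),
    ("NovaSeq S4", 2, 150, 2000),
]

for name, n_reads, length, clusters in rows:
    gbp = bases_per_lane_gbp(clusters, length, n_reads)
    print(f"{name} {n_reads}x{length}: {gbp} Gbp")
```

For NovaSeq S2 at 2x150 this gives 495 Gbp, so the 500 listed in the table is presumably a rounded figure.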

https://www.fasteris.com/dna/?q=node/41

Illumina HiSeq Services:

Service type                                Run (mode)  Number of pass filter clusters (1)                Yield (1)  Sequencing run time (2)
HiSeq 4000 (3), 1x lane                     1x50 bp     Up to 350 million, average yield 280-300 mio (*)  17.5 Gb    1 - 4 days
                                            1x150 bp                                                      52.5 Gb
                                            2x75 bp                                                       52.5 Gb
                                            2x150 bp                                                      105 Gb
HiSeq 2500, v4 (4), 1x lane                 2x125 bp    Up to 280 million, average yield 240-260 mio (*)  70 Gb      8 days
HiSeq 2500 Rapid Run, v2, 1x flow cell run  1x50 bp     Up to 300 million, average yield 240-260 mio (*)  15 Gb      1 - 3 days
                                            1x125 bp                                                      37.5 Gb
                                            2x50 bp                                                       30 Gb
                                            2x125 bp                                                      75 Gb
                                            2x250 bp                                                      150 Gb
                                            2x300 bp                                                      180 Gb
1x test run (done on MiSeq Nano) (5)        1x50 bp     500'000 to 1 million reads                        50 Mb      + 4 days
1x test run (done on MiSeq Nano) (5)        1x150 bp    500'000 to 1 million reads                        150 Mb     + 4 days
(1) All the calculations have been made with specifications and typical Fasteris results when using optimized loading conditions. Individual results may vary.
(2) This time is the "pure" running time of the system and includes time for washing. Not included is the time for doing the entry QC, library preparation, collection of other lane projects and/or the time for de-multiplexing of the derived data files.
(3) On HiSeq 4000, we typically can reach 300-320 million passed-filter DNA clusters, up to 350 million or more pass filter DNA clusters per lane.
(4) On HiSeq 2500 we typically can reach 250-280 million passed-filter DNA clusters per lane. For libraries with a test run done, we guarantee a minimum of 250 million pass filter DNA clusters per lane.
(5) A test sequencing run is an extra service step needed to be ordered when min guaranteed data output and min guaranteed data yield is needed, especially for sequencing of customer-provided ready-to-run (RTR) libraries.
(*) Average yields are given by statistic analysis of our runs; not included are biased or non-standard samples.
Conditions and Details are subject to changes without notice.


https://www.illumina.com/systems/sequencing-platforms/miseq/specifications.html

Cluster Generation and Sequencing

Kit                         Read length  Total time*  Output
MiSeq Reagent Kit v2        1 × 36 bp    ~4 hrs       540–610 Mb
MiSeq Reagent Kit v2        2 × 25 bp    ~5.5 hrs     750–850 Mb
MiSeq Reagent Kit v2        2 × 150 bp   ~24 hrs      4.5–5.1 Gb
MiSeq Reagent Kit v2        2 × 250 bp   ~39 hrs      7.5–8.5 Gb
MiSeq Reagent Kit v3        2 × 75 bp    ~21 hrs      3.3–3.8 Gb
MiSeq Reagent Kit v3        2 × 300 bp   ~56 hrs      13.2–15 Gb
MiSeq Reagent Kit v2 Micro  2 × 150 bp   ~19 hrs      1.2 Gb
MiSeq Reagent Kit v2 Nano   2 × 250 bp   ~28 hrs      500 Mb
MiSeq Reagent Kit v2 Nano   2 × 150 bp   ~17 hrs      300 Mb
* Total time includes cluster generation, sequencing, and base calling on a MiSeq System enabled with dual-surface scanning.

Reads Passing Filter**


Kit                         Single reads   Paired-end reads
MiSeq Reagent Kit v2        12–15 million  24–30 million
MiSeq Reagent Kit v3        22–25 million  44–50 million
MiSeq Reagent Kit v2 Micro  4 million      8 million
MiSeq Reagent Kit v2 Nano   1 million      2 million

** Install specifications based on Illumina PhiX control library at supported cluster densities (865-965 k/mm² clusters passing filter for v2 chemistry and 1200-1400 k/mm² clusters passing filter for v3 chemistry). Actual performance parameters may vary based on sample type, sample quality, and clusters passing filter.

 



Nov 22, 2018

Mount disks with HFS+ volumes (used in Mac OS) in CentOS

http://opensysblog.directorioc.net/2015/07/centos-mount-disks-with-hfs-volumes.html
https://www.centos.org/forums/viewtopic.php?t=67360


https://askubuntu.com/questions/332315/how-to-read-and-write-hfs-journaled-external-hdd-in-ubuntu-without-access-to-os

$ sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
$ sudo rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
$ sudo yum install kmod-hfsplus
$ sudo yum install kmod-hfs 
$ sudo yum install hfsplus-tools

After installation, you might need to reboot.
 
By default, the partition will be mounted in read-only mode.
Use the force option to enable read-write mode.

$ sudo mount -t hfsplus -o force -o rw /dev/sdc2 /media/disk1/