
Apr 8, 2019

Statistical significance of the overlap between two groups of genes: hypergeometric distribution


Statistical significance of the overlap between two groups of genes

http://nemates.org/MA/progs/overlap_stats.html


How do I calculate if the degree of overlap between two lists is significant?

https://stats.stackexchange.com/questions/267/how-do-i-calculate-if-the-degree-of-overlap-between-two-lists-is-significant



If I understand your question correctly, you need to use the hypergeometric distribution. This distribution is usually associated with urn models, i.e., there are n balls in an urn, y of them are painted red, and you draw m balls from the urn. If X is the number of red balls in your sample of m, then X has a hypergeometric distribution.
For your specific example, let $n_A$, $n_B$ and $n_C$ denote the lengths of your three lists and let $n_{AB}$ denote the overlap between A and B. Then

$n_{AB} \sim \mathrm{HG}(n_A, n_C, n_B)$

To calculate a p-value, you could use this R command:
#Some example values
n_A = 100;n_B = 200; n_C = 500; n_A_B = 50
1-phyper(n_A_B, n_B, n_C-n_B, n_A)
[1] 0.008626697
Word of caution: remember multiple testing, i.e., if you have lots of A and B lists, you will need to adjust your p-values with a correction, for example the FDR or Bonferroni corrections.
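
The multiple-testing caution above can be illustrated with a minimal Python sketch of the Bonferroni and Benjamini-Hochberg (FDR) adjustments. The function names and the p-values in `raw` are hypothetical, chosen only for illustration; in R the same adjustments are available through `p.adjust`.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values from four separate overlap tests.
raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
print(bonf)
print(bh)
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg controls the expected fraction of false positives among the overlaps called significant.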

csgillespie's answer seems correct except for one thing: it gives the probability of seeing strictly more than n_A_B in the overlap, P(x > n_A_B), but I think the OP wants the p-value P(x >= n_A_B). You could get the latter with
n_A = 100;n_B = 200; n_C = 500; n_A_B = 50
phyper(n_A_B - 1, n_A, n_C-n_A, n_B, lower.tail = FALSE) 
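
The difference between the two tail probabilities discussed above, P(x > n_A_B) and P(x >= n_A_B), can be checked without R. The following is a minimal sketch using only the Python standard library; `hypergeom_upper_tail` is a hypothetical helper name, and the parameter order (population, successes, draws) is a choice made here, not that of `phyper`.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """Return P(X >= k) for a hypergeometric X (draws without replacement)."""
    failures = population - successes
    lo = max(k, 0, draws - failures)  # smallest overlap that is possible and >= k
    hi = min(draws, successes)        # largest possible overlap
    favourable = sum(
        comb(successes, i) * comb(failures, draws - i) for i in range(lo, hi + 1)
    )
    return favourable / comb(population, draws)


# Example values from the answers above.
n_A, n_B, n_C, n_A_B = 100, 200, 500, 50

# P(X > 50): corresponds to 1 - phyper(n_A_B, n_B, n_C - n_B, n_A)
p_strict = hypergeom_upper_tail(n_A_B + 1, n_C, n_B, n_A)
# P(X >= 50): corresponds to phyper(n_A_B - 1, ..., lower.tail = FALSE)
p_inclusive = hypergeom_upper_tail(n_A_B, n_C, n_B, n_A)

print(p_strict, p_inclusive)
```

Swapping the roles of A and B (which list is "the urn's red balls" and which is "the sample") gives the same probability, which is why the two R calls can pass n_A and n_B in different positions.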

samtools rmdup vs Picard MarkDuplicates

samtools rmdup PE
http://seqanswers.com/forums/showthread.php?t=5959

If you have one pair of reads where read 1 starts at position 100, and the other end starts at position 200, and a second pair of reads where read 1 starts at position 100, and read 2 starts at position 250, those came from different fragments of DNA. You can tell because the read 2 start is different, even though the read 1 start is the same.

When treating the reads as paired end, none of those reads should be deleted as PCR duplicates.

However, if you run rmdup -S, the software does not check whether read 2 has a different start coordinate, so one of those read 1 reads is treated as a duplicate and deleted.
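
The example above can be reduced to a toy Python sketch. This is not samtools' actual implementation (which also has to consider chromosome and strand); it only shows how the choice of duplicate key changes the outcome. The function name and the tuple layout are assumptions made for illustration.

```python
def count_unique_fragments(pairs, paired_end=True):
    """Count distinct fragments among (read1_start, read2_start) pairs.

    paired_end=True  keys on both start coordinates (paired-end mode).
    paired_end=False keys on read 1 only, mimicking single-end treatment.
    """
    keys = set()
    for read1_start, read2_start in pairs:
        key = (read1_start, read2_start) if paired_end else (read1_start,)
        keys.add(key)
    return len(keys)


# The two pairs from the example: same read 1 start, different read 2 start.
pairs = [(100, 200), (100, 250)]

kept_pe = count_unique_fragments(pairs, paired_end=True)
kept_se = count_unique_fragments(pairs, paired_end=False)
print(kept_pe, kept_se)
```

With the paired-end key both pairs survive; with the read-1-only key they collapse into one, which is the loss described for rmdup -S.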


Question: Samtools Dedup Documentation
 https://www.biostars.org/p/55111/

rmdup for PE reads is pretty straightforward. It looks for identical external coordinates, meaning it only looks at the 5' start coordinates of the FR orientation pair-reads. Then it takes the pair with the highest mapping quality.
For SE reads, I've read that samtools also only looks for identical 5' start coordinates, not both start and end coordinates. I think the idea is that sequencing quality usually drops towards the 3' end, so after mapping, duplicate reads have a higher chance of mapping differently towards the 3' end. It therefore only looks at the adapter-trimmed 5' start for duplicates.


Question: Picard MarkDuplicates and SamTools rmdup algorithm documentation
https://www.biostars.org/p/105291/
SamTools rmdup 'only' compares two reads on chrom and pos (which could be wrong if the two reads come from two different libraries) and **removes** reads from the BAM: information is lost.

Picard sets the SAM flag 1024 but does not delete the reads. Two pairs of reads are compared, as far as I know, using the chrom, the pos, the group-id (sample...) + (flowcell, lane, X, Y for optical dups) (and the CIGAR string?).
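
The difference between marking and removing can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the logic of either tool: reads are plain dictionaries, duplicates are grouped by (chrom, pos) only, and the read kept per group is the one with the highest mapping quality (following the rmdup description quoted earlier). The only detail taken from the SAM format itself is the duplicate flag bit, 1024 (0x400).

```python
DUP_FLAG = 0x400  # 1024, the SAM flag bit for a PCR or optical duplicate


def mark_duplicates(reads):
    """Return copies of the reads with DUP_FLAG set on all but the best read per position."""
    best = {}  # (chrom, pos) -> index of the read with the highest MAPQ so far
    for i, read in enumerate(reads):
        key = (read["chrom"], read["pos"])
        if key not in best or read["mapq"] > reads[best[key]]["mapq"]:
            best[key] = i
    keep = set(best.values())
    return [
        dict(read) if i in keep else dict(read, flag=read["flag"] | DUP_FLAG)
        for i, read in enumerate(reads)
    ]


def remove_duplicates(reads):
    """Drop flagged reads entirely; the duplicate information is lost."""
    return [read for read in mark_duplicates(reads) if not read["flag"] & DUP_FLAG]


# Hypothetical reads: two at the same position, one elsewhere.
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "mapq": 30, "flag": 0},
    {"name": "r2", "chrom": "chr1", "pos": 100, "mapq": 60, "flag": 0},
    {"name": "r3", "chrom": "chr1", "pos": 300, "mapq": 60, "flag": 0},
]

marked = mark_duplicates(reads)
removed = remove_duplicates(reads)
print([(r["name"], r["flag"]) for r in marked])
print([r["name"] for r in removed])
```

Marking keeps all three records, so a downstream tool can still choose to ignore or include the flagged read; removing leaves only two records and that choice is gone.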



Picard MarkDuplicates vs samtools rmdup for variant calling with GATK
https://gatkforums.broadinstitute.org/gatk/discussion/6793/picard-markduplicates-vs-samtools-rmdup-for-variant-calling-with-gatk 
 - this post explains Picard's duplicate marking tools





NovaSeq : 2-color chemistry and Base quality problem

 Because NovaSeq uses 2-color chemistry, its base quality can be worse than that of HiSeq, which uses 4-color chemistry.

 

On NovaSeq Base Quality 

http://lh3.github.io/2017/07/24/on-nonvaseq-base-quality

 

Illumina 2 colour chemistry can overcall high confidence G bases

https://sequencing.qcfail.com/articles/illumina-2-colour-chemistry-can-overcall-high-confidence-g-bases/



WGS from NovaSeq compared to HiSeq



https://www.reddit.com/r/bioinformatics/comments/93eqjm/wgs_from_novaseq_compared_to_hiseq/ 

 


A first look at Illumina’s new NextSeq 500 (NextSeq uses 2-color chemistry)


http://seqanswers.com/forums/showthread.php?t=40741