
Apr 8, 2019

Statistical significance of the overlap between two groups of genes: hypergeometric distribution


Statistical significance of the overlap between two groups of genes

http://nemates.org/MA/progs/overlap_stats.html


How do I calculate if the degree of overlap between two lists is significant?

https://stats.stackexchange.com/questions/267/how-do-i-calculate-if-the-degree-of-overlap-between-two-lists-is-significant



If I understand your question correctly, you need to use the hypergeometric distribution. This distribution is usually associated with urn models, i.e. there are n balls in an urn, y of them are painted red, and you draw m balls from the urn without replacement. If X is the number of red balls in your sample of m, then X has a hypergeometric distribution.
For your specific example, let $n_A$, $n_B$ and $n_C$ denote the lengths of your three lists and let $n_{AB}$ denote the overlap between A and B. Then

$n_{AB} \sim \mathrm{HG}(n_A, n_C, n_B)$
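
As a rough sketch of what this distribution says, the probability of an overlap of exactly k genes can be written with binomial coefficients and compared against R's built-in dhyper. The helper name overlap_pmf and the value k = 50 are assumptions made for illustration only, and n_C is treated here as the size of the full gene pool from which both lists are drawn.

```r
# Hypothetical helper (not from the original answer): probability that exactly
# k of the n_A genes in list A also appear in list B, when both lists are
# drawn from a pool of n_C genes.
overlap_pmf <- function(k, n_A, n_B, n_C) {
  choose(n_B, k) * choose(n_C - n_B, n_A - k) / choose(n_C, n_A)
}

# Same example values as used elsewhere in this post
n_A <- 100; n_B <- 200; n_C <- 500; k <- 50

manual  <- overlap_pmf(k, n_A, n_B, n_C)
builtin <- dhyper(k, n_B, n_C - n_B, n_A)

# The two values should agree up to floating-point error
all.equal(manual, builtin)
```

In dhyper, the arguments are the observed count, the number of "red" items (n_B), the number of other items (n_C - n_B) and the number of draws (n_A).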

To calculate a p-value, you could use this R command:
#Some example values
n_A = 100;n_B = 200; n_C = 500; n_A_B = 50
1-phyper(n_A_B, n_B, n_C-n_B, n_A)
[1] 0.008626697
A word of caution: remember multiple testing. If you have lots of A and B lists, you will need to adjust your p-values with a correction, for example the FDR or Bonferroni corrections.
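
A minimal sketch of such a correction, using R's p.adjust. The raw p-values below are made up for illustration and do not come from any real gene lists.

```r
# Hypothetical raw p-values, one per pair of lists tested
p_raw <- c(0.0005, 0.0086, 0.02, 0.04, 0.30)

# Bonferroni controls the family-wise error rate
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Benjamini-Hochberg controls the false discovery rate (FDR)
p_fdr <- p.adjust(p_raw, method = "BH")

data.frame(p_raw, p_bonf, p_fdr)
```

Bonferroni is the more conservative of the two, so its adjusted p-values are never smaller than the Benjamini-Hochberg ones.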

csgillespie's answer seems correct except for one thing: it gives the probability of seeing strictly more than n_A_B genes in the overlap, P(x > n_A_B), but I think the OP wants the p-value P(x >= n_A_B). You could get the latter with
n_A = 100;n_B = 200; n_C = 500; n_A_B = 50
phyper(n_A_B - 1, n_A, n_C-n_A, n_B, lower.tail = FALSE) 
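
The difference between the two tails can be checked directly. The sketch below uses the same example values. The variable names p_gt and p_geq are introduced here for illustration.

```r
n_A <- 100; n_B <- 200; n_C <- 500; n_A_B <- 50

# P(X > n_A_B): strictly more than the observed overlap
p_gt <- phyper(n_A_B, n_A, n_C - n_A, n_B, lower.tail = FALSE)

# P(X >= n_A_B): the observed overlap or more
p_geq <- phyper(n_A_B - 1, n_A, n_C - n_A, n_B, lower.tail = FALSE)

# The gap between the two is the probability of exactly n_A_B overlapping genes
all.equal(p_geq - p_gt, dhyper(n_A_B, n_A, n_C - n_A, n_B))

# Swapping the roles of A and B leaves the p-value unchanged
all.equal(p_geq, phyper(n_A_B - 1, n_B, n_C - n_B, n_A, lower.tail = FALSE))
```

Because the hypergeometric distribution is symmetric in the two list sizes, it does not matter which list is treated as the "red balls" and which as the sample.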
