Statistical significance of the overlap between two groups of genes
http://nemates.org/MA/progs/overlap_stats.html
How do I calculate if the degree of overlap between two lists is significant?
https://stats.stackexchange.com/questions/267/how-do-i-calculate-if-the-degree-of-overlap-between-two-lists-is-significant
If I understand your question correctly, you need to use the hypergeometric distribution. This distribution is usually associated with urn models, i.e., there are n balls in an urn, y are painted red, and you draw m balls from the urn without replacement. Then if X is the number of balls in your sample of m that are red, X has a hypergeometric distribution.
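Written out in the urn notation above (n balls, y red, m drawn), the probability of exactly k red balls and the upper-tail probability used as a p-value for "at least k" can be sketched as follows. The index j is only a summation variable introduced here for illustration.

$$P(X = k) = \frac{\binom{y}{k}\binom{n-y}{m-k}}{\binom{n}{m}}, \qquad \max(0,\, m + y - n) \le k \le \min(y,\, m)$$

$$P(X \ge k) = \sum_{j=k}^{\min(y,\, m)} P(X = j)$$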
For your specific example, let $n_A$, $n_B$ and $n_C$ denote the lengths of your three lists and let $n_{AB}$ denote the overlap between A and B. Then
$n_{AB} \sim HG(n_A, n_C, n_B)$
To calculate a p-value, you could use this R command:
# Some example values
n_A = 100; n_B = 200; n_C = 500; n_A_B = 50
# Upper tail P(X > n_A_B): n_B marked items, n_C - n_B unmarked items, n_A draws
1 - phyper(n_A_B, n_B, n_C - n_B, n_A)
[1] 0.008626697
A word of caution: remember multiple testing. If you have many A and B lists, you will need to adjust your p-values with a correction, for example the FDR or Bonferroni corrections.
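As a rough illustration of that adjustment, base R's `p.adjust` can apply both corrections to a vector of p-values. The values in `p_raw` below are made up for the example and do not come from any real list comparison.

```r
# Hypothetical raw p-values from several overlap tests (illustrative numbers only)
p_raw <- c(0.0086, 0.04, 0.20, 0.001, 0.03)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Benjamini-Hochberg: controls the false discovery rate (FDR)
p_fdr <- p.adjust(p_raw, method = "BH")

print(data.frame(p_raw, p_bonf, p_fdr))
```

Bonferroni is the more conservative of the two, so its adjusted values are never smaller than the FDR-adjusted ones.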
csgillespie's answer seems correct except for one thing: it gives the probability of seeing strictly more than n_A_B items in the overlap, P(X > n_A_B), but I think the OP wants the p-value P(X >= n_A_B). You could get the latter with:
n_A = 100; n_B = 200; n_C = 500; n_A_B = 50
phyper(n_A_B - 1, n_A, n_C - n_A, n_B, lower.tail = FALSE)
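One way to sanity-check the P(X >= n_A_B) version is to compute it three other ways and compare. This is a minimal sketch using the same example values; the variable names `p_tail`, `p_sum`, `p_swapped`, `tab` and `p_fisher` are introduced here for illustration only.

```r
# Example values taken from the answers above
n_A <- 100; n_B <- 200; n_C <- 500; n_A_B <- 50

# P(X >= n_A_B) from the upper tail of the hypergeometric CDF
p_tail <- phyper(n_A_B - 1, n_A, n_C - n_A, n_B, lower.tail = FALSE)

# Same probability by summing point masses up to the largest possible overlap
k <- n_A_B:min(n_A, n_B)
p_sum <- sum(dhyper(k, n_A, n_C - n_A, n_B))

# Swapping the roles of A and B should give the same value
p_swapped <- phyper(n_A_B - 1, n_B, n_C - n_B, n_A, lower.tail = FALSE)

# One-sided Fisher's exact test on the 2x2 membership table
tab <- matrix(c(n_A_B,       n_A - n_A_B,
                n_B - n_A_B, n_C - n_A - n_B + n_A_B), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(isTRUE(all.equal(p_tail, p_sum)))
print(isTRUE(all.equal(p_tail, p_swapped)))
print(isTRUE(all.equal(p_tail, p_fisher)))
```

All three comparisons should agree, since the one-sided Fisher's exact test is the same hypergeometric tail and the distribution is symmetric in which list plays the role of the "draw".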