page

May 31, 2019

Differential expression of FPKM from RNA-seq data using limma and voom()

Differential expression of FPKM from RNA-seq data using limma and voom()

https://support.bioconductor.org/p/56275/

Q.


I have a count matrix of FPKM values and I want to estimate differentially expressed genes between two conditions. First I used DESeq2, but I realized that this is not good for FPKM values. I then transformed the counts using voom() in the limma package and then used:

fit <- lmFit(myVoomData,design)
fit <- eBayes(fit)
options(digits=3)
writefile = topTable(fit,n=Inf,sort="none", p.value=0.01)
write.csv(writefile, file="file.csv")
My problem is that all of the 6156 genes are differentially expressed (p-value 0.01). Only a few hundred were differentially expressed using DESe2, but I guess that can't be trusted.
I am new to this type of analysis, and to R, but is it ok to simply transform the data by voom()? Can I use the transformed data in DESeq2? Any other ways I can use FPKM counts to estimate differentially expressed genes?

A.

No, it is absolutely not ok to input FPKM values to voom. The results will be nonsense. Same goes for edgeR and DESeq, for the reasons explained in the documentation for those packages.
Note that a matrix of FPKM is not a matrix of counts.
It seems to me that you have three options in decreasing order of desirability:
  1. Get the actual integer counts from which the FPKM were computed and do a proper analysis, for example using voom or edgeR.
  2. Get the gene lengths and library sizes used to compute the FPKM and convert the FPKM back to counts.
  3. If FPKM is really all you have, then convert the values to a log2 scale (y = log2(FPKM+0.1) say) and do an ordinary limma analysis as you would for microarray data, using eBayes() with trend=TRUE. Do not use voom, do not use edgeR, do not use DESeq. (Do not pass go and do not collect $200.) This isn't 100% ideal, but is probably the best analysis available.
The third option is similar to the "limma-trend" analysis described in the limma preprint, except that it is applied to the logFPKM instead of logCPM. Statistically this will not perform as well as it would applied to the logCPM.




What you're doing wrong is trying to do a DE analysis using FPKMs in the first place. There is no good answer to what the FPKM offset should be to avoid taking the log of zero, which is one of the problems with using FPKMs. I would personally use a larger offset than any you have used, certainly not less than 0.1, but then I wouldn't do it at all.
The way to choose a tuning parameter like this is to plot the data to see that it looks vaguely reasonable rather than to count the number of DE genes.
I assume you have filtered the data so that you are not trying to analyse genes for which all, or almost all, the FPKM are zero. Having lots of FPKM=0 would lead to the warnings about zero variances.
I also suggest you used eBayes with robust=TRUE as well as trend=TRUE. That will reduce the sensitivity to zero variances.

No comments:

Post a Comment