Efficient Test and Visualization of Multi-Set Intersections
Scientific Reports
volume 5, Article number: 16923 (2015)
https://www.nature.com/articles/srep16923
https://www.biostars.org/p/388584/
TMM is a method for normalizing the library sizes rather than a method for normalizing read counts. As the edgeR User's Guide says (page 15):
normalization in edgeR is model-based, and the original read counts are not themselves transformed.
Which way around is your question? Do you have TPMs and want to compute TMM factors or do you have TMM factors and want to compute TPMs?
If you are asking the first question, then no, TMM factors can only be computed from the raw counts, not from quantities such as TPMs or CPMs from which the library sizes have already been divided out. If you already have TPMs from some software package, then normalization has almost certainly already been applied, so I would be very wary about trying to re-normalize them unless you really know what you're doing.
If you are asking the second question then, yes, TMM factors can in principle be used to compute TPMs. In edgeR, any downstream quantity that is computed from the library sizes will incorporate the TMM factors automatically, because the factors are considered part of the effective library sizes. TMM normalization factors will be applied automatically when you use
CPM <- cpm(dge)
or
RPKM <- rpkm(dge)
in edgeR to compute CPMs or RPKMs from a DGEList object. I don't necessarily recommend TPM values myself, but if you go on to compute TPMs by
TPM <- t( t(RPKM) / colSums(RPKM) ) * 1e6
then the TMM factors will naturally have been incorporated into the computation.
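As a numerical check of that last step, the RPKM-to-TPM conversion above is just a column-wise renormalization so that every sample sums to one million. A minimal NumPy sketch mirroring the R one-liner (the matrix here is made-up toy data, not edgeR output):

```python
import numpy as np

# Toy RPKM matrix: rows = genes, columns = samples (invented values).
rpkm = np.array([
    [10.0, 20.0],
    [30.0, 15.0],
    [60.0, 65.0],
])

# TPM rescales each sample (column) so its values sum to 1e6,
# matching the R expression: t(t(RPKM) / colSums(RPKM)) * 1e6
tpm = rpkm / rpkm.sum(axis=0, keepdims=True) * 1e6

# Every column of a TPM matrix sums to one million by construction.
print(tpm.sum(axis=0))  # -> [1000000. 1000000.]
```

Because the TMM factors were already folded into the effective library sizes when `rpkm()` was called, they carry through this rescaling automatically.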
http://seqanswers.com/forums/showthread.php?t=13902
The bias will affect estimates of absolute expression, but once you calculate a fold change for a gene by comparing several samples, it should cancel out.

This holds if the patterns are the same in all samples. If they are not, you might get better results when adjusting for it. This is at least what Hansen et al. claim in their follow-up paper, a preprint of which you can find here: http://www.bepress.com/jhubiostat/paper227/
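The cancellation argument can be made concrete with toy numbers: if a gene carries the same multiplicative bias b in every sample, b divides out of the between-sample fold change. A sketch with invented values:

```python
# Suppose a gene's true expression is 100 in condition A and 400 in
# condition B, and a gene-specific bias (e.g. hexamer-priming or GC bias)
# inflates its measured signal by the same factor b in both samples.
b = 1.7
true_a, true_b = 100.0, 400.0
observed_a = b * true_a  # 170.0 -- absolute estimate is off by b
observed_b = b * true_b  # 680.0 -- absolute estimate is off by b

# The fold change, however, is unaffected: b cancels in the ratio.
fold_change = observed_b / observed_a
print(fold_change)  # -> 4.0, the true ratio
```

If the bias differs between samples (the caveat in the quote above), the two factors no longer cancel and the fold change is distorted, which is the case Hansen et al. address.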
Good to read: covers broad issues in RNA-seq
https://www.labome.com/method/RNA-seq.html
Currently, all commercially available RNA-seq platforms rely on reverse transcription and PCR amplification prior to sequencing and sequencing is therefore subject to the biases inherent to these procedures. First, annealing of random hexamer primers to fragmented RNA is not random, which results in depletion of reads at both 5’ and 3’ ends [3-6]
Figure 2. Sequence logo showing observed and expected nucleotide distribution surrounding the 5’ fragmentation site. Similar biases are present at the 3’ end. Image: Roberts et al. [3] (image released under a Creative Commons Attribution License).
Figure 3. Read coverage over genes is biased against 3’ and 5’ extremities. Fragmentation was done by either RNA hydrolysis or cDNA shearing, and distribution of reads plotted for small (< 1 kb; top), medium (1-8 kb; middle) and large (> 8 kb; bottom) transcripts. Image modified from Huang et al. [4].
This makes identifying the true start and end of novel transcripts a challenge, and leads to underestimating the expression levels of short genes. Second, PCR can introduce bias based on GC content and length due to non-linear amplification [7, 8]. A number of data analysis tools to correct these biases are available, although they achieve varying degrees of success [6, 9, 10].
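To make the GC-bias correction idea concrete, one common family of approaches bins genes by GC content and rescales counts so each bin has a comparable typical count (conceptually similar to within-lane normalization in tools such as EDASeq). The sketch below uses invented data and a deliberately simplistic median-ratio rescaling, not any published tool's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: per-gene GC fraction, and raw counts with a built-in
# GC effect (higher GC -> systematically more reads).
gc = rng.uniform(0.3, 0.7, size=1000)
counts = rng.poisson(lam=100 * (1 + 2 * (gc - 0.3)))

# Bin genes by GC content and rescale each bin to the global median count.
bins = np.digitize(gc, np.linspace(0.3, 0.7, 5))
global_median = np.median(counts)
corrected = counts.astype(float)
for b in np.unique(bins):
    mask = bins == b
    corrected[mask] *= global_median / np.median(counts[mask])

# After correction, each GC bin has the same median count, so the
# count-vs-GC trend is flattened out.
```

Real tools fit smoother models (e.g. loess curves over GC) rather than hard bins, but the underlying idea is the same rescaling shown here.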
https://github.com/biod/sambamba
Sambamba: a high-performance, highly parallel, robust and fast tool for working with SAM and BAM files.
Functions: view, index, sort, markdup, and depth.
Flagstat: ~1.4x faster than samtools.
Index: similar speed.
Markdup: ~6x faster.
View: ~4x faster.
Sort: sambamba has been beaten here, though it is notably up to 2x faster than samtools on large-RAM machines (120 GB+).
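For reference, a typical multi-threaded run of the functions listed above might look like the following. File names and the thread count are placeholders, and while these flags match the sambamba documentation, it is worth checking `sambamba <subcommand> --help` for your installed version:

```shell
#!/usr/bin/env sh
# Hypothetical input file; -t sets the number of worker threads.
THREADS=8
IN=aln.bam

sambamba sort -t "$THREADS" -o aln.sorted.bam "$IN"           # coordinate sort
sambamba index -t "$THREADS" aln.sorted.bam                   # build the .bai index
sambamba markdup -t "$THREADS" aln.sorted.bam aln.dedup.bam   # mark duplicates
sambamba flagstat -t "$THREADS" aln.dedup.bam                 # quick QC summary
sambamba view -t "$THREADS" -f bam \
    -F "mapping_quality >= 30" \
    -o aln.q30.bam aln.dedup.bam                              # filter via sambamba's expression syntax
```

The `-F` filter expression language (e.g. `mapping_quality >= 30`) is a sambamba-specific feature that replaces some awk/samtools filtering pipelines with a single command.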