Efficient Test and Visualization of Multi-Set Intersections
Scientific Reports
volume 5, Article number: 16923 (2015)
https://www.nature.com/articles/srep16923
https://www.biostars.org/p/388584/
TMM is a method for normalizing the library sizes rather than a method for normalizing read counts. As the edgeR User's Guide says (page 15):
normalization in edgeR is model-based, and the original read counts are not themselves transformed.
Which way around is your question? Do you have TPMs and want to compute TMM factors or do you have TMM factors and want to compute TPMs?
If you are asking the first question, then no, TMM factors can only be computed from the raw counts, not from quantities such as TPMs or CPMs from which the library sizes have already been divided out. If you already have TPMs from some software package, then normalization has almost certainly already been applied, so I would be very wary about trying to re-normalize them unless you really know what you're doing.
If you are asking the second question then, yes, TMM factors can in principle be used to compute TPMs. In edgeR, any downstream quantity that is computed from the library sizes will incorporate the TMM factors automatically, because the factors are considered part of the effective library sizes. TMM normalization factors will be applied automatically when you use
CPM <- cpm(dge)
or
RPKM <- rpkm(dge)
in edgeR to compute CPMs or RPKMs from a DGEList object. I don't necessarily recommend TPM values myself, but if you go on to compute TPMs by
TPM <- t( t(RPKM) / colSums(RPKM) ) * 1e6
then the TMM factors will naturally have been incorporated into the computation.
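As a numerical check of that last step, the RPKM-to-TPM conversion above is just a column-wise renormalization so that every sample sums to one million. A minimal NumPy sketch mirroring the R one-liner (the matrix here is made-up toy data, not edgeR output):

```python
import numpy as np

# Toy RPKM matrix: rows = genes, columns = samples (invented values).
rpkm = np.array([
    [10.0, 20.0],
    [30.0, 15.0],
    [60.0, 65.0],
])

# TPM rescales each sample (column) so its values sum to 1e6,
# matching the R expression: t(t(RPKM) / colSums(RPKM)) * 1e6
tpm = rpkm / rpkm.sum(axis=0, keepdims=True) * 1e6

# Every column of a TPM matrix sums to one million by construction.
print(tpm.sum(axis=0))  # -> [1000000. 1000000.]
```

Because the TMM factors were already folded into the effective library sizes when `rpkm()` was called, they carry through this rescaling automatically.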
http://seqanswers.com/forums/showthread.php?t=13902
The bias will affect estimates of absolute expression, but once you calculate a fold change for a gene by comparing several samples, it should cancel out.

This holds if the patterns are the same in all samples. If they are not, you might get better results when adjusting for it. This is at least what Hansen et al. claim in their follow-up paper, a preprint of which you can find here: http://www.bepress.com/jhubiostat/paper227/
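The cancellation argument can be made concrete with toy numbers: if a gene carries the same multiplicative bias b in every sample, b divides out of the between-sample fold change. A sketch with invented values:

```python
# Suppose a gene's true expression is 100 in condition A and 400 in
# condition B, and a gene-specific bias (e.g. hexamer-priming or GC bias)
# inflates its measured signal by the same factor b in both samples.
b = 1.7
true_a, true_b = 100.0, 400.0
observed_a = b * true_a  # 170.0 -- absolute estimate is off by b
observed_b = b * true_b  # 680.0 -- absolute estimate is off by b

# The fold change, however, is unaffected: b cancels in the ratio.
fold_change = observed_b / observed_a
print(fold_change)  # -> 4.0, the true ratio
```

If the bias differs between samples (the caveat in the quote above), the two factors no longer cancel and the fold change is distorted, which is the case Hansen et al. address.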
Good to read: covers broad issues in RNA-seq
https://www.labome.com/method/RNA-seq.html
Currently, all commercially available RNA-seq platforms rely on reverse transcription and PCR amplification prior to sequencing and sequencing is therefore subject to the biases inherent to these procedures. First, annealing of random hexamer primers to fragmented RNA is not random, which results in depletion of reads at both 5’ and 3’ ends [3-6]
Figure 2. Sequence logo showing observed and expected nucleotide distribution surrounding the 5’ fragmentation site. Similar biases are present at the 3’ end. Image: Roberts et al. [3] (image released under a Creative Commons Attribution License).
Figure 3. Read coverage over genes is biased against 3’ and 5’ extremities. Fragmentation was done by either RNA hydrolysis or cDNA shearing, and distribution of reads plotted for small (< 1 kb; top), medium (1-8 kb; middle) and large (> 8 kb; bottom) transcripts. Image modified from Huang et al. [4].
This makes identifying the true start and end of novel transcripts a challenge, and leads to underestimating the expression levels of short genes. Second, PCR can introduce bias based on GC content and length due to non-linear amplification [7, 8]. A number of data analysis tools to correct these biases are available, although they achieve varying degrees of success [6, 9, 10].
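To make the GC-bias correction idea concrete, one common family of approaches bins genes by GC content and rescales counts so each bin has a comparable typical count (conceptually similar to within-lane normalization in tools such as EDASeq). The sketch below uses invented data and a deliberately simplistic median-ratio rescaling, not any published tool's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: per-gene GC fraction, and raw counts with a built-in
# GC effect (higher GC -> systematically more reads).
gc = rng.uniform(0.3, 0.7, size=1000)
counts = rng.poisson(lam=100 * (1 + 2 * (gc - 0.3)))

# Bin genes by GC content and rescale each bin to the global median count.
bins = np.digitize(gc, np.linspace(0.3, 0.7, 5))
global_median = np.median(counts)
corrected = counts.astype(float)
for b in np.unique(bins):
    mask = bins == b
    corrected[mask] *= global_median / np.median(counts[mask])

# After correction, each GC bin has the same median count, so the
# count-vs-GC trend is flattened out.
```

Real tools fit smoother models (e.g. loess curves over GC) rather than hard bins, but the underlying idea is the same rescaling shown here.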
https://github.com/biod/sambamba
Sambamba: a high-performance, highly parallel, robust and fast tool for working with SAM and BAM files.
Functions: view, index, sort, markdup, and depth.
Flagstat: ~1.4x faster than samtools.
Index: similar speed.
Markdup: ~6x faster.
View: ~4x faster.
Sort: sambamba has been beaten here, though it is notably up to 2x faster than samtools on large-RAM machines (120 GB+).
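For reference, a typical multi-threaded run of the functions listed above might look like the following. File names and the thread count are placeholders, and while these flags match the sambamba documentation, it is worth checking `sambamba <subcommand> --help` for your installed version:

```shell
#!/usr/bin/env sh
# Hypothetical input file; -t sets the number of worker threads.
THREADS=8
IN=aln.bam

sambamba sort -t "$THREADS" -o aln.sorted.bam "$IN"           # coordinate sort
sambamba index -t "$THREADS" aln.sorted.bam                   # build the .bai index
sambamba markdup -t "$THREADS" aln.sorted.bam aln.dedup.bam   # mark duplicates
sambamba flagstat -t "$THREADS" aln.dedup.bam                 # quick QC summary
sambamba view -t "$THREADS" -f bam \
    -F "mapping_quality >= 30" \
    -o aln.q30.bam aln.dedup.bam                              # filter via sambamba's expression syntax
```

The `-F` filter expression language (e.g. `mapping_quality >= 30`) is a sambamba-specific feature that replaces some awk/samtools filtering pipelines with a single command.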