https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29

voom: precision weights unlock linear model analysis tools for RNA-seq read counts

Abstract
New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments. The voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline. This opens access for RNA-seq analysts to a large body of methodology developed for microarrays. Simulation studies show that voom performs as well or better than count-based RNA-seq methods even when the data are generated according to the assumptions of the earlier methods. Two case studies illustrate the use of linear modeling and gene set testing methods.
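The "log-counts" mentioned in the abstract are log2 counts per million. A minimal sketch of that transformation alone, assuming the offsets given in the voom paper (0.5 added to each count, 1 added to the library size); the function name log_cpm is made up for this note, and the mean-variance fit and precision weights are not shown:

```python
import math


def log_cpm(counts, library_size):
    """Return log2 counts-per-million for one sample.

    counts       -- read counts per gene for the sample
    library_size -- total number of mapped reads in the sample
    The 0.5 offset avoids taking the log of zero.
    """
    return [
        math.log2((r + 0.5) / (library_size + 1.0) * 1e6)
        for r in counts
    ]


# Hypothetical counts for three genes in one sample.
sample_counts = [0, 10, 250]
print(log_cpm(sample_counts, sum(sample_counts)))
```

In practice the whole procedure is run through limma in R; this only shows what scale the counts end up on.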

How do you read from stdin in Python?

https://stackoverflow.com/questions/1450393/how-do-you-read-from-stdin-in-python

Here's an example from Learning Python:

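import sys

# Read all of standard input into a list of lines.
data = sys.stdin.readlines()
# The parenthesized form with % formatting runs under both Python 2 and Python 3.
print("Counted %d lines." % len(data))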

On Unix, you could test it by doing something like:

% cat countlines.py | python countlines.py 
Counted 3 lines.

On Windows or DOS, you'd do:

C:\> type countlines.py | python countlines.py 
Counted 3 lines.

BED file handling software : bedtools, BEDOPS

bedtools: a powerful toolset for genome arithmetic

http://bedtools.readthedocs.io/en/latest/#

bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.
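To make "intersect two interval files" concrete, here is a small pure-Python sketch of the idea. It is not how bedtools is implemented (this version is a naive pairwise comparison), and the function name and the example intervals are invented for this note. Coordinates follow the BED convention: 0-based start, exclusive end.

```python
def intersect(a_intervals, b_intervals):
    """Return the overlapping portion of every overlapping pair of intervals.

    Each interval is a (chrom, start, end) tuple in BED coordinates.
    """
    overlaps = []
    for chrom_a, start_a, end_a in a_intervals:
        for chrom_b, start_b, end_b in b_intervals:
            if chrom_a != chrom_b:
                continue
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            # Half-open intervals share bases only when start < end.
            if start < end:
                overlaps.append((chrom_a, start, end))
    return overlaps


# Hypothetical intervals standing in for two BED files.
a = [("chr1", 100, 200), ("chr1", 300, 400)]
b = [("chr1", 150, 350), ("chr2", 100, 200)]
print(intersect(a, b))
```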

BEDOPS: the fast, highly scalable and easily-parallelizable genome analysis toolkit

https://bedops.readthedocs.io/en/latest/index.html

BEDOPS is an open-source command-line toolkit that performs highly efficient and scalable Boolean and other set operations, statistical calculations, archiving, conversion and other management of genomic data of arbitrary scale. Tasks can be easily split by chromosome for distributing whole-genome analyses across a computational cluster.
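The per-chromosome splitting idea can be sketched in plain Python. This only illustrates the concept and does not use BEDOPS itself; the function name is made up, and skipping "#", "track" and "browser" header lines is an assumption about what the input may contain.

```python
from collections import defaultdict


def split_by_chromosome(bed_lines):
    """Group BED lines by the chromosome named in their first column."""
    groups = defaultdict(list)
    for line in bed_lines:
        line = line.rstrip("\n")
        # Skip blank lines and common header lines (assumed, see note above).
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split("\t", 1)[0]
        groups[chrom].append(line)
    return dict(groups)


# Hypothetical BED records; each group could then go to a separate cluster job.
lines = ["chr1\t0\t10\n", "chr2\t5\t15\n", "chr1\t20\t30\n"]
for chrom, records in sorted(split_by_chromosome(lines).items()):
    print(chrom, len(records))
```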

Data conversion

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 157: ordinal not in range(128)

http://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/

https://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte

https://stackoverflow.com/questions/24358361/removing-u2018-and-u2019-character

>>> for row in query.rows():
...     output.write(str(row["primaryIdentifier"])+'\t'+str(row["symbol"])+'\t'+str(row["briefDescription"])+'\t'+str(row["isObsolete"])+'\t'+str(row["description"])+'\t'+str(row["curatorSummary"])+'\n')
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 157: ordinal not in range(128)

Solution

1) Change the default encoding of the whole script to UTF-8 (Python 2 only; the reload() builtin and sys.setdefaultencoding() are not available in Python 3):

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')

2) Replace the curly quotes (U+2018 and U+2019) with their ASCII equivalent:

>>> print u"\u2018Hi\u2019"
‘Hi’
>>> print u"\u2018Hi\u2019".replace(u"\u2018", "'").replace(u"\u2019", "'")
'Hi'

Alternatively with regex:

>>> import re
>>> s = u"\u2018Hi\u2019"
>>> print re.sub(u"(\u2018|\u2019)", "'", s)
'Hi'
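A third option that keeps the original characters is to open the output file with an explicit encoding, so the text is encoded as UTF-8 on write and str() is never called on it. A minimal sketch, assuming the row values are already unicode strings; the rows list below is a hypothetical stand-in for query.rows(), and the file path is a temporary one chosen for the example. io.open behaves the same way under Python 2 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical rows standing in for query.rows(); the description contains U+2019.
rows = [
    {"primaryIdentifier": u"ID1", "description": u"the gene\u2019s product"},
]

path = os.path.join(tempfile.mkdtemp(), "genes.tsv")

# io.open encodes unicode text as UTF-8 when writing, so no implicit
# ASCII conversion takes place.
with io.open(path, "w", encoding="utf-8") as output:
    for row in rows:
        fields = [row["primaryIdentifier"], row["description"]]
        output.write(u"\t".join(fields) + u"\n")

# Read the file back with the same encoding to check the round trip.
with io.open(path, encoding="utf-8") as handle:
    content = handle.read()
```

Non-string fields such as isObsolete would still need converting to text before joining.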

NGS scrap

Apr 18, 2018

RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR

voom: precision weights unlock linear model analysis tools for RNA-seq read counts