Oct 25, 2019

R : EOF within quoted string - read.delim() vs read.table()

https://kbroman.org/blog/2017/08/08/eof-within-quoted-string/

read.table() uses quote="'\"" (that is, looking for either single- or double-quotes) read.delim() uses quote="\"" (just looking for double-quotes).

To solve quote problem,
In read.table(), use quote="", and for that matter also fill=FALSE

Oct 14, 2019

SAM Format - SAM flags description

https://www.samformat.info/sam-format-flag-single

SAM Format - SAM flags description

- if you type in flag number, it shows descriptions of the flag number

transpose a file with awk

an-efficient-way-to-transpose-a-file-in-bash

Q. transpose a file

X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11

to

X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11

A.

awk '
{ 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {    
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j]; # check FS (ex: space, tab, etc)
        }
        print str
    }
}' file

Oct 8, 2019

MATLAB - Data label on each point in xy scatter

https://stackoverflow.com/questions/7100398/data-label-on-each-entry-in-xy-scatter

p = rand(10,2);
scatter(p(:,1), p(:,2), 'filled')
axis([0 1 0 1])

labels = num2str((1:size(p,1))','%d');    %'
text(p(:,1), p(:,2), labels, 'horizontal','left', 'vertical','bottom','Fontsize', 7)

Sep 20, 2019

regular expression generator

websites for regular expression test

http://www.txt2re.com/index-python.php3

https://regex101.com/

Sep 16, 2019

How to copy files from one machine to another using ssh [duplicate]

https://unix.stackexchange.com/questions/106480/how-to-copy-files-from-one-machine-to-another-using-ssh

Syntax:

scp <source> <destination>

To copy a file from B to A while logged into B:

scp /path/to/file username@a:/path/to/destination

To copy a file from B to A while logged into A:

scp username@b:/path/to/file /path/to/destination

IGV : tracks for normalize coverage Data (need igvtools to make .tdf files)

https://software.broadinstitute.org/software/igv/igvtools
https://software.broadinstitute.org/software/igv/igvtools_commandline

The igvtools utility provides a set of tools for pre-processing data files. File names must contain an accepted file extension, e.g. test-xyz.bam. Tools include:

toTDF
Converts a sorted data input file to a binary tiled data (.tdf) file.
Used to preprocess large datasets for improved IGV performance.
Supported input file formats: .cn, .gct, .igv, .res, .snp, and .wig
Note: This tool was previously known as tile

count
Computes average alignment or feature density for over a specified window size across the genome and outputs a binary tiled data .tdf file, text .wig file, or both depending on inputs.
Used to create a track that can be displayed in IGV, for example as a bar chart.
Supported input file formats: .aligned, .bam, .bed, .psl, .pslx, and .sam

index
Creates an index file for an ASCII alignment or feature file.
Index files are required for loading alignment files into IGV, and can significantly improve performance for large feature files. The file must be in sorted order by start position.
Supported input file formats: .aligned, .bed, .psl, .sam, .bam, and .vcf (v3.2)

sort
Sorts the input file by start position.
Used to prepare data files for tools that required sorted input files.
Supported input file formats: .aligned, .bed, .cn, .igv, .psl, .sam, .bam, and .vcf

From IGV: igvtools is accessed by selecting Tools>Run igvtools.

igvtools count

Supported input file formats are: .sam, .bam, .aligned, .psl, .pslx, and .bed.

Usage:
igvtools count [options] [inputFile] [outputFile] [genome]

The input file must be sorted by start position

http://software.broadinstitute.org/software/igv/book/export/html/6

To load normalized coverage track to IGV

View -> Preferences -> Tracks

Select to normalize tracks containing coverage data in .tdf files that were created using igvtools. This normalization option multiplies each value by [1,000,000 / (totalReadCount)].

This is only available for .tdf files created using igvtools builds dated 1/28/2010 or later. Earlier versions of igvtools did not record the total read count.
This selection takes effect with new sessions loaded afterwards.

Installing CMake

https://cmake.org/install/
https://cmake.org/download/
https://gist.github.com/1duo/38af1abd68a2c7fe5087532ab968574e#file-centos-install-cmake-from-source-md

Compile from source and install

tar zxvf cmake-3.*
cd cmake-3.*
./bootstrap --prefix=/usr/local
make
make install

Validate installation

cmake --version

cmake version *.*.*
CMake suite maintained and supported by Kitware (kitware.com/cmake).

Install, uninstall RPM on linux (CentOS, RHEL, Fedora)

https://phoenixnap.com/kb/how-to-install-rpm-file-centos-linux

Install RPM on Linux

sudo rpm –i sample_file.rpm

sudo yum localinstall sample_file.rpm

Remove RPM Package

sudo rpm –e sample_file.rpm

Sep 4, 2019

Using grep -v on multiple arguments

https://www.thegeekstuff.com/2011/10/grep-or-and-not-operators

Using grep -v on multiple arguments

grep -v -e foo -e bar file

grep -Ev 'word1|word2'

Aug 27, 2019

Index a string in linux

https://unix.stackexchange.com/questions/303960/index-a-string-in-bash

Substring Extraction
${string:position}
Extracts substring from $string at $position.

If the $string parameter is "*" or "@", then this extracts the positional parameters, starting at $position.
${string:position:length}
Extracts $length characters of substring from $string at $position.

stringZ=abcABC123ABCabc
#       0123456789.....
#       0-based indexing.

echo ${stringZ:0}                       # abcABC123ABCabc
echo ${stringZ:1}                       # bcABC123ABCabc
echo ${stringZ:7}                       # 23ABCabc 

echo ${stringZ:7:3}                     # 23A
                                        # Three characters of substring.


# Is it possible to index from the right end of the string?

echo ${stringZ:-4}                      # abcABC123ABCabc
# Defaults to full string, as in ${parameter:-default}.
# However . . . 

echo ${stringZ:(-4)}                    # Cabc
echo ${stringZ: -4}                     # Cabc
# Now, it works.
# Parentheses or added space "escape" the position parameter.

Aug 1, 2019

Extract Reads From a Bam File That Fall Within A Given Region

https://www.biostars.org/p/48719/

# -h : include header
# file should be indexed.bam

samtools view -h input.indexed.bam "Chr1:10000-20000" > output.sam

Jul 30, 2019

TypeError : 'byte' and 'str'

in Python3, 'byte' type was introduced.
It might make TypeErr if run in Python2 and Python3

https://stackoverflow.com/questions/33054527/typeerror-a-bytes-like-object-is-required-not-str-when-writing-to-a-file-in

you can convert 'byte' to 'str'

strType=byteType.decode("utf-8")

https://stackoverflow.com/questions/606191/convert-bytes-to-a-string

Jul 26, 2019

python numpy.isin - implement of matlab's ismember(A, B)

https://docs.scipy.org/doc/numpy/reference/generated/numpy.isin.html?highlight=numpy%20isin#numpy.isin

mask = np.isin(A, B)

idx=np.nonzere(mask) # return index of element A which is in element B

numpy.isin(element, test_elements, assume_unique=False, invert=False)[source]

Calculates element in test_elements, broadcasting over element only. Returns a boolean array of the same shape as element that is True where an element of element is in test_elements and False otherwise.

Parameters:

Parameters:	element : array_like Input array. test_elements : array_like The values against which to test each value of element. This argument is flattened if it is an array or array_like. See notes for behavior with non-array-like parameters. assume_unique : bool, optional If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False. invert : bool, optional If True, the values in the returned array are inverted, as if calculating element not in test_elements. Default is False. `np.isin(a, b, invert=True)` is equivalent to (but faster than) `np.invert(np.isin(a, b))`.
Returns:	isin : ndarray, bool Has the same shape as element. The values element[isin] are in test_elements.

element : array_like: Input array.
test_elements : array_like: The values against which to test each value of element. This argument is flattened if it is an array or array_like. See notes for behavior with non-array-like parameters.
assume_unique : bool, optional: If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False.
invert : bool, optional: If True, the values in the returned array are inverted, as if calculating element not in test_elements. Default is False. np.isin(a, b, invert=True) is equivalent to (but faster than) np.invert(np.isin(a, b)).

Returns:

isin : ndarray, bool: Has the same shape as element. The values element[isin] are in test_elements.

Linux comm command

http://www.unixcl.com/2009/08/linux-comm-command-brief-tutorial.html

From COMM(1) man page, the options available are:

-1 suppress lines unique to FILE1
-2 suppress lines unique to FILE2
-3 suppress lines that appear in both files

comm - compare two sorted files line by line

comm <(sort a.txt) <(sort b.txt)

Jul 11, 2019

Sed Command in Linux

Replacing all the occurrence of the pattern in a line : The substitute flag /g (global replacement) specifies the sed command to replace all the occurrences of the string in the line.

$sed 's/apple/lemon/g' test.txt

https://www.tecmint.com/linux-sed-command-tips-tricks/

https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/

https://likegeeks.com/sed-linux/

Jul 5, 2019

Redirect stderr and stdout in Bash

https://stackoverflow.com/questions/637827/redirect-stderr-and-stdout-in-bash

You can redirect stderr to stdout and the stdout into a file:

some_command >file.log 2>&1

To append file
echo "foo" 1>> bar.txt 2>&1

FastQC read duplication report

clear description of fastQC read duplication report and example

https://www.biostars.org/p/107402/

May 31, 2019

Differential expression of FPKM from RNA-seq data using limma and voom()

Differential expression of FPKM from RNA-seq data using limma and voom()

https://support.bioconductor.org/p/56275/

Q.

I have a count matrix of FPKM values and I want to estimate differentially expressed genes between two conditions. First I used DESeq2, but I realized that this is not good for FPKM values. I then transformed the counts using voom() in the limma package and then used:

fit <- lmFit(myVoomData,design)
fit <- eBayes(fit)
options(digits=3)
writefile = topTable(fit,n=Inf,sort="none", p.value=0.01)
write.csv(writefile, file="file.csv")

My problem is that all of the 6156 genes are differentially expressed (p-value 0.01). Only a few hundred were differentially expressed using DESe2, but I guess that can't be trusted.
I am new to this type of analysis, and to R, but is it ok to simply transform the data by voom()? Can I use the transformed data in DESeq2? Any other ways I can use FPKM counts to estimate differentially expressed genes?

A.

No, it is absolutely not ok to input FPKM values to voom. The results will be nonsense. Same goes for edgeR and DESeq, for the reasons explained in the documentation for those packages.
Note that a matrix of FPKM is not a matrix of counts.
It seems to me that you have three options in decreasing order of desirability:

Get the actual integer counts from which the FPKM were computed and do a proper analysis, for example using voom or edgeR.
Get the gene lengths and library sizes used to compute the FPKM and convert the FPKM back to counts.
If FPKM is really all you have, then convert the values to a log2 scale (y = log2(FPKM+0.1) say) and do an ordinary limma analysis as you would for microarray data, using eBayes() with trend=TRUE. Do not use voom, do not use edgeR, do not use DESeq. (Do not pass go and do not collect $200.) This isn't 100% ideal, but is probably the best analysis available.

The third option is similar to the "limma-trend" analysis described in the limma preprint, except that it is applied to the logFPKM instead of logCPM. Statistically this will not perform as well as it would applied to the logCPM.

What you're doing wrong is trying to do a DE analysis using FPKMs in the first place. There is no good answer to what the FPKM offset should be to avoid taking the log of zero, which is one of the problems with using FPKMs. I would personally use a larger offset than any you have used, certainly not less than 0.1, but then I wouldn't do it at all.
The way to choose a tuning parameter like this is to plot the data to see that it looks vaguely reasonable rather than to count the number of DE genes.
I assume you have filtered the data so that you are not trying to analyse genes for which all, or almost all, the FPKM are zero. Having lots of FPKM=0 would lead to the warnings about zero variances.
I also suggest you used eBayes with robust=TRUE as well as trend=TRUE. That will reduce the sensitivity to zero variances.

How do I use shell variables in an awk script?

How do I use shell variables in an awk script?
https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script

- Using -v (The best way, most portable)

It uses the -v option: (P.S. use a space after -v or it will be less portable. E.g., awk -v var= not awk -vvar=)

variable="line one\nline two"
awk -v var="${variable}" 'BEGIN {print var}'
line one
line two

This should be compatible with most awk and variable is available in the BEGIN block as well:
If you have multiple variables:

awk -v a="${var1}" -v b="${var2}" 'BEGIN {print a,b}'

Calling an external command in Python - subprocess, os.system

How can I call an external command (as if I'd typed it at the Unix shell or Windows command prompt) from within a Python script?

https://stackoverflow.com/questions/89228/calling-an-external-command-in-python/89243#

Here's a summary of the ways to call external programs and the advantages and disadvantages of each:

os.system("some_command with args") passes the command and arguments to your system's shell. This is nice because you can actually run multiple commands at once in this manner and set up pipes and input/output redirection. For example:
```
os.system("some_command < input_file | another_command > output_file")  
```
However, while this is convenient, you have to manually handle the escaping of shell characters such as spaces, etc. On the other hand, this also lets you run commands which are simply shell commands and not actually external programs. See the documentation.
stream = os.popen("some_command with args") will do the same thing as os.system except that it gives you a file-like object that you can use to access standard input/output for that process. There are 3 other variants of popen that all handle the i/o slightly differently. If you pass everything as a string, then your command is passed to the shell; if you pass them as a list then you don't need to worry about escaping anything. See the documentation.
The Popen class of the subprocess module. This is intended as a replacement for os.popen but has the downside of being slightly more complicated by virtue of being so comprehensive. For example, you'd say:
```
print subprocess.Popen("echo Hello World", shell=True, stdout=subprocess.PIPE).stdout.read()
```
instead of:
```
print os.popen("echo Hello World").read()
```
but it is nice to have all of the options there in one unified class instead of 4 different popen functions. See the documentation.
The call function from the subprocess module. This is basically just like the Popen class and takes all of the same arguments, but it simply waits until the command completes and gives you the return code. For example:
```
return_code = subprocess.call("echo Hello World", shell=True)  
```
See the documentation.
If you're on Python 3.5 or later, you can use the new subprocess.run function, which is a lot like the above but even more flexible and returns a CompletedProcess object when the command finishes executing.
The os module also has all of the fork/exec/spawn functions that you'd have in a C program, but I don't recommend using them directly.

The subprocess module should probably be what you use.
Finally please be aware that for all methods where you pass the final command to be executed by the shell as a string and you are responsible for escaping it. There are serious security implications if any part of the string that you pass can not be fully trusted. For example, if a user is entering some/any part of the string. If you are unsure, only use these methods with constants.
To give you a hint of the implications consider this code:

print subprocess.Popen("echo %s " % user_input, stdout=PIPE).stdout.read()

and imagine that the user enters "my mama didnt love me && rm -rf /".

ChIP-seq : read coverage and input

What is the minimum million reads that I need to have a good result of chip-seq in human?

https://www.researchgate.net/post/What_is_the_minimum_million_reads_that_I_need_to_have_a_good_result_of_chip-seq_in_human

What matters is unique reads rather than total reads. ChIP-Seq runs that give 20-30 million unique reads will work quite well for most factors.

How many input samples do you need to add to ChIP-Seq library?

https://www.researchgate.net/post/How_many_input_samples_do_you_need_to_add_to_ChIP-Seq_library

best control is the input split from the IP reaction, but given economic constraints I would argue a pooled input will (on the whole) give similar results.

Question about inputs for ChIP-seq

http://seqanswers.com/forums/showthread.php?t=35377

Coverage depth recommendations

https://www.illumina.com/science/education/sequencing-coverage.htm

Sequencing Coverage Calculator

https://support.illumina.com/downloads/sequencing_coverage_calculator.html

Chromatin immunoprecipitation (ChIP) of plant transcription factors followed by sequencing (ChIP-SEQ) or hybridization to whole genome arrays (ChIP-CHIP)

Nature Protocols volume 5, pages 457–472 (2010)

https://www.nature.com/articles/nprot.2009.244

	Library 1 (GAI)	Library 2 (GAI)	Library 3 (GAII)	Library 4 (GAII)
Number of reads	4,334,840	4,105,326	8,367,681	8,638,004
Number of [A]×20 reads	14,074 (0.3 %)	47,906 (1.2 %)	915 (0.0001%)	840 (0.0001%)
Number of adapter reads	273,576 (6.3%)	13,582 (0.3 %)	1,389,757 (8.3 %)	4,796,178 (27.8%)
Number of QC reads	377,836 (8.7 %)	265,027 (6.4 %)	2,277,630 (27.2 %)	2,821,913 (32.7%)
Number of uniquely mapped reads	1,649,875 (38%)	2,084,386 (51%)	3,898,721 (46.6 %)	736,082 (8.5 %)

ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia

https://genome.cshlp.org/content/22/9/1813.long

https://genohub.com/recommended-sequencing-coverage-by-application/

Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003326

How to tell which library type to use in bowtie, tophat (fr-unstrand, -firststrand -secondstrand)?

How to tell which library type to use (fr-unstrand, -firststrand or secondstrand)?

http://onetipperday.sterding.com/2012/07/how-to-tell-which-library-type-to-use.html

Normalization method of DESeq2, CPM, TPM, FPKM, TMM

Clear description for normalization method and DESeq2 normalization process (recommend)

https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html

Gene expression units explained: RPM, RPKM, FPKM, TPM, DESeq, TMM, SCnorm, GeTMM, and ComBat-Seq : include python, R code (recommend)

https://www.reneshbedre.com/blog/expression_units.html

calculation of CPM, TPM, FPKM ; R code (recommend)

https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/

How to normalized TPM with TMM method?

https://www.biostars.org/p/388584/

->you have TMM factors and want to compute TPMs.

TMM factors can in principle be used to compute TPMs. In edgeR, any downstream quantity that is computed from the library sizes will incorporate the TMM factors automatically, because the factors are considered part of the effective library sizes. TMM normalization factors will be applied automatically when you use

CPM <- cpm(dge)

RPKM <- rpkm(dge)

in edgeR to compute CPMs or RPKMs from a DGEList object. I don't necessarily recommend TPM values myself, but if you go on to compute TPMs by

TPM <- t( t(RPKM) / colSums(RPKM) ) * 1e6

then the TMM factors will naturally have been incorporated into the computation.

TMM (edgeR), RLE (DESeq2), and MRN Normalization Methods comparison

https://www.frontiersin.org/articles/10.3389/fgene.2016.00164/full

Positional sequence bias in random primed libraries (5' bias in RNA-seq)

Positional sequence bias in random primed libraries

https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/

RNA-Seq library with random primer has biased seqeunce in 5' region.

Trimming 5' does not solve this issue

This bias does not seem to have any serious problems for downstream analysis

If some libraries do not show this bias among RNA-seq libraries, those libraries might need double check for quality control.

page