data visualization with ggplot2
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
Aug 8, 2017
Aug 3, 2017
nohup - running a program in the background on Linux
nohup ./workflow.sh & # run in the background, immune to hangups
Output is saved to nohup.out by default.
To terminate the nohup'd process:
kill -9 <PID> # you can find the PID with the 'top' command
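The same pattern can be rehearsed with sleep standing in for the real workflow script (sleep 60 is just a placeholder command):

```shell
# start a long-running job immune to hangups; stdout/stderr go to nohup.out
nohup sleep 60 > nohup.out 2>&1 &
pid=$!              # capture the PID right away instead of hunting for it in top

# later, terminate it (plain kill sends SIGTERM; keep -9 as a last resort)
kill "$pid"
```

Capturing `$!` immediately after launching is usually more reliable than finding the PID in top.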
Installing TensorFlow - pip upgrade
for details, check https://www.tensorflow.org/install/install_linux
If step 4 failed, try upgrading pip:
(tensorflow)$ pip install --upgrade pip
Then try step 4 again.
Installing with virtualenv
Take the following steps to install TensorFlow with Virtualenv:
- Install pip and virtualenv by issuing one of the following commands:
- Create a virtualenv environment by issuing one of the following commands, where targetDirectory specifies the top of the virtualenv tree. The instructions assume that targetDirectory is ~/tensorflow, but you may choose any directory.
- Activate the virtualenv environment by issuing one of the following commands. The preceding source command should change your prompt.
- Issue one of the following commands to install TensorFlow in the active virtualenv environment. If the preceding command succeeds, skip Step 5; if it fails, perform Step 5.
- (Optional) If Step 4 failed (typically because you invoked a pip version lower than 8.1), install TensorFlow in the active virtualenv environment by issuing a command of the following format, where tfBinaryURL identifies the URL of the TensorFlow Python package. The appropriate value of tfBinaryURL depends on the operating system, Python version, and GPU support. Find the appropriate value of tfBinaryURL for your system here. For example, if you are installing TensorFlow for Linux, Python 2.7, and CPU-only support, issue the corresponding command to install TensorFlow in the active virtualenv environment.
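The commands elided from the steps above roughly correspond to the following sketch for Ubuntu Linux with Python 2.7 and CPU-only support (the package names are assumptions based on the 2017-era TensorFlow install guide; check the linked page for your platform):

```shell
# Step 1: install pip and virtualenv (Ubuntu package names assumed)
sudo apt-get install python-pip python-dev python-virtualenv

# Step 2: create the virtualenv tree at targetDirectory (~/tensorflow here)
virtualenv --system-site-packages ~/tensorflow

# Step 3: activate it; the prompt should change to "(tensorflow)$"
source ~/tensorflow/bin/activate

# Step 4: install TensorFlow inside the active virtualenv
pip install --upgrade tensorflow

# Step 5 (only if Step 4 fails): install from an explicit wheel URL
# pip install --upgrade <tfBinaryURL>
```

The `<tfBinaryURL>` placeholder must be replaced with the wheel URL for your OS, Python version, and GPU support, as described above.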
Jun 19, 2017
SRA Toolkit
Update (2022-09-16): make sure your SRA Toolkit is up to date.
https://github.com/ncbi/sra-tools/wiki
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc
Frequently Used Tools:
fastq-dump: Convert SRA data into fastq format
prefetch: Allows command-line downloading of SRA, dbGaP, and ADSP data
sam-dump: Convert SRA data to sam format
sra-pileup: Generate pileup statistics on aligned SRA data
vdb-config: Display and modify VDB configuration information
vdb-decrypt: Decrypt non-SRA dbGaP data ("phenotype data")
Additional Tools:
abi-dump: Convert SRA data into ABI format (csfasta / qual)
illumina-dump: Convert SRA data into Illumina native formats (qseq, etc.)
sff-dump: Convert SRA data to sff format
sra-stat: Generate statistics about SRA data (quality distribution, etc.)
vdb-dump: Output the native VDB format of SRA data.
vdb-encrypt: Encrypt non-SRA dbGaP data ("phenotype data")
vdb-validate: Validate the integrity of downloaded SRA data
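A typical download-and-convert session chaining these tools might look like the following sketch (SRR390728 is the accession used in the fastq-dump examples later in this post; this assumes the toolkit is installed and configured):

```shell
# download the run locally first (more robust than streaming the conversion)
prefetch SRR390728

# verify the integrity of the download
vdb-validate SRR390728

# convert to fastq, splitting paired-end mates into separate files
fastq-dump --split-files SRR390728
```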
SRAtoolkit - fastq-dump
https://edwards.sdsu.edu/research/fastq-dump/
A good review of how to use the fastq-dump options.
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump
fastq-dump: Convert SRA data into fastq format
Usage:
fastq-dump [options] <path/file> [<path/file> ...]
fastq-dump [options] <accession>
Frequently Used Options:
General:
-h | --help               Displays ALL options, general usage, and version information.
-V | --version            Display the version of the program.

Data formatting:
--split-files             Dump each read into a separate file. Files will receive a suffix corresponding to the read number.
--split-spot              Split spots into individual reads.
--fasta <[line width]>    FASTA only, no qualities. Optional line wrap width (set to zero for no wrapping).
-I | --readids            Append read id after spot id as 'accession.spot.readid' on defline.
-F | --origfmt            Defline contains only original sequence name.
-C | --dumpcs <[cskey]>   Formats sequence using color space (default for SOLiD). "cskey" may be specified for translation.
-B | --dumpbase           Formats sequence using base space (default for other than SOLiD).
-Q | --offset <integer>   Offset to use for ASCII quality scores. Default is 33 ("!").

Filtering:
-N | --minSpotId <rowid>  Minimum spot id to be dumped. Use with "-X" to dump a range.
-X | --maxSpotId <rowid>  Maximum spot id to be dumped. Use with "-N" to dump a range.
-M | --minReadLen <len>   Filter by sequence length >= <len>.
--skip-technical          Dump only biological reads.
--aligned                 Dump only aligned sequences. Aligned datasets only; see sra-stat.
--unaligned               Dump only unaligned sequences. Will dump all for unaligned datasets.

Workflow and piping:
-O | --outdir <path>      Output directory; default is the current working directory ('.').
-Z | --stdout             Output to stdout; all split data become joined into a single stream.
--gzip                    Compress output using gzip.
--bzip2                   Compress output using bzip2.
Use examples:

fastq-dump -X 5 -Z SRR390728
Prints the first five spots (-X 5) to standard out (-Z). This is a useful starting point for verifying other formatting options before dumping a whole file.

fastq-dump -I --split-files SRR390728
Produces two fastq files (--split-files) containing ".1" and ".2" read suffixes (-I) for paired-end data.

fastq-dump --split-files --fasta 60 SRR390728
Produces two (--split-files) fasta files (--fasta) with 60 bases per line ("60" included after --fasta).

fastq-dump --split-files --aligned -Q 64 SRR390728
Produces two fastq files (--split-files) that contain only aligned reads (--aligned; note: only for files submitted as aligned data), with a quality offset of 64 (-Q 64). Please see the documentation on vdb-dump if you wish to produce fasta/qual data.
Possible errors and their solutions:

fastq-dump.2.x err: item not found while constructing within virtual database module - the path '<path/SRR*.sra>' cannot be opened as database or table
This error indicates that the .sra file cannot be found. Confirm that the path to the file is correct.

fastq-dump.2.x err: name not found while resolving tree within virtual file system module - failed SRR*.sra
The data are likely reference compressed and the toolkit is unable to acquire the reference sequence(s) needed to extract the .sra file. Please confirm that you have tested and validated the configuration of the toolkit. If you have elected to prevent the toolkit from contacting NCBI, you will need to manually acquire the reference(s) here.
Jun 3, 2017
python - make a time delay
https://stackoverflow.com/questions/510348/how-can-i-make-a-time-delay-in-python
import time
time.sleep(5) # delays for 5 seconds
Here is another example where something is run once a minute:
import time
while True:
    print("This prints once a minute.")
    time.sleep(60)  # delay for 1 minute (60 seconds)
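The delay can be verified by timing it with time.monotonic() (Python 3; the 0.5-second value is just an illustration):

```python
import time

start = time.monotonic()
time.sleep(0.5)               # pause the program for half a second
elapsed = time.monotonic() - start
print(elapsed >= 0.5)         # sleep lasts at least the requested time → True
```

time.monotonic() is preferred over time.time() for measuring intervals because it is unaffected by system clock adjustments.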
python - os.system, subprocess to spawn new processes
https://docs.python.org/2/library/subprocess.html
https://stackoverflow.com/questions/18739239/python-how-to-get-stdout-after-running-os-system?noredirect=1&lq=1
https://stackoverflow.com/questions/3791465/python-os-system-for-command-line-call-linux-not-returning-what-it-should
import os
os.system('ls')
from subprocess import call
call('ls')
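os.system() returns only the command's exit status; to capture what the command prints (the subject of the linked Stack Overflow questions), subprocess.check_output() is one option:

```python
import subprocess

# check_output runs the command and returns its stdout as bytes
out = subprocess.check_output(['echo', 'hello'])
print(out.decode().strip())  # → hello
```

Passing the command as a list avoids invoking a shell, which is safer when arguments come from user input.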
Apr 13, 2017
TOP command examples on Linux to monitor processes
for more information, visit
http://www.binarytides.com/linux-top-command/
$ top
top - 18:50:35 up 9:05, 5 users, load average: 0.68, 0.52, 0.39
Tasks: 254 total, 1 running, 252 sleeping, 0 stopped, 1 zombie
%Cpu(s): 2.3 us, 0.5 sy, 0.0 ni, 97.1 id, 0.2 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 8165300 total, 6567896 used, 1597404 free, 219232 buffers
KiB Swap: 1998844 total, 0 used, 1998844 free. 2445372 cached Mem

  PID USER     PR NI    VIRT    RES   SHR S %CPU %MEM    TIME+ COMMAND
17952 enlight+ 20  0 1062096 363340 88068 S  4.8  4.4  0:49.33 chrome
14294 enlight+ 20  0  954752 203548 61404 S  2.1  2.5  2:00.91 chrome
 1364 root     20  0  519048 105704 65348 S  0.6  1.3 17:31.27 Xorg
19211 enlight+ 20  0  576608  47216 39136 S  0.6  0.6  0:01.01 konsole
   13 root     rt  0       0      0     0 S  0.3  0.0  0:00.10 watchdog/1
   25 root     20  0       0      0     0 S  0.3  0.0  0:03.49 rcuos/2
 1724 enlight+ 20  0  430144  36456 32608 S  0.3  0.4  0:03.60 akonadi_contact
 1869 enlight+ 20  0  534708  52700 38132 S  0.3  0.6  0:53.94 yakuake
14040 enlight+ 20  0  858176 133944 61152 S  0.3  1.6  0:09.89 chrome

USER - The system user account running the process.
%CPU - CPU usage by the process.
%MEM - Memory usage by the process
COMMAND - The command (executable file) of the process
Display full command path and arguments of process - 'c'
Press 'c' to display the full command path along with the command-line arguments in the COMMAND column.
%CPU %MEM    TIME+ COMMAND
 0.0  0.0  0:00.00 /usr/bin/dbus-launch --exit-with-session /usr/bin/im-laun+
 0.0  0.1  0:01.52 /usr/bin/dbus-daemon --fork --print-pid 5 --print-address+
 0.0  0.3  0:00.41 /usr/bin/kwalletd --pam-login 17 20
 0.0  0.0  0:00.00 /usr/lib/x86_64-linux-gnu/libexec/kf5/start_kdeinit --kde+
 0.0  0.3  0:01.55 klauncher [kdeinit5] --fd=9
 0.0  0.2  0:00.13 /usr/lib/telepathy/mission-control-5
 0.0  0.1  0:00.00 /usr/lib/dconf/dconf-service
 0.0  0.4  0:01.41 /usr/lib/x86_64-linux-gnu/libexec/kdeconnectd
 0.0  0.2  0:01.09 /usr/lib/x86_64-linux-gnu/libexec/kf5/kscreen_backend_lau+
Display all CPU cores - '1'
Pressing '1' will display the load information for individual CPU cores. Here is how it looks:
top - 10:45:47 up 1:42, 5 users, load average: 0.81, 1.14, 0.94
Tasks: 260 total, 2 running, 257 sleeping, 0 stopped, 1 zombie
%Cpu0 : 3.6 us, 3.6 sy, 0.0 ni, 92.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 3.1 us, 3.6 sy, 0.0 ni, 93.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 7.6 us, 1.8 sy, 0.0 ni, 90.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 9.6 us, 2.6 sy, 0.0 ni, 87.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 8165300 total, 7118864 used, 1046436 free, 204224 buffers
KiB Swap: 1998844 total, 0 used, 1998844 free. 3410364 cached Mem
Batch mode
Top also supports batch mode output, where it keeps printing information sequentially instead of updating a single screen. This is useful when you need to log top output for later analysis.
Here is a simple example that shows the CPU usage at intervals of 1 second:
$ top -d 1.0 -b | grep Cpu
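Adding -n bounds the number of iterations so the batch-mode command terminates on its own, which is handy for logging; a sketch (cpu_usage.log is an arbitrary file name):

```shell
# take 3 snapshots one second apart and keep only the %Cpu summary lines
top -b -n 3 -d 1 | grep Cpu > cpu_usage.log
cat cpu_usage.log
```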
Linux find command
for more information, visit
http://www.binarytides.com/linux-find-command-examples/
basic syntax
$ find location comparison-criteria search-term
searches for files by their name
$ find ./test -name "abc.txt"
./test/abc.txt
wildcards
$ find ./test -name "*.php"
./test/subdir/how.php
./test/cool.php
All subdirectories are searched recursively, so this is a very powerful way to find all files of a given extension.
Searching the "/" directory, which is the root, would search the entire file system, including mounted devices and network storage devices, so be careful. Of course, you can press Ctrl + C at any time to stop the command.
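The wildcard search can be reproduced end to end with a throwaway directory (the /tmp/findtest path and file names here are made up for the demonstration):

```shell
# build a small test tree mirroring the example above
mkdir -p /tmp/findtest/subdir
touch /tmp/findtest/cool.php /tmp/findtest/subdir/how.php /tmp/findtest/abc.txt

# the wildcard match descends into subdirectories automatically
find /tmp/findtest -name "*.php"

# clean up when done:
# rm -r /tmp/findtest
```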