NGS scrap: 2022

Dec 22, 2022

HowTo: Access SRA Data

https://github.com/ncbi/sra-tools/wiki/HowTo:-Access-SRA-Data

Use the prefetch tool included in the SRA Toolkit.

Example of prefetch usage:

$ prefetch SRR1482462
Maximum file size download limit is 20,971,520KB

2015-02-19T13:20:06 prefetch.2.4.4: 1) Downloading 'SRR1482462'...
2015-02-19T13:20:06 prefetch.2.4.4:  Downloading via fasp...
2015-02-19T13:20:32 prefetch.2.4.4:  fasp download succeed
2015-02-19T13:20:32 prefetch.2.4.4: 1) 'SRR1482462' was downloaded successfully
2015-02-19T13:20:35 prefetch.2.4.4: 'SRR1482462' has 22 dependencies
2015-02-19T13:20:36 prefetch.2.4.4: 2) Downloading 'ncbi-acc:NC_000067.5?vdb-ctx=refseq'...
2015-02-19T13:20:36 prefetch.2.4.4:  Downloading via fasp...
2015-02-19T13:20:41 prefetch.2.4.4:  fasp download succeed
2015-02-19T13:20:41 prefetch.2.4.4: 2) 'ncbi-acc:NC_000067.5?vdb-ctx=refseq' was downloaded successfully
2015-02-19T13:20:41 prefetch.2.4.4: 3) Downloading 'ncbi-acc:NC_000068.6?vdb-ctx=refseq'...
2015-02-19T13:20:41 prefetch.2.4.4:  Downloading via fasp...
2015-02-19T13:20:46 prefetch.2.4.4:  fasp download succeed
2015-02-19T13:20:46 prefetch.2.4.4: 3) 'ncbi-acc:NC_000068.6?vdb-ctx=refseq' was downloaded successfully
2015-02-19T13:20:46 prefetch.2.4.4: 4) Downloading 'ncbi-acc:NC_000069.5?vdb-ctx=refseq'...
2015-02-19T13:20:46 prefetch.2.4.4:  Downloading via fasp...
2015-02-19T13:20:51 prefetch.2.4.4:  fasp download succeed
2015-02-19T13:20:51 prefetch.2.4.4: 4) 'ncbi-acc:NC_000069.5?vdb-ctx=refseq' was downloaded successfully
...

As can be seen from the output above, prefetch performs several steps:

1. Check the size of the file being downloaded
   If the file is very large, prefetch must be given a higher download limit, e.g.:
   $ prefetch --max-size 100000000 SRR1482462
2. Download the requested file
   The file is downloaded using Aspera if available on your system, or HTTPS otherwise.
3. Put the file into its proper place
   The file is downloaded into your designated cache area. This permits VDB name resolution to work as designed.
4. Recursively download missing external reference sequences
   Most SRA files require additional sequence files in order to reconstruct original reads. prefetch ensures that you download not only the main file but also all of its dependencies.
5. Access dbGaP encrypted data
   prefetch will make use of download and decryption keys that have been added to the SRA Toolkit configuration to obtain authorization for the download, in addition to performing all of the steps above. (N.B. In order to access dbGaP data, you will need to change directory or "cd" to the dbGaP project's workspace.)

prefetch will also operate on existing, previously downloaded files to recursively download any missing external reference sequences.
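
The default limit in the log above is easier to read once converted: 20,971,520 KB is 20 x 1024 x 1024 KB, i.e. 20 GiB. A minimal Python sketch of that arithmetic and of assembling the command, assuming --max-size is interpreted in KB as the log line suggests (other toolkit releases may differ, so check prefetch --help). The helper names gib_to_kb and build_prefetch_cmd are made up for illustration, and the command is only built here, not executed:

```python
def gib_to_kb(gib):
    """Convert GiB to KB (1 GiB = 1024 * 1024 KB)."""
    return int(gib * 1024 * 1024)


def build_prefetch_cmd(accession, max_size_kb=None):
    """Return the prefetch command as an argument list (hypothetical helper)."""
    cmd = ["prefetch"]
    if max_size_kb is not None:
        # Assumption: the limit is given in KB, as in the log output above.
        cmd += ["--max-size", str(max_size_kb)]
    cmd.append(accession)
    return cmd


print(gib_to_kb(20))
print(build_prefetch_cmd("SRR1482462", max_size_kb=100000000))
```

The returned list can be handed to subprocess.run(cmd, check=True) once the SRA Toolkit is on PATH.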

Nov 28, 2022

Awk If Statement Examples

https://www.thegeekstuff.com/2010/02/awk-conditional-statements/

if

$ awk '{
if ($3 == "" || $4 == "" || $5 == "")
	print "score of the student",$1,"is missing";
}'

if else

$ awk '{
if ($3 >= 80 && $4 >= 80 && $5 >= 80)
	print $0,"=>","Pass";
else
	print $0,"=>","Fail";
}'

else if

$ cat calc_grade.awk
{
total=$3+$4+$5;
mean=total/3;
if ( mean >= 90 ) grade="A";
else if ( mean >= 80) grade ="B";
else if (mean >= 70) grade ="C";
else grade="D";

print $0,"=>",grade;
}

$ awk -f calc_grade.awk student-recort
AAA 2111 70 80 75 => C
BBB 2123 60 55 40 => D
CCC 2212 40 42 => D
DDD 2313 88 98 91 => A
EEE 2411 30 45 => D
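
Rows CCC and EEE have only two scores but still get a grade, because awk treats the missing $5 as 0 in arithmetic. A rough Python equivalent of calc_grade.awk makes that padding explicit. This is a sketch for comparison only; the function name calc_grade and the zero-padding are my own way of mimicking the awk behaviour:

```python
def calc_grade(line):
    """Mirror calc_grade.awk: average fields 3-5 and map the mean to a grade."""
    fields = line.split()
    scores = [float(x) for x in fields[2:5]]
    # awk treats a missing field as 0 in arithmetic, so pad short rows the same way.
    scores += [0.0] * (3 - len(scores))
    mean = sum(scores) / 3
    if mean >= 90:
        return "A"
    elif mean >= 80:
        return "B"
    elif mean >= 70:
        return "C"
    return "D"


for row in ["AAA 2111 70 80 75", "CCC 2212 40 42", "DDD 2313 88 98 91"]:
    print(row, "=>", calc_grade(row))
```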

Jul 1, 2022

UMI : Unique Molecular Identifier, What and Why?

What are UMIs and why are they used in high-throughput sequencing?

https://dnatech.genomecenter.ucdavis.edu/faqs/what-are-umis-and-why-are-they-used-in-high-throughput-sequencing/

Software:
UMI-Tools: https://github.com/CGATOxford/UMI-tools
zUMIs: https://github.com/sdparekh/zUMIs
fastp: https://github.com/OpenGene/fastp (transfer of UMIs into read IDs)
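
What "transfer of UMIs into read IDs" means, as a minimal Python sketch: cut the UMI bases off the start of the read and append them to the read name, so the UMI survives alignment. It assumes the UMI sits at the 5' end of the read; the function name and the "_" separator are illustrative only, and each real tool has its own options and header format:

```python
def move_umi_to_read_id(header, seq, qual, umi_len, sep="_"):
    """Cut the first umi_len bases off a read and append them to its ID.

    Hypothetical helper: assumes the UMI is at the 5' end of the read.
    """
    umi = seq[:umi_len]
    # Only the first whitespace-delimited token of the header is the read ID.
    read_id, _, comment = header.partition(" ")
    new_header = read_id + sep + umi
    if comment:
        new_header += " " + comment
    # Trim sequence and quality string together so they stay the same length.
    return new_header, seq[umi_len:], qual[umi_len:]


print(move_umi_to_read_id("@read1 1:N:0", "ACGTTTGGCA", "IIIIIIIIII", 4))
```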

Fu, Y., Wu, PH., Beane, T. et al. Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers. BMC Genomics 19, 531 (2018).

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4933-1

A higher number of unique combinations can be achieved simply by increasing the number of random-nucleotide positions. The number of UMI combinations must be sufficiently large because as mentioned above, the chance that two cDNA molecules with identical sequences in the starting pool are tagged with the same UMI combination needs to be infinitesimally small.
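
To put numbers on this: n random positions give 4^n possible UMIs, and the chance that identical molecules share a UMI is a birthday-problem calculation. A minimal Python sketch, assuming every UMI is equally likely (real UMI pools are not perfectly uniform, so treat the result as a lower bound on collisions); the function names are my own:

```python
def umi_combinations(n_positions):
    """Number of distinct UMIs with n random A/C/G/T positions."""
    return 4 ** n_positions


def collision_probability(n_molecules, n_positions):
    """Chance that at least two of n_molecules identical molecules get the
    same UMI, assuming UMIs are drawn uniformly and independently."""
    n_umis = umi_combinations(n_positions)
    if n_molecules > n_umis:
        return 1.0
    p_all_distinct = 1.0
    for k in range(n_molecules):
        p_all_distinct *= (n_umis - k) / n_umis
    return 1.0 - p_all_distinct


for n in (4, 8, 12):
    print(n, umi_combinations(n), collision_probability(10, n))
```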

cyvcf2 : cython + htslib built for fast parsing of Variant Call Format (VCF)

https://github.com/brentp/cyvcf2

cyvcf2 is a cython wrapper around htslib built for fast parsing of Variant Call Format (VCF) files.

Mar 31, 2022

Download FASTQ files from European Nucleotide Archive (ENA)

https://github.com/wwood/ena-fast-download

Requirements

aspera client : https://downloads.asperasoft.com/en/downloads/8?list
curl
Python 3

# Add Aspera to PATH; adjust the directory to match your own installation
PATH=$PATH:/home/lee/.aspera/connect/bin
export PATH

usage: ena-fast-download.py [-h] [--output_directory OUTPUT_DIRECTORY]
[--ssh_key SSH_KEY ( for OSX) ]
run_identifier

ena-fast-download.py --output_directory /output/directory ERR1739691
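
The same setup from Python, as a rough sketch: extend PATH with the Aspera directory, check that the ascp binary can be found, and build the download command. The install location ~/.aspera/connect/bin is an assumption (adjust to your system), and both helper names are made up for illustration. The command is only assembled, not run:

```python
import os
import shutil


def path_with_aspera(current_path, aspera_bin):
    """Append the Aspera bin directory to a PATH string unless already present."""
    parts = current_path.split(os.pathsep) if current_path else []
    if aspera_bin not in parts:
        parts.append(aspera_bin)
    return os.pathsep.join(parts)


def build_ena_cmd(run_id, output_directory):
    """Argument list matching the usage line above (hypothetical helper)."""
    return ["ena-fast-download.py", "--output_directory", output_directory, run_id]


aspera_bin = os.path.expanduser("~/.aspera/connect/bin")  # assumed install location
new_path = path_with_aspera(os.environ.get("PATH", ""), aspera_bin)
print("ascp found:", shutil.which("ascp", path=new_path) is not None)
print(build_ena_cmd("ERR1739691", "/output/directory"))
```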

Babyplots : interactive 3D graphs

Babyplots Documentation

Babyplots is an easy to use library for creating interactive 3d graphs for exploring and presenting data.

Babyplots is available as a JavaScript library, as an R package, as a Python package, and as an add-in for Microsoft PowerPoint. While the R package, Python package and JavaScript library allow the creation of new plots, the PowerPoint add-in can only be used to display exported plots. This website also provides an interactive node-based editor for creating babyplots visualizations called NPC (node plot creator) or simply Creator.

Find the individual documentation pages through the links below:

https://bp.bleb.li/documentation/

Feb 24, 2022

How to Use t-SNE Effectively

https://distill.pub/2016/misread-tsne/

1. Hyperparameter

- suggested perplexity values are in the range 5 to 50

- iterate until reaching a stable configuration.

2. Cluster sizes in a t-SNE plot mean nothing

- t-SNE expands dense clusters and contracts sparse ones, evening out cluster sizes

3. Distances between clusters might not mean anything

- there may not be one perplexity value that captures distances across all clusters

- perplexity is a global parameter.

4. Random noise doesn’t always look random.

- need to try various perplexity values

5. You can see some shapes, sometimes

- need to try various perplexity values

6. For topology, you may need more than one plot

- need to try various perplexity values
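
Why perplexity behaves like a "number of neighbors" setting: in the original t-SNE formulation, perplexity is 2^H, where H is the Shannon entropy (in bits) of a point's Gaussian neighbor distribution, and the bandwidth sigma is tuned per point to hit the requested value. A minimal Python sketch of that definition for a single point; the distances and sigma values are arbitrary illustration data and the function names are my own, not part of any t-SNE library:

```python
import math


def perplexity(probabilities):
    """Perplexity 2**H of a discrete distribution (H = Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


def neighbor_probabilities(distances, sigma):
    """Gaussian similarities of one point to its neighbors, normalized to sum to 1."""
    weights = [math.exp(-(d ** 2) / (2 * sigma ** 2)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


distances = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up distances to five neighbors
for sigma in (0.5, 1.0, 5.0):
    print(sigma, round(perplexity(neighbor_probabilities(distances, sigma)), 2))
```

A narrow sigma puts almost all weight on the nearest neighbor (perplexity near 1), while a wide sigma spreads it over all five (perplexity near 5).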

Jan 26, 2022

PlantSeg : tool for cell instance aware segmentation in densely packed 3D volumetric images.

https://github.com/hci-unihd/plant-seg

Install PlantSeg

conda create -n plant-seg -c pytorch -c conda-forge cudatoolkit=10.1 -c lcerrone -c abailoni -c cpape -c awolny pytorch nifty=1.0.9 plantseg

To install PyTorch for a specific cudatoolkit version

conda install pytorch  cudatoolkit=10.1 -c pytorch
plantseg --gui

To select a specific CUDA GPU device when running PlantSeg

CUDA_VISIBLE_DEVICES=0 plantseg --gui

To check that CUDA is available in PyTorch

import torch
torch.cuda.is_available()

To check GPU usage

nvidia-smi -l 1

To check CUDA version

nvcc -V

How To Use GPU with PyTorch

https://wandb.ai/wandb/common-ml-errors/reports/How-To-Use-GPU-with-PyTorch---VmlldzozMzAxMDk

How to run Python code from the terminal in multiple sessions with multiple GPUs

1. Set the CUDA device on the command line

$ CUDA_VISIBLE_DEVICES=0 python test1.py # Uses GPU 0.

$ CUDA_VISIBLE_DEVICES=1 python test2.py # Uses GPU 1.

$ CUDA_VISIBLE_DEVICES=2,3 python test3.py # Uses GPUs 2 and 3.

2. Set it in the Python code, before the framework initializes CUDA

import os

os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"

os.environ["CUDA_VISIBLE_DEVICES"]="0"
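
The two approaches above can also be combined when launching jobs from a Python driver script: build a per-job environment and pass it to the child process. A minimal sketch using only the standard library; env_for_gpus is a made-up helper name, and the child here is a one-line stand-in for a real training script that simply reports the variable it received:

```python
import os
import subprocess
import sys


def env_for_gpus(gpu_ids):
    """Copy the current environment and restrict the visible CUDA devices."""
    env = dict(os.environ)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


# Stand-in for a real script such as test3.py: it only prints what it was given.
child_code = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env_for_gpus([2, 3]),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```

Because the variable is set in the child's environment before it starts, the parent process keeps its own GPU visibility unchanged.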

ref:

https://stackoverflow.com/questions/34775522/tensorflow-multiple-sessions-with-multiple-gpus

https://stackoverflow.com/questions/37893755/tensorflow-set-cuda-visible-devices-within-jupyter