Question: Best bioinfo one-liners?
22 votes • Manu Prestat (Marseille, France) wrote:

While countless efficient tools exist out there, for simple tasks it is sometimes still quicker to run a single Linux command. I'm starting by sharing 3 I use quite often.

 

##1 get the sequence length distribution from a fastq file using awk
zcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'
##2 Reverse complement a sequence (I use that a lot when I need to design primers)
echo 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'
##3 split a multifasta file into single ones with csplit:
csplit -z -q -n 4 -f sequence_ sequences.fasta /\>/ {*}
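One caveat on #2: tr only translates the characters you list, so lower-case bases pass through unchanged. A variant that handles both cases (input sequence made up; ambiguity codes such as N are left as-is):

```shell
echo 'attgCTATGCTNNNt' | rev | tr 'ACGTacgt' 'TGCAtgca'
# -> aNNNAGCATAGcaat
```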

 

I may be wrong, but I've not found such a list on Biostars.

So, what comes to your mind? I hope this post will yield some gold nuggets ;-)

 

linux gnu • 3.3k views
modified 3.6 years ago by h.mon • written 3.6 years ago by Manu Prestat

Paging umer.zeeshan.ijaz

written 3.6 years ago by Dan D

I remember I saw this: Useful Bash Commands To Handle Fasta Files

written 3.6 years ago by sentausa
6 votes • Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

linearize fasta:

cat file.fasta | awk '/^>/{if(N>0) printf("\n"); ++N; printf("%s\t",$0);next;} {printf("%s",$0);}END{printf("\n");}'
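On a toy two-record file (names made up), this produces one tab-separated line per record:

```shell
printf '>seq1\nACGT\nAC\n>seq2\nTTTT\n' > toy.fa
cat toy.fa | awk '/^>/{if(N>0) printf("\n"); ++N; printf("%s\t",$0);next;} {printf("%s",$0);}END{printf("\n");}'
# -> >seq1	ACGTAC
#    >seq2	TTTT
```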

 

 

written 3.6 years ago by Pierre Lindenbaum

Just a bit shorter? :)

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file.fa
written 3.6 years ago by iraun

Wow, a very compact way to handle multiline entries. Excellent!

written 3.6 years ago by Manu Prestat

Even shorter! :)

awk 'NR==1 {print ; next} {printf (/^>/) ? "\n"$0"\n" : $1}' file.fas
written 3.6 years ago by Frédéric Mahé

@Frederic: doesn't work.

written 3.6 years ago by Pierre Lindenbaum

@Pierre: hmm, it works well on my setup (GNU Awk 4.1.1). Maybe a problem with older Awk versions (Awk < 4.0)?

modified 3.6 years ago • written 3.6 years ago by Frédéric Mahé
5 votes • David Langenberger (Deutschland) wrote:
## fastq2fasta
zcat file.fastq.gz | paste - - - - | perl -F'\t' -ane 'substr($F[0],0,1)=">"; print "$F[0]\n$F[1]\n";' | gzip -c > file.fasta.gz

## bam2bed
samtools view file.bam | perl -F'\t' -ane '$strand=($F[1]&16)?"-":"+";$length=0;$tmp=$F[5];$tmp =~ s/(\d+)[MD]/$length+=$1/eg;print "$F[2]\t".($F[3]-1)."\t".($F[3]-1+$length)."\t$F[0]\t0\t$strand\n";' > file.bed
##bam2wig
samtools mpileup -BQ0 file.sorted.bam | perl -pe '($c, $start, undef, $depth) = split;if ($c ne $lastC || $start != $lastStart+1) {print "fixedStep chrom=$c start=$start step=1 span=1\n";}$_ = $depth."\n";($lastC, $lastStart) = ($c, $start);' | gzip -c > file.wig.gz
modified 3.6 years ago • written 3.6 years ago by David Langenberger

+1, especially for the first one, which avoids installing a specific tool (e.g. fastx-toolkit).

written 3.6 years ago by Manu Prestat
5 votes • venu (Germany) wrote:

Here's a list I often use in my work.

My work always starts with a list of ids for which I need to extract fasta sequences; samtools is always a better choice:

cut -c 2- ids.txt | xargs -n 1 samtools faidx file.fasta > out.fasta
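Note this assumes ids.txt still carries the leading '>' from the fasta headers; the cut -c 2- strips it so samtools faidx receives bare ids. A toy illustration of that step:

```shell
printf '>seqA\n>seqB\n' | cut -c 2-
# -> seqA
#    seqB
```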
modified 3.6 years ago • written 3.6 years ago by venu

Nice! And thanks for the link to your whole list.

written 3.6 years ago by Manu Prestat
I still can't help finding the existence of a list of 'favourite one-liners' a bit paradoxical.

If they are used routinely for your day-to-day tasks, maybe it is time to 'graduate' them to a utility script, which may be easier to maintain and might benefit others?

written 3.6 years ago by Saulius Lukauskas
One-liners are like old-time Lego(tm) bricks: you can build lots of fun and different stuff with them.

If you pack them into tools, they become like current-day Legos: you attach a head, two arms and two legs to a body and voilà, you've "constructed" a robot.

modified 3.6 years ago • written 3.6 years ago by h.mon
5 votes • george.ry (United Kingdom) wrote:

Reproducible subsampling of a FASTQ file:

cat file.fq | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' | tr '\t' '\n' > out.fq

srand() seeds the random number generator, which keeps the subsampling identical when the command is re-run. 0.01 is the fraction of reads to output (here 1%).
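A quick way to convince yourself the pattern behaves sensibly, on a throwaway file (names made up; the exact number of reads kept depends on your awk's random number generator, so only the record structure is asserted here):

```shell
# build 100 fake 4-line reads
for i in $(seq 1 100); do printf '@r%d\nACGT\n+\nIIII\n' "$i"; done > toy.fq
# keep ~50% of the records; a bare pattern in awk prints matching lines
cat toy.fq | paste - - - - | awk 'BEGIN{srand(42)}rand()<0.5' | tr '\t' '\n' > sub.fq
wc -l < sub.fq   # always a multiple of 4
```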

 

Deinterleaving a FASTQ:

cat file.fq | paste - - - - - - - - | tee >(cut -f1-4 | tr '\t' '\n' > out1.fq) | cut -f5-8 | tr '\t' '\n' > out2.fq
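The reverse operation, interleaving the two mate files again, can be sketched the same way (file names assumed; the <( ) process substitution needs bash, as does the tee >( ) above):

```shell
printf '@r1/1\nAAAA\n+\nIIII\n' > out1.fq
printf '@r1/2\nCCCC\n+\nJJJJ\n' > out2.fq
paste <(paste - - - - < out1.fq) <(paste - - - - < out2.fq) | tr '\t' '\n' > interleaved.fq
```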

 

Archiving / unarchiving files:

Not exactly a one-liner, but I find this a very useful way to make files immutable, and most people probably won't have come across it.  An immutable file cannot be modified or written to, deleted, renamed, or moved - even by the superuser.  When I receive data or download it, this is the first thing I do, and it's saved a lot of heartache over the years.

To archive a file ... sudo chattr +i file.fq
To unarchive it again ... sudo chattr -i file.fq
modified 3.6 years ago • written 3.6 years ago by george.ry
4 votes • iraun (Norway) wrote:

There are quite a lot of useful one-liners. This is just a drop in the ocean :).

## Number of reads in a fastq file
cat file.fq | echo $((`wc -l`/4))

## Single line fasta file to multi-line fasta of 60 characters each line
awk -v FS= '/^>/{print;next}{for (i=0;i<NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}}' file

## Sequence length of every entry in a multifasta file
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa
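On a toy file (names made up), the third one prints each header followed by its total sequence length:

```shell
printf '>a\nACGT\nAC\n>b\nTT\n' > demo.fa
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' demo.fa
# -> >a
#    6
#    >b
#    2
```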
modified 3.6 years ago • written 3.6 years ago by iraun
"Single line fasta file to multi-line fasta of 60 characters each line"

how about using

fold -w 60 file

?

written 3.6 years ago by Pierre Lindenbaum

Yes, that one also works :). I didn't remember the fold command!

written 3.6 years ago by iraun

wouldn't this one wrap long header lines as well?

written 3.6 years ago by h.mon

Yes, good point h.mon, the fold command will also split sequence names longer than 60 characters.

written 3.6 years ago by iraun

Why don't you just use word count for "Sequence length of fasta file"? (Dropping the newlines first, so they aren't counted.)

grep -v ">" file.fa | tr -d '\n' | wc -c
written 3.6 years ago by David Langenberger
Ah, got it... for every entry

written 3.6 years ago by David Langenberger

Yep :). Of course, if you want the total length, your command is the most straightforward.

written 3.6 years ago by iraun

As long as you don't forget the quotes around ">" ...

cf. A: What Are The Most Common Stupid Mistakes In Bioinformatics? ;-)

written 3.6 years ago by Manu Prestat
2 votes • h.mon (Brazil) wrote:

I've used several "one-liners" from Scriptome.

written 3.6 years ago by h.mon

Although the idea of this post was mainly to gather one-liners built from GNU/Linux tools, this is good to know. Another project of this type is Biopieces.

written 3.6 years ago by Manu Prestat
1 vote • Harshal (London, UK) wrote:

Running fastqc on all the fastq files in multiple sample folders, in parallel:

for i in Sample_*/*.fastq.gz ; do echo echo fastqc $i\|qsub -cwd; done       # creates the commands

for i in Sample_*/*.fastq.gz ; do echo echo fastqc $i\|qsub -cwd; done | sh  # launches the jobs on the cluster
modified 3.6 years ago • written 3.6 years ago by Harshal
Most of these are better done with xargs:

ls Sample_*/*.fastq.gz | xargs -i echo fastqc {} \| qsub -cwd
written 3.6 years ago by lh3
1 vote • David Langenberger (Deutschland) wrote:

Parallelize time-consuming processes on Unix systems

Using mpileup on a whole genome can take forever, so handling each chromosome separately and running them in parallel on several cores will speed up your pipeline. With xargs you can easily achieve this.

Example usage of xargs (-P is the number of parallel processes started - don't use more than the number of cores you have available):

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 24 sh -c "samtools mpileup -BQ0 -d 100000 -uf yourGenome.fa -r {} yourFile.bam | bcftools view -vcg - > tmp.{}.vcf"
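The xargs fan-out itself can be tried with a harmless stand-in for mpileup (chromosome names made up; the output order varies between runs because the jobs are concurrent):

```shell
printf '%s\n' chr1 chr2 chrX | xargs -I {} -P 3 sh -c 'echo "done with {}"'
```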

To merge the results afterwards, you might want to do something like this:

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].vcf >> yourFile.vcf");'
written 3.6 years ago by David Langenberger
Powered by Biostar version 2.3.0