Question: Linearize fasta files
3
gravatar for finswimmer
10 months ago by
finswimmer13k
Germany
finswimmer13k wrote:

From time to time it is necessary to linearize a multiline fasta file.

>in1
AAAA
AAAA
AA
>in2
BBBB
BBBB
B

becomes

>in1
AAAAAAAAAA
>in2
BBBBBBBBB

The fastes way I came up is to use seqkitfor it:

$ seqkit seq -w 0 input.fa

What is the fastest scripting method you are aware? This is what I've found:

$ LC_ALL=C awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" '$0 {$1=">"$1"\n"; print}' input.fa

Essentially I'm using awk to change the field and record separators. It's fast but still slower as seqkit for large files. Interestingly changing the separators is something where mawk seems to be very, very slow :(

fin swimmer

EDIT:

For those who like to compare their solution. I've used Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz from ensembl as the input.

Here are some benchmarking results including seqtk as suggested by shenwei356 :

$ time (seqkit seq -w 0 Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
( seqkit seq -w 0  > /dev/null; )  1,57s user 0,82s system 109% cpu 2,175 total

$ time (seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
( seqtk seq  > /dev/null; )  2,22s user 0,59s system 99% cpu 2,817 total

$ time (LC_ALL=C awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" '$0 {$1=">"$1"\n"; print}' Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
( LC_ALL=C awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS=""   > /dev/null; )  5,58s user 1,58s system 99% cpu 7,180 total

$ time (awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" '$0 {$1=">"$1"\n"; print}'  Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
( awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" '$0 {$1=">"$1"\n"; print}'  )  47,43s user 1,62s system 99% cpu 49,099 total
code golf fasta • 661 views
ADD COMMENTlink modified 10 months ago by genomax75k • written 10 months ago by finswimmer13k

The speed of that awk command is very impressive. It beats bioawk (bioawk -c fastx '{print ">"$name; print $seq}') with a clear margin. I very much doubt that a faster "scripting" solution exist. Maybe some perl one-liner..

Edit. comparison of vanilla seqtk and seqtk made with -O3 ("optimized") instead of -O2. "Optimized" is faster every time..

$ time (./seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real    0m2.374s
user    0m1.565s
sys 0m0.809s
$ time (./seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real    0m2.228s
user    0m1.574s
sys 0m0.654s
$ time (./seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real    0m2.238s
user    0m1.571s
sys 0m0.667s
$ time (./optimized/seqtk/seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real    0m2.207s
user    0m1.567s
sys 0m0.641s
$ time (./optimized/seqtk/seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real    0m2.213s
user    0m1.561s
sys 0m0.652s
$ time (./optimized/seqtk/seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real    0m2.204s
user    0m1.579s
sys 0m0.625s
ADD REPLYlink modified 10 months ago • written 10 months ago by 5heikki8.6k
2
gravatar for shenwei356
10 months ago by
shenwei3565.0k
China
shenwei3565.0k wrote:

seqtk seq input.fa (1.3-r106 68752fd) is faster than seqkit seq -w 0 input.fa (v0.10.0) in my test on both SSD and HDD.

Version:

OS & Hardware:

  • OS : Linux BioS 4.19.20-1-MANJARO
  • CPU: AMD Ryzen 7 2700X Eight-Core Processor, 3.7 GHz
  • SSD: Samsung 970EVO 500G NVMe SSD
  • HDD: 6T SEAGATE BarraCuda Pro 7200RPM 256M SATA3
  • RAM: 48GB

Note: Clearing RAM memory cache before running every command.

su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

SSD:

$ time (cat Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m1.231s
user    0m0.004s
sys     0m0.513s

$ time (seqkit seq -w 0 Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m2.487s
user    0m1.368s
sys     0m0.974s

$ time (seqkit seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m3.408s
user    0m2.340s
sys     0m0.871s

$ time (seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m1.813s
user    0m1.035s
sys     0m0.682s

# seqtk before pull#123
$ time (seqtk0 seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m3.565s
user    0m2.837s
sys     0m0.647s

HDD:

$ time (cat Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m14.144s
user    0m0.009s
sys     0m0.750s

$ time (seqkit seq -w 0 Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m15.472s
user    0m1.459s
sys     0m0.941s

$ time (seqkit seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m16.196s
user    0m2.454s
sys     0m0.875s

$ time (seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m14.224s
user    0m1.205s
sys     0m0.734s

# seqtk before pull#123
$ time (seqtk0 seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m14.212s
user    0m3.007s
sys     0m0.736s

ADD COMMENTlink modified 10 months ago • written 10 months ago by shenwei3565.0k
1

You can easily tell that the patch was worth it by comparing the user times of seqtk before and after. \o/

ADD REPLYlink written 10 months ago by kloetzl1.1k

( ̄ へ ̄ )Yes, the patch is worth it. Thank you Fabian for this great patch!

ADD REPLYlink written 10 months ago by shenwei3565.0k
2
gravatar for genomax
10 months ago by
genomax75k
United States
genomax75k wrote:

Program versions used:

BBMap - v. 38.32
Seqtk - v. 1.3-r106
Seqkit - v. 0.8.1
Perl - v. 5.16.3
Python - v. 3.6.6
sed - v. 2.2.2

$ time (cat Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m1.050s
user    0m0.002s
sys     0m1.045s

With BBMap - reformat.sh

$ time reformat.sh -Xmx40g in=Homo_sapiens.GRCh38.dna.primary_assembly.fa fastawrap=0)
java -ea -Xmx40g -cp bbmap/current/ jgi.ReformatReads -Xmx40g in=Homo_sapiens.GRCh38.dna.primary_assembly.fa fastawrap=0
Executing jgi.ReformatReads [-Xmx40g, in=Homo_sapiens.GRCh38.dna.primary_assembly.fa, fastawrap=0]

No output stream specified.  To write to stdout, please specify 'out=stdout.fq' or similar.
Input is being processed as unpaired
Input:                          194 reads               3099750718 bases
Output:                         194 reads (100.00%)     3099750718 bases (100.00%)

Time:                           14.052 seconds.
Reads Processed:         194    0.01k reads/sec
Bases Processed:       3099m    220.59m bases/sec

real    0m14.216s
user    0m44.320s
sys     0m4.096s

To provide a reference other program options discussed above (on same hardware):

From @shenwei

$ time (seqkit seq -w 0 Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m4.022s
user    0m2.387s
sys     0m1.868s

$ time (seqkit seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m5.135s
user    0m3.530s
sys     0m1.843s

From @Heng Li

$ time (seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m4.682s
user    0m3.016s
sys     0m1.417s

From @Pierre

$ time (awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m28.235s
user    0m26.903s
sys     0m1.331s

$ time (LC_ALL=C awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m27.308s
user    0m26.014s
sys     0m1.288s

$ time (LC_ALL=C awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < Homo_sapiens.GRCh38.dna.primary_assembly.fa | tr "\t" "\n"  > /dev/null)


Note: This is done to convert linearized fasta back into proper format.


real    0m33.568s
user    0m33.478s
sys     0m4.654s

From @finswimmer

$  time (awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" '$0 {$1=">"$1"\n"; print}'  Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    1m3.129s
user    0m59.192s
sys     0m3.928s

$  time (sed -e 's/\(^>.*$\)/#\1#/' Homo_sapiens.GRCh38.dna.primary_assembly.fa | tr -d "\r" | tr -d "\n" | sed -e 's/$/#/' | tr "#" "\n" | sed -e '/^$/d' > /dev/null)

real    0m27.813s
user    0m28.834s
sys     0m18.141s

From @Jorge Amigo

$ time (perl -pe 'chomp unless /^>/' < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m17.720s
user    0m15.785s
sys     0m1.925s

From @jrj.healey (needs python v.3.6+)

$ time (python3 -c 'import sys;from Bio import SeqIO; [print(f">{r.id}\n{r.seq}") for r in SeqIO.parse(sys.argv[1], "fasta")];' Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m40.869s
user    0m30.741s
sys     0m9.864s
ADD COMMENTlink modified 10 months ago • written 10 months ago by genomax75k

Would be interesting to see how replacment of awk by mawk performs. The latter typically gives me notable speed improvements for these kinds of operations.

ADD REPLYlink written 10 months ago by ATpoint26k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1802 users visited in the last hour