Forum:Linearize fasta files
4
6
Entering edit mode
5.2 years ago

From time to time it is necessary to linearize a multiline fasta file.

>in1
AAAA
AAAA
AA
>in2
BBBB
BBBB
B

becomes

>in1
AAAAAAAAAA
>in2
BBBBBBBBB

The fastes way I came up is to use seqkitfor it:

$ seqkit seq -w 0 input.fa

What is the fastest scripting method you are aware? This is what I've found:

$ LC_ALL=C awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" '$0 {$1=">"$1"\n"; print}' input.fa

Essentially I'm using awk to change the field and record separators. It's fast but still slower as seqkit for large files. Interestingly changing the separators is something where mawk seems to be very, very slow :(

fin swimmer

EDIT:

For those who like to compare their solution. I've used Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz from ensembl as the input.

Here are some benchmarking results including seqtk as suggested by shenwei356 :

$ time (seqkit seq -w 0 Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
( seqkit seq -w 0  > /dev/null; )  1,57s user 0,82s system 109% cpu 2,175 total

$ time (seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
( seqtk seq  > /dev/null; )  2,22s user 0,59s system 99% cpu 2,817 total

$ time (LC_ALL=C awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" '$0 {$1=">"$1"\n"; print}' Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
( LC_ALL=C awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS=""   > /dev/null; )  5,58s user 1,58s system 99% cpu 7,180 total

$ time (awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" '$0 {$1=">"$1"\n"; print}'  Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
( awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" '$0 {$1=">"$1"\n"; print}'  )  47,43s user 1,62s system 99% cpu 49,099 total
fasta code-golf • 6.5k views
ADD COMMENT
0
Entering edit mode

The speed of that awk command is very impressive. It beats bioawk (bioawk -c fastx '{print ">"$name; print $seq}') with a clear margin. I very much doubt that a faster "scripting" solution exist. Maybe some perl one-liner..

Edit. comparison of vanilla seqtk and seqtk made with -O3 ("optimized") instead of -O2. "Optimized" is faster every time..

$ time (./seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real    0m2.374s
user    0m1.565s
sys 0m0.809s
$ time (./seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real    0m2.228s
user    0m1.574s
sys 0m0.654s
$ time (./seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real    0m2.238s
user    0m1.571s
sys 0m0.667s
$ time (./optimized/seqtk/seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real    0m2.207s
user    0m1.567s
sys 0m0.641s
$ time (./optimized/seqtk/seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real    0m2.213s
user    0m1.561s
sys 0m0.652s
$ time (./optimized/seqtk/seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
real    0m2.204s
user    0m1.579s
sys 0m0.625s
ADD REPLY
6
Entering edit mode
5.2 years ago
GenoMax 141k

Program versions used:

BBMap - v. 38.32
Seqtk - v. 1.3-r106
Seqkit - v. 0.8.1
Perl - v. 5.16.3
Python - v. 3.6.6
sed - v. 2.2.2

$ time (cat Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m1.050s
user    0m0.002s
sys     0m1.045s

With BBMap - reformat.sh

$ time reformat.sh -Xmx40g in=Homo_sapiens.GRCh38.dna.primary_assembly.fa fastawrap=0)
java -ea -Xmx40g -cp bbmap/current/ jgi.ReformatReads -Xmx40g in=Homo_sapiens.GRCh38.dna.primary_assembly.fa fastawrap=0
Executing jgi.ReformatReads [-Xmx40g, in=Homo_sapiens.GRCh38.dna.primary_assembly.fa, fastawrap=0]

No output stream specified.  To write to stdout, please specify 'out=stdout.fq' or similar.
Input is being processed as unpaired
Input:                          194 reads               3099750718 bases
Output:                         194 reads (100.00%)     3099750718 bases (100.00%)

Time:                           14.052 seconds.
Reads Processed:         194    0.01k reads/sec
Bases Processed:       3099m    220.59m bases/sec

real    0m14.216s
user    0m44.320s
sys     0m4.096s

To provide a reference other program options discussed above (on same hardware):

From @shenwei

$ time (seqkit seq -w 0 Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m4.022s
user    0m2.387s
sys     0m1.868s

$ time (seqkit seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m5.135s
user    0m3.530s
sys     0m1.843s

From @Heng Li

$ time (seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m4.682s
user    0m3.016s
sys     0m1.417s

From @Pierre

$ time (awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m28.235s
user    0m26.903s
sys     0m1.331s

$ time (LC_ALL=C awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m27.308s
user    0m26.014s
sys     0m1.288s

$ time (LC_ALL=C awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < Homo_sapiens.GRCh38.dna.primary_assembly.fa | tr "\t" "\n"  > /dev/null)


Note: This is done to convert linearized fasta back into proper format.


real    0m33.568s
user    0m33.478s
sys     0m4.654s

From @finswimmer

$  time (awk -v RS=">" -v FS="\n" -v ORS="\n" -v OFS="" '$0 {$1=">"$1"\n"; print}'  Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    1m3.129s
user    0m59.192s
sys     0m3.928s

$  time (sed -e 's/\(^>.*$\)/#\1#/' Homo_sapiens.GRCh38.dna.primary_assembly.fa | tr -d "\r" | tr -d "\n" | sed -e 's/$/#/' | tr "#" "\n" | sed -e '/^$/d' > /dev/null)

real    0m27.813s
user    0m28.834s
sys     0m18.141s

From @Jorge Amigo

$ time (perl -pe 'chomp unless /^>/' < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m17.720s
user    0m15.785s
sys     0m1.925s

From @jrj.healey (needs python v.3.6+)

$ time (python3 -c 'import sys;from Bio import SeqIO; [print(f">{r.id}\n{r.seq}") for r in SeqIO.parse(sys.argv[1], "fasta")];' Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m40.869s
user    0m30.741s
sys     0m9.864s
ADD COMMENT
0
Entering edit mode

Would be interesting to see how replacment of awk by mawk performs. The latter typically gives me notable speed improvements for these kinds of operations.

ADD REPLY
0
Entering edit mode

Wow, what a nice comparison. Unfortunately I must say that my perl suggestion doesn't work as expected:

$ time (perl -pe 'chomp unless /^>/' < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

does indeed join each header to the end of the previous sequence. This very slightly slower alternative should be considered:

$ time (perl -pe 'if (/^>/) { $. > 1 and print "\n" } else { chomp }' < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)
ADD REPLY
4
Entering edit mode
5.2 years ago

seqtk seq input.fa (1.3-r106 68752fd) is faster than seqkit seq -w 0 input.fa (v0.10.0) in my test on both SSD and HDD.

Version:

OS & Hardware:

  • OS : Linux BioS 4.19.20-1-MANJARO
  • CPU: AMD Ryzen 7 2700X Eight-Core Processor, 3.7 GHz
  • SSD: Samsung 970EVO 500G NVMe SSD
  • HDD: 6T SEAGATE BarraCuda Pro 7200RPM 256M SATA3
  • RAM: 48GB

Note: Clearing RAM memory cache before running every command.

su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

SSD:

$ time (cat Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m1.231s
user    0m0.004s
sys     0m0.513s

$ time (seqkit seq -w 0 Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m2.487s
user    0m1.368s
sys     0m0.974s

$ time (seqkit seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m3.408s
user    0m2.340s
sys     0m0.871s

$ time (seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m1.813s
user    0m1.035s
sys     0m0.682s

# seqtk before pull#123
$ time (seqtk0 seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m3.565s
user    0m2.837s
sys     0m0.647s

HDD:

$ time (cat Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m14.144s
user    0m0.009s
sys     0m0.750s

$ time (seqkit seq -w 0 Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m15.472s
user    0m1.459s
sys     0m0.941s

$ time (seqkit seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m16.196s
user    0m2.454s
sys     0m0.875s

$ time (seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m14.224s
user    0m1.205s
sys     0m0.734s

# seqtk before pull#123
$ time (seqtk0 seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null)

real    0m14.212s
user    0m3.007s
sys     0m0.736s

ADD COMMENT
1
Entering edit mode

You can easily tell that the patch was worth it by comparing the user times of seqtk before and after. \o/

ADD REPLY
0
Entering edit mode

( ̄ へ ̄ )Yes, the patch is worth it. Thank you Fabian for this great patch!

ADD REPLY
0
Entering edit mode
2.6 years ago
bacon4000 ▴ 80

Good info here!

FYI, I recently wrote a tool to convert FASTA and FASTQ to linearized TSV, so one can use a variety of line-oriented tools like grep, awk, sort, etc. on the stream:

https://github.com/auerlab/biolibc-tools/blob/main/fastx2tsv.c

Converting back to FAST[AQ] format can then be done with a simple awk printf if needed:

# Search sequences only
xzcat file.fq.txz | fastx2tsv | \
    awk -F '\t' '$2 ~ "pattern" { printf("%s\n%s\n%s\n%s\n", $1, $2, $3, $4); }'

Hopefully it will prove useful.

Currently it's not as CPU-efficient as seqtk or seqkit because the biolibc functions use simple stream I/O [getc()/putc()] instead of block I/O. That will be addressed at some point in the future when time permits.

ADD COMMENT
0
Entering edit mode

People may want to rerun the benchmarks without dumping output to /dev/null for more realistic comparisons. When I write the output to my SSD, the CPU time that fastx2tsv wastes copying from one stream buffer to another becomes mostly inconsequential. Which tool runs fastest varies from run to run, but they are typically within a few % of each other, so fastx2tsv, seqkit, and seqtk essential have the same wall time. I'm not flushing the read buffers before each run as some did, but running wc and cat on the same input file beforehand ensures that the first tool to run isn't penalized for the benefit of the others.

time wc -l Homo_sapiens.GRCh38.dna.primary_assembly.fa                          
 51662809 Homo_sapiens.GRCh38.dna.primary_assembly.fa                           
        8.63 real         5.69 user         0.75 sys                            

time cat Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null                
        0.99 real         0.07 user         0.91 sys                            

time fastx2tsv < Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null 
        9.36 real         8.28 user         1.07 sys                            

time seqkit seq -w 0 Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null    
        2.69 real         1.85 user         1.00 sys                            

time seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > /dev/null          
        4.57 real         3.51 user         1.06 sys                            

time cat Homo_sapiens.GRCh38.dna.primary_assembly.fa > cat.fa                   
       29.62 real         0.01 user         2.03 sys                            

time fastx2tsv < Homo_sapiens.GRCh38.dna.primary_assembly.fa > fastx2tsv.tsv    
       47.75 real         8.81 user         2.10 sys                            

time seqkit seq -w 0 Homo_sapiens.GRCh38.dna.primary_assembly.fa > seqkit.fa    
       53.81 real         2.41 user         2.74 sys                            

time seqtk seq Homo_sapiens.GRCh38.dna.primary_assembly.fa > seqtk.fa           
       50.99 real         5.68 user         3.24 sys                            
ADD REPLY
0
Entering edit mode
2.6 years ago
Divon ▴ 230

...and one more option:

genozip Homo_sapiens.GRCh38.dna.primary_assembly.fa
genocat --sequential Homo_sapiens.GRCh38.dna.primary_assembly.fa.genozip

Genozip is primarily a compression tool, so it will not be as fast as awk, but it is highly scalable with cores, so the second step might actually be quite fast if you throw a sufficient number of cores at it. its 1am here in Australia, so even though I'm very tempted, I'm not going to benchmark it now :)

Documentation: https://genozip.com

ADD COMMENT

Login before adding your answer.

Traffic: 2691 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6