Question: Fast way to extract specific sequences from large fasta
gravatar for mnsp088
3 days ago by
mnsp08820 wrote:

Hi all!

I have ~2k text files, each with ~1k protein names (one protein name per line) and I need to extract the sequences of these proteins from a large master fasta file which contains ~5.5 million sequences. I wrote this code to do this task and it works but it is taking a really long time to process even 1 text file.

for file in `find text_files | grep .txt`;
echo $file
file2=$(echo $file | sed -re 's/^.+\///')
cut -c 1- ${file} | xargs -n 1 samtools faidx master_fasta.fs > ./fastas/$file2.fa

Any ideas on better ways to go about this? particularly to speed things up.

Thank you for your suggestions.

sequencing fasta • 121 views
ADD COMMENTlink modified 2 days ago by jrj.healey12k • written 3 days ago by mnsp08820

You should show examples of the protein names from the protein names files and from the fasta file.

Also, did you edit the find text_files | grep .txt part? I don't think this command would find anything at all.

Anyway, I think

samtools faidx master_fasta.fs $(cut -c 1- ${file}) > ./fastas/$file2.fa

would be faster than

cut -c 1- ${file} | xargs -n 1 samtools faidx master_fasta.fs > ./fastas/$file2.fa

edit: you can also do

samtools faidx -r $file master_fasta.fs > ./fastas/$file2.fa

edit 2: I didn't state clearly, but I believe what is slowing your command is xargs -n 1, which is not necessary anyway.

ADD REPLYlink modified 2 days ago • written 3 days ago by h.mon25k

Thank you for your suggestions, and all suggestions below. I would like to report that I have run all of them, and your solution

samtools faidx -r $file master_fasta.fs > ./fastas/$file2.fa

was the quickest. faSomeRecords also sped it up a lot, but this solution was still quicker.

Thanks again!

ADD REPLYlink modified 2 days ago • written 2 days ago by mnsp08820

faSomeRecords utility from Jim Kent at UCSC should be a fast way to do this.

Additional options in: Extract reads from fasta file (specific read_names)and make a new fasta file

ADD REPLYlink written 3 days ago by genomax68k
gravatar for jrj.healey
2 days ago by
United Kingdom
jrj.healey12k wrote:

Indexing is almost certainly the right way to do this. Even with biopython which will typically be slower, its likely do-able using SeqIO's index() method ( ):

from Bio import SeqIO

index = SeqIO.index('bigfasta.fa', 'fasta')

I don't have a massive fasta file to hand to actually test this performance though.

It will likely still be slower than faidx or faSomeRecords implemented in straight up faster languages.

ADD COMMENTlink written 2 days ago by jrj.healey12k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 526 users visited in the last hour