Question: Fast way to extract specific sequences from large fasta
gravatar for mnsp088
17 months ago by
mnsp08840 wrote:

Hi all!

I have ~2k text files, each with ~1k protein names (one protein name per line) and I need to extract the sequences of these proteins from a large master fasta file which contains ~5.5 million sequences. I wrote this code to do this task and it works but it is taking a really long time to process even 1 text file.

for file in `find text_files | grep .txt`;
echo $file
file2=$(echo $file | sed -re 's/^.+\///')
cut -c 1- ${file} | xargs -n 1 samtools faidx master_fasta.fs > ./fastas/$file2.fa

Any ideas on better ways to go about this? particularly to speed things up.

Thank you for your suggestions.

sequencing fasta • 727 views
ADD COMMENTlink modified 17 months ago by Joe18k • written 17 months ago by mnsp08840

You should show examples of the protein names from the protein names files and from the fasta file.

Also, did you edit the find text_files | grep .txt part? I don't think this command would find anything at all.

Anyway, I think

samtools faidx master_fasta.fs $(cut -c 1- ${file}) > ./fastas/$file2.fa

would be faster than

cut -c 1- ${file} | xargs -n 1 samtools faidx master_fasta.fs > ./fastas/$file2.fa

edit: you can also do

samtools faidx -r $file master_fasta.fs > ./fastas/$file2.fa

edit 2: I didn't state clearly, but I believe what is slowing your command is xargs -n 1, which is not necessary anyway.

ADD REPLYlink modified 17 months ago • written 17 months ago by h.mon31k

Thank you for your suggestions, and all suggestions below. I would like to report that I have run all of them, and your solution

samtools faidx -r $file master_fasta.fs > ./fastas/$file2.fa

was the quickest. faSomeRecords also sped it up a lot, but this solution was still quicker.

Thanks again!

ADD REPLYlink modified 17 months ago • written 17 months ago by mnsp08840

faSomeRecords utility from Jim Kent at UCSC should be a fast way to do this.

Additional options in: Extract reads from fasta file (specific read_names)and make a new fasta file

ADD REPLYlink written 17 months ago by genomax92k
gravatar for Joe
17 months ago by
United Kingdom
Joe18k wrote:

Indexing is almost certainly the right way to do this. Even with biopython which will typically be slower, its likely do-able using SeqIO's index() method ( ):

from Bio import SeqIO

index = SeqIO.index('bigfasta.fa', 'fasta')

I don't have a massive fasta file to hand to actually test this performance though.

It will likely still be slower than faidx or faSomeRecords implemented in straight up faster languages.

ADD COMMENTlink written 17 months ago by Joe18k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1029 users visited in the last hour