Question: Fast way to extract specific sequences from large fasta
0
gravatar for mnsp088
3 days ago by
mnsp08820
mnsp08820 wrote:

Hi all!

I have ~2k text files, each with ~1k protein names (one protein name per line) and I need to extract the sequences of these proteins from a large master fasta file which contains ~5.5 million sequences. I wrote this code to do this task and it works but it is taking a really long time to process even 1 text file.

for file in `find text_files | grep .txt`;
do
echo $file
file2=$(echo $file | sed -re 's/^.+\///')
cut -c 1- ${file} | xargs -n 1 samtools faidx master_fasta.fs > ./fastas/$file2.fa
done

Any ideas on better ways to go about this? particularly to speed things up.

Thank you for your suggestions.

sequencing fasta • 121 views
ADD COMMENTlink modified 2 days ago by jrj.healey12k • written 3 days ago by mnsp08820
1

You should show examples of the protein names from the protein names files and from the fasta file.

Also, did you edit the find text_files | grep .txt part? I don't think this command would find anything at all.

Anyway, I think

samtools faidx master_fasta.fs $(cut -c 1- ${file}) > ./fastas/$file2.fa

would be faster than

cut -c 1- ${file} | xargs -n 1 samtools faidx master_fasta.fs > ./fastas/$file2.fa

edit: you can also do

samtools faidx -r $file master_fasta.fs > ./fastas/$file2.fa

edit 2: I didn't state clearly, but I believe what is slowing your command is xargs -n 1, which is not necessary anyway.

ADD REPLYlink modified 2 days ago • written 3 days ago by h.mon25k
1

Thank you for your suggestions, and all suggestions below. I would like to report that I have run all of them, and your solution

samtools faidx -r $file master_fasta.fs > ./fastas/$file2.fa

was the quickest. faSomeRecords also sped it up a lot, but this solution was still quicker.

Thanks again!

ADD REPLYlink modified 2 days ago • written 2 days ago by mnsp08820
1

faSomeRecords utility from Jim Kent at UCSC should be a fast way to do this.

Additional options in: Extract reads from fasta file (specific read_names)and make a new fasta file

ADD REPLYlink written 3 days ago by genomax68k
1
gravatar for jrj.healey
2 days ago by
jrj.healey12k
United Kingdom
jrj.healey12k wrote:

Indexing is almost certainly the right way to do this. Even with biopython which will typically be slower, its likely do-able using SeqIO's index() method ( https://biopython.org/wiki/SeqIO ):

from Bio import SeqIO

index = SeqIO.index('bigfasta.fa', 'fasta')
print(index['myfastaID'])

I don't have a massive fasta file to hand to actually test this performance though.

It will likely still be slower than faidx or faSomeRecords implemented in straight up faster languages.

ADD COMMENTlink written 2 days ago by jrj.healey12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 526 users visited in the last hour