What is the fastest way to extract a sequence ID from a huge multi-FASTA file based on a given sequence?
8.9 years ago

I have a file containing millions of protein sequences in FASTA format from more than 2000 species. I'm looking for an efficient way (faster than BLAST) to retrieve the protein ID for a given amino-acid sequence. I know that blastdbcmd can pull an individual sequence record out of a BLAST database given its sequence identifier, but it can't be queried by sequence.

Do you know any tools that skip the "alignment building step" and allow for fast retrieval of a FASTA record based on its sequence?

FASTA sequence BLAST • 2.3k views
8.9 years ago

If you know the exact sequence... grep? First you could reformat the file so that each sequence sits on a single line, using BBMap:

reformat.sh in=millions.fasta out=reformatted.fasta fastawrap=9999999
grep -B 1 'QWERTY' reformatted.fasta

Or you could stream it to avoid writing a new file:

reformat.sh in=millions.fasta out=stdout.fasta fastawrap=9999999 | grep -B 1 'QWERTY'

Probably faster with "^QWERTY$", which also restricts hits to exact full-length matches instead of substrings. But if the aim is to do this for multiple sequences, there are much faster alternatives still, which probably require indexing.
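
To illustrate what "indexing" could look like here, below is a minimal Python sketch, not a tested tool: it reads the FASTA file once, keys each record by a digest of its sequence, and then answers exact-match queries with a dictionary lookup. The function names (read_fasta, seq_key, build_index, lookup) are made up for this example, and it assumes the digests fit in memory and that the ID is the first whitespace-delimited token of the header.

```python
import hashlib
import io
from collections import defaultdict


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def seq_key(seq):
    """16-byte digest of the upper-cased sequence, used only as a compact dict key."""
    return hashlib.blake2b(seq.upper().encode("ascii"), digest_size=16).digest()


def build_index(handle):
    """Map sequence digest -> list of IDs (identical sequences may have several IDs)."""
    index = defaultdict(list)
    for header, seq in read_fasta(handle):
        # Assumption: the ID is the first whitespace-delimited token of the header.
        seq_id = (header.split() or [""])[0]
        index[seq_key(seq)].append(seq_id)
    return index


def lookup(index, query):
    """Return all IDs whose full sequence equals the query (case-insensitive)."""
    return index.get(seq_key(query), [])


if __name__ == "__main__":
    # Tiny in-memory stand-in for millions.fasta
    demo = io.StringIO(
        ">sp1 protein A\nQWE\nRTY\n>sp2 protein B\nAAAA\n>sp3 copy of A\nqwerty\n"
    )
    idx = build_index(demo)
    print(lookup(idx, "QWERTY"))  # ['sp1', 'sp3']
```

Building the index costs one pass over the file; after that, each query is a single hash lookup, so this only pays off when there are many query sequences. It finds exact full-length matches only, so substring searches would still need grep or an aligner.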
