Question

What is the fastest way to extract a sequence ID from a huge multi-FASTA file based on a given sequence?

8.9 years ago

Andrzej Zielezinski 11k

I have a file containing millions of FASTA protein sequences from more than 2000 species. I'm looking for an efficient way (faster than BLAST) to retrieve a protein's ID for a given amino-acid sequence. I know that blastdbcmd can pull an individual sequence record out of a BLAST database based on a given sequence identifier, but it cannot look a record up by its sequence.

Do you know any tools that skip the "alignment building step" and allow for fast retrieval of a FASTA record based on its sequence?

FASTA sequence BLAST • 2.3k views

Updated 16 months ago by Ram 43k • written 8.9 years ago by Andrzej Zielezinski 11k

Ram · Answer 1 · 2015-06-12

8.9 years ago

Brian Bushnell 20k

If you know the exact sequence... grep? First you could reformat the file so that each sequence sits on a single line, using BBMap:

reformat.sh in=millions.fasta out=reformatted.fasta fastawrap=9999999
grep -B 1 'QWERTY' reformatted.fasta

Or you could stream it to avoid writing a new file:

reformat.sh in=millions.fasta out=stdout.fasta fastawrap=9999999 | grep -B 1 'QWERTY'
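If BBMap is not at hand, the same idea can probably be done with plain awk: join the wrapped lines of each record and print the header when the whole sequence equals the query. This is only a sketch. `find_exact` is a made-up helper name, and it assumes headers start with `>` and that the file uses Unix line endings.

```shell
# find_exact QUERY FASTA
# Hypothetical helper: prints the header line of every record whose
# complete sequence (wrapped lines joined together) equals QUERY.
find_exact() {
    awk -v q="$1" '
        /^>/ {                       # a new record starts
            if (hdr != "" && seq == q) print hdr
            hdr = $0; seq = ""
            next
        }
        { seq = seq $0 }             # append a wrapped sequence line
        END { if (hdr != "" && seq == q) print hdr }
    ' "$2"
}
```

Usage would look like `find_exact QWERTY millions.fasta`. Unlike the plain grep above, this matches the whole sequence only, not a substring.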

Updated 16 months ago by Ram 43k • written 8.9 years ago by Brian Bushnell 20k

Probably faster with "^QWERTY$", which matches only when the whole line is the query sequence. But if the aim is to do this for multiple sequences, there are still much faster alternatives, though they probably require indexing.
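For many query sequences, one middle ground that does not need a real index is a single pass over the file with all queries held in an awk hash, so the FASTA file is scanned once no matter how many queries there are. A minimal sketch, assuming a non-empty query file with one sequence per line; `lookup_many` is a made-up name and this is not the indexing approach itself.

```shell
# lookup_many QUERIES FASTA
# Hypothetical helper: QUERIES holds one sequence per line (must not be empty).
# Prints "ID<TAB>sequence" for every record whose full sequence is a query.
lookup_many() {
    awk '
        function emit() { if (id != "" && (seq in want)) print id "\t" seq }
        NR == FNR { if ($0 != "") want[$0]; next }    # first file: load queries
        /^>/      { emit(); id = substr($1, 2); seq = ""; next }
                  { seq = seq $0 }                    # join wrapped lines
        END       { emit() }
    ' "$1" "$2"
}
```

Usage would look like `lookup_many queries.txt millions.fasta > hits.tsv`. The ID is taken as the first whitespace-delimited word of the header, without the leading `>`.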

Updated 16 months ago by Ram 43k • written 8.9 years ago by 5heikki 11k