Question: extract accessions from blastdbcmd
1
gravatar for navela78
16 months ago by
navela7850
navela7850 wrote:

My copy of the nr database has about 148 million sequences which i found using blastdbcmd command using the -info option.

However when I try to extract the accession numbers using the following command:

blastdbcmd -db nr -entry all -outfmt "%a"

I am getting much more than 148 million accessions. I am guessing it is because multiple genbank accessions having the exact same sequence will be represented with only one sequence but with multiple accessions in the nr database.

What is the best way to get only one accession per sequence using blastdbcmd? I donn't care which accession it is but the number of accessions returned by blastdbcmd should match the number of sequences in it. How do I do this?

blast blastdbcmd • 1.1k views
ADD COMMENTlink modified 16 months ago by Biostar ♦♦ 20 • written 16 months ago by navela7850

Can I ask why you want to do this? You can find the fasta files (which are used to build the nr database) here. You could get the ID's either way and process them to leave just the first entry in fasta header.

ADD REPLYlink modified 16 months ago • written 16 months ago by genomax70k

Thanks genomax. I just thought it will be easier to extract accessions this way than to write a script. Looks like there is no straight forward way in blastdbcmd to extract one accession per sequence? In that case I can work with the fasta file like you mentioned.

My ultimate goal is to see how blastdbs perform for random access of sequence data. I am developing a software that needs sequences to be quickly extracted given a list of accessions. Any thoughts on this? Should I start a new question on indexing strategies? I am thinking of testing blastdb, Python seqio indexing, and samtools (I read somewhere that samtools can do this). Not sure which one will be faster. My db will have about 500 million - 1 billion sequences.

ADD REPLYlink modified 16 months ago • written 16 months ago by navela7850

blastdbcmd to extract one accession per sequence?

Does it matter if there are other homologs in the sequence header? You could either re-write the header on the fly to just include the accession you searched with.

Edit (looks like there are 102 30S ribosomal protein S18):

$ blastdbcmd -db nr -entry AE006448_5 -outfmt "%i" | wc -l
102
$ blastdbcmd -db nr -entry OEU38668.1 -outfmt "%i" | wc -l
102

You get all identifiers if you asked for the fasta sequence with any one of those IDs.

samtools faidx, seqtk subseq, @Matt Shirley's pyfaidx can all be used to rapidly extract sequences from fasta files. There is faSomeRecords from Jim Kent's UCSC utilities that is very fast as well and does not need sequences to be indexed. SeqKit by @shenwei may also be another option.

You can also start a new question if you have thoughts about indexing strategies.

ADD REPLYlink modified 16 months ago • written 16 months ago by genomax70k

Thanks genomax. Sorry for the miscommunication ... but i was looking to extract one accession per sequence for all sequences in nr. Looks like there is no blastdbcmd option for it ... so like you mentioned I created it on the fly like so:

blastdbcmd -db nr -entry all -outfmt "%f" | awk '/>(.+) / {print substr($1,2);}'

this worked for me for now ... however it is slow because it is parsing through the sequences as well (outfmt is %f). I looked through the documentation and I don't think there is a way to extract only the def line

Thanks for your pointers on indexing ... I will look into them and start a separate thread if I have questions

ADD REPLYlink modified 16 months ago • written 16 months ago by navela7850

interesting (and confirmed this with our own local nr db ;)) .

I hope it's correct to assume they mean same sequence present in different DBs right, and not as in a sequence that is 100% identical but from different species?

ADD REPLYlink written 16 months ago by lieven.sterck5.5k
1

See for yourself

blastdbcmd -db nr -entry OEU38668.1 -outfmt "%f"

or

blastdbcmd -db nr -entry AE006448_5 -outfmt "%f"
ADD REPLYlink modified 16 months ago • written 16 months ago by genomax70k

was doing that while you were writing your reply :)

and yes, it's thus a combination of both: exact same seq from diff species as well as same entry from diff DBs

ADD REPLYlink modified 16 months ago • written 16 months ago by lieven.sterck5.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 835 users visited in the last hour