Question: Custom Database Has Spaces
15 months ago by
mackie0 wrote:

Hi everyone,

I am relatively new to DNA sequencing data and have a quick question. I have sequencing data and a custom database that was developed for a previous project in my lab. I have successfully run split_libraries and sorted out all of the good reads from my NGS run. However, when I BLAST the data against the custom database, the output only ever gives the first part of each species' scientific name. I have been looking for a solution for the past month, and I now think the database has spaces in the sequence titles, so the name is cut off at the first space. The person who created this database is no longer in the lab, and I do not want to re-create the entire thing. Is there a way to change the header of each sequence in the database file to replace the spaces with underscores (or something similar), so that I can keep using the same blastn command? I assume there is a way to do this with sed, but I haven't been able to figure it out.


qiime blast linux ngs • 342 views
written 15 months ago by mackie0
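
One possible way to do what the question asks, sketched in Python rather than sed. It assumes the database was built from a plain FASTA file that is still available (the formatted BLAST database files themselves are binary and are not meant to be edited by hand), and the file names are hypothetical. Only header lines (those starting with `>`) are changed, so sequence lines are left untouched.

```python
def underscore_headers(lines):
    """Yield FASTA lines with spaces in header lines replaced by underscores."""
    for line in lines:
        if line.startswith(">"):
            # Drop the line ending and trailing blanks, then replace inner spaces.
            header = line.rstrip("\n").rstrip()
            yield header.replace(" ", "_") + "\n"
        else:
            # Sequence lines pass through unchanged.
            yield line


def rewrite_fasta(src_path, dst_path):
    """Write a copy of src_path whose header lines contain no spaces."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(underscore_headers(src))


# Hypothetical usage; "custom_db.fasta" is an assumed file name:
# rewrite_fasta("custom_db.fasta", "custom_db_nospaces.fasta")
```

The same idea as a sed one-liner would be `sed '/^>/ s/ /_/g' custom_db.fasta > custom_db_nospaces.fasta`, where the `/^>/` address restricts the substitution to header lines.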

Did you try googling "sed replace space with underscore"? Also, this is not strictly bioinformatics (it has nothing to do with NGS; it's simply replacing spaces with underscores), so the question might get closed.

modified 15 months ago • written 15 months ago by RamRS26k

Hi, I have tried that, but everything that came up was about removing spaces from a sentence, not altering an NGS database.

written 15 months ago by mackie0

"NGS database" is not a term used in the community to denote any particular idea, so I'm not sure what you're talking about. You seem to be working with a custom BLAST database. I don't think you can modify a database, especially its identifiers after it is created, so you will need to either recreate the database or tailor your query to fetch all the information the database has on the sequence.

written 15 months ago by RamRS26k
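
If the original FASTA file is lost, recreating the database need not mean starting from nothing: BLAST+ ships `blastdbcmd`, which can dump an existing database back to FASTA, and `makeblastdb`, which builds a database from a FASTA file. The sketch below only assembles the two command lines. The database and file names are hypothetical, a nucleotide database is assumed, and nothing in this thread confirms that `blastdbcmd` reads this particular database correctly.

```python
import shlex


def dump_db_command(db_name, fasta_out):
    """Command that writes every sequence in a BLAST database as FASTA."""
    return ["blastdbcmd", "-db", db_name, "-entry", "all",
            "-outfmt", "%f", "-out", fasta_out]


def rebuild_db_command(fasta_in, db_name):
    """Command that builds a nucleotide BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta_in, "-dbtype", "nucl", "-out", db_name]


# Hypothetical names, for illustration only.
commands = [
    dump_db_command("custom_db", "custom_db_dump.fasta"),
    rebuild_db_command("custom_db_nospaces.fasta", "custom_db_fixed"),
]
for cmd in commands:
    # Print shell-ready command lines. Where BLAST+ is installed, each list
    # could instead be passed to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Between the two steps, the dumped FASTA headers would be edited (for example, spaces replaced with underscores) to produce the input for the rebuild.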

Yes, sorry, I am referring to a custom BLAST database. I have tried many different outfmt parameters to get the entire title of each sequence in the database, but nothing is working. It could be something completely different from spaces in the database headers, but that is my best guess at the issue. I have tried (among other things):

-outfmt '6 std salltitles'
-outfmt '6 salltitles'
-outfmt '6 std stitles'

and none of these are giving all information for the sequences in the database. Am I blindly overlooking something? Thanks!

modified 15 months ago by RamRS26k • written 15 months ago by mackie0
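
One thing that may be worth ruling out with tabular output: `-outfmt 6` separates columns with tabs, and a subject title can itself contain spaces, so any downstream step that splits on whitespace would keep only the first word of the title. Note also that the title specifiers listed by `blastn -help` are `stitle` and `salltitles`; `stitles` (with a trailing s) does not appear to be among them. A minimal sketch of tab-aware parsing, assuming output produced with `-outfmt '6 std stitle'`; the sample line is invented for illustration:

```python
# The 12 columns that "std" expands to, followed by the requested title column.
STD_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
COLUMNS = STD_COLUMNS + ["stitle"]


def parse_hit(line):
    """Split one tabular BLAST line on tabs only, keeping spaces inside fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


# Invented example line, not real BLAST output.
sample = "\t".join([
    "read_1", "ref_42", "99.2", "250", "2", "0",
    "1", "250", "10", "259", "1e-120", "450",
    "Quercus rubra voucher X",
])
hit = parse_hit(sample)
print(hit["stitle"])        # full title, spaces preserved
print(sample.split()[12])   # a whitespace split keeps only the first word
```

If the truncation disappears when the output is split on tabs, the database itself is probably fine and the problem lies in how the results are read afterwards.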

That's strange. I don't think makeblastdb allows characters it cannot parse, and I don't think it allows duplicate identifiers either. Are you sure there is more to the subject header than is being pulled out? Try all the subject accessors, not just salltitles. Try sgi, sacc and sseqid as well.

written 15 months ago by RamRS26k

Yes, I tried typing in a list of 10 or so different options for outfmt 6 and it wasn't working. However, I played around with it yesterday on a sub-sample of sequences, and it seemed to give the entire header if I used blastall and called blastn, rather than just running blastn... Not sure why the normal method wasn't working for me! Thank you to everyone who tried to help me solve this dilemma! :)

written 15 months ago by mackie0

It could be that the DB was created with ncbi-blast from a pre-blast+ version and some quirk messes with header retrieval between versions. It's pretty mysterious, you may want to email NCBI with your findings.

written 15 months ago by RamRS26k

I don't know... sometimes replacing spaces with underscores is exactly what I seem to be doing all day...

written 15 months ago by cschu1812.2k