Custom Database Has Spaces
5.3 years ago
mackie • 0

Hi everyone,

I am relatively new to DNA sequencing data and have a quick question. I have sequencing data and a custom database that was developed for a previous project in my lab. I have successfully run split_libraries and sorted out all of the good reads from my NGS run. However, when I BLAST the data against the custom database, it only ever gives the first part of the scientific name of the species. I have been looking up solutions to this for the past month, and I now think the database has spaces in the titles, which would explain why the name is cut off at the first space. The person who created this database is no longer in the lab, and I do not want to have to re-create the entire thing. Is there a way to change the header of each sequence in the database file to replace the spaces with underscores, or something similar, so that I can use the same blastn command I have been trying to use? I am assuming there is a sed way to do this, but I haven't been able to figure it out.

Thanks!

NGS blast linux qiime • 1.2k views

Did you try googling sed replace space with underscore? Also, this is not strictly bioinformatics (it's got nothing to do with ngs, it's simply replacing spaces with underscores), so the question might get closed.
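For what it's worth, here is a minimal sketch of the sed approach, assuming the database was built from a plain FASTA file that you still have. The function name and the records in the demonstration are made up for illustration. The substitution is restricted to lines starting with `>`, so sequence lines are left alone:

```shell
# Replace every space in FASTA header lines (those starting with '>')
# with an underscore; all other lines pass through unchanged.
# Reads FASTA on stdin and writes the edited FASTA to stdout.
underscore_headers() {
    sed '/^>/ s/ /_/g'
}

# Tiny demonstration on made-up records:
printf '>seq1 Homo sapiens\nACGT\n>seq2 Mus musculus\nTTGA\n' | underscore_headers
```

On a real file this would be something like `underscore_headers < custom_db.fasta > custom_db_underscored.fasta` (file names are placeholders), after which the BLAST database would have to be rebuilt from the new file.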


Hi, I have tried that, but everything that came up was just about removing spaces from a sentence, not altering an NGS database.


"NGS database" is not a term used in the community to denote any particular idea, so I'm not sure what you're talking about. You seem to be working with a custom BLAST database. I don't think you can modify a database, especially its identifiers, after it is created, so you will need to either recreate the database or tailor your query to fetch all the information the database has on the sequence.
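If it does come to recreating the database and the original FASTA file is gone, a rough sketch with the BLAST+ tools might look like the following. It assumes BLAST+ is installed, that the database is a nucleotide one (since blastn is being used), and that the database names are placeholders. The headers that `blastdbcmd` writes out can differ from the original ones depending on how the database was built, so the dumped file is worth inspecting before rebuilding:

```shell
# Hypothetical database names; adjust to your own setup.
OLD_DB=custom_db
NEW_DB=custom_db_underscored

rebuild_db() {
    # 1. Dump every sequence in the old database back out as FASTA.
    blastdbcmd -db "$OLD_DB" -entry all -outfmt '%f' -out "$OLD_DB.dump.fasta" &&
    # 2. Replace spaces with underscores on header lines only.
    sed '/^>/ s/ /_/g' "$OLD_DB.dump.fasta" > "$NEW_DB.fasta" &&
    # 3. Build a new nucleotide database from the edited FASTA.
    makeblastdb -in "$NEW_DB.fasta" -dbtype nucl -out "$NEW_DB"
}
```

Calling `rebuild_db` runs the three steps in order and stops at the first one that fails; the old database is not touched.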


Yes, sorry, I am referring to a custom BLAST database. I have tried many different outfmt parameters to get the entire title of each sequence in the database, but nothing is working. It could be something completely different from spaces in the database headers, but that is my guess at the issue. I have tried (among other things):

-outfmt '6 std salltitles'
-outfmt '6 salltitles'
-outfmt '6 std stitles'

and none of these are giving all information for the sequences in the database. Am I blindly overlooking something? Thanks!


That's strange. I don't think makeblastdb allows characters it cannot parse, and I don't think it allows duplicate identifiers either. Are you sure there is more to the subject header than is being pulled out? Try all subject accessors, not just salltitles. Try sgi, sacc, and sseqid as well.
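One way to check both points, sketched under the assumption that BLAST+ is installed: print the titles stored in the database directly, then request several subject fields side by side in one tabular run. The database name, query file name, and output file name are placeholders:

```shell
# Hypothetical names; adjust to your own files.
DB=custom_db
QUERY=seqs.fna
# Tabular output (format 6) with several subject-side fields side by side.
FIELDS='6 qseqid sseqid sacc stitle salltitles pident evalue bitscore'

# Print the title stored for the first few database entries.
show_titles() {
    blastdbcmd -db "$DB" -entry all -outfmt '%t' | head -n 5
}

# Run blastn and report every field listed in FIELDS for each hit.
compare_fields() {
    blastn -query "$QUERY" -db "$DB" -outfmt "$FIELDS" -out hits.tsv
}
```

If `show_titles` already shows truncated names, the problem is in the database itself; if the full names are there but missing from `hits.tsv`, the problem is on the output side.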


Yes, I tried a list of 10 or so different options for outfmt 6 and it wasn't working. However, I played around with it yesterday on a sub-sample of sequences, and it seemed to give the entire header if I used blastall and called blastn, rather than just running blastn. Not sure why the normal method wasn't working for me! Thank you to everyone who tried to help me solve this dilemma! :)


It could be that the DB was created with ncbi-blast from a pre-blast+ version and some quirk messes with header retrieval between versions. It's pretty mysterious, you may want to email NCBI with your findings.


I don't know... sometimes replacing spaces with underscores is exactly what I seem to be doing all day...

