Hi everyone,
I am relatively new to DNA sequencing data and have a quick question. I have sequencing data and a custom database that was developed for a previous project in my lab. I have successfully run split_libraries and sorted out all of the good reads from my NGS run; however, when I BLAST the data against the custom database, it only ever gives the first part of the scientific name of the species. I have been looking up solutions to this for the past month, and I now think the database has spaces in the sequence titles, which would explain why the name breaks at the first space in the scientific name. The person who created this database is no longer in the lab, and I do not want to have to re-create the entire thing. Is there a way to change the heading of each sequence in the database file to replace the spaces with underscores, or something similar, so that I can use the same blastn command I have been trying to use? I am assuming there is a sed way to do this; however, I haven't been able to figure it out.
Thanks!
Did you try googling "sed replace space with underscore"? Also, this is not strictly bioinformatics (it's got nothing to do with NGS; it's simply replacing spaces with underscores), so the question might get closed.
Hi, I have tried that, but everything that came up was just about removing spaces from a sentence, not altering an NGS database.
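For what it's worth, the sentence-level replacement carries over to a FASTA file as long as it is restricted to the header lines (the ones starting with ">"). Below is a minimal sketch in Python rather than sed, assuming the original FASTA file used to build the database is still available. The function names and the file names in the usage comment are hypothetical.

```python
def underscore_headers(lines):
    """Yield FASTA lines, replacing whitespace in header lines with underscores."""
    for line in lines:
        if line.startswith(">"):
            # Header line: collapse each run of whitespace into a single underscore.
            yield ">" + "_".join(line[1:].split()) + "\n"
        else:
            # Sequence line: pass through unchanged.
            yield line


def rewrite_fasta(src_path, dst_path):
    """Write a copy of the FASTA file at src_path with underscored headers."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(underscore_headers(src))


# Hypothetical usage:
# rewrite_fasta("custom_db.fasta", "custom_db_underscored.fasta")
```

The formatted database itself is not edited by this; it would have to be rebuilt from the rewritten file, for example with makeblastdb -in and -dbtype nucl. Keep in mind that joining the whole header into one token turns all of it into the sequence identifier, and some makeblastdb versions may reject very long identifiers.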
"NGS database" is not a term used in the community to denote any particular idea, so I'm not sure what you're talking about. You seem to be working with a custom BLAST database. I don't think you can modify a database, especially its identifiers after it is created, so you will need to either recreate the database or tailor your query to fetch all the information the database has on the sequence.
Yes, sorry, I am referring to a custom BLAST database. I have tried many different outfmt parameters to get the entire title of each sequence in the database; however, nothing is working. It could be something completely different from spaces in the database headers, but that is my guess at the issue. I have tried (among other things):
and none of these are giving all information for the sequences in the database. Am I blindly overlooking something? Thanks!
That's strange. I don't think makeblastdb allows characters it cannot parse, and I don't think it allows duplicate identifiers either. Are you sure there is more to the subject header than is being pulled out? Try all subject accessors, not just salltitles. Try sgi, sacc and sseqid as well.
Yes, I tried typing in a list of 10 or so different options for outfmt 6 and it wasn't working. However, I played around with it yesterday with a sub-sample of sequences and it seemed to give the entire header if I used blastall and called blastn, rather than just running blastn... Not sure why the normal method wasn't working for me! Thank you everyone who tried to help me solve this dilemma! :)
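For anyone landing here later, the subject accessors suggested above can all be requested in a single tabular run, which makes it easy to see which field, if any, carries the full header. This is only a sketch of how such a command could be assembled in Python: the query and database names are placeholders, and the field list is one possible selection of standard outfmt 6 specifiers, not the exact command used in this thread.

```python
import shlex


def build_blastn_command(query, db, fields):
    """Return a blastn argument list requesting the given tabular fields."""
    # Format 6 is tabular output; the field names follow it in one quoted string.
    outfmt = "6 " + " ".join(fields)
    return ["blastn", "-query", query, "-db", db, "-outfmt", outfmt]


# Subject accessors mentioned in the thread, plus a few common columns.
SUBJECT_FIELDS = ["qseqid", "sseqid", "sgi", "sacc", "stitle", "salltitles", "pident", "evalue"]

# "reads.fasta" and "custom_db" are placeholder names.
cmd = build_blastn_command("reads.fasta", "custom_db", SUBJECT_FIELDS)

# Print a copy-pasteable shell command; pass cmd to subprocess.run to execute it.
print(shlex.join(cmd))
```

Comparing the stitle and salltitles columns against the sseqid column for a handful of hits should show whether the rest of the header is stored in the database at all.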
It could be that the DB was created with ncbi-blast from a pre-blast+ version and some quirk messes with header retrieval between versions. It's pretty mysterious, you may want to email NCBI with your findings.
I don't know... sometimes replacing spaces with underscores is exactly what I seem to be doing all day...