Question: makeblastdb.exe Crashes when using a cleaned FASTA file
18 months ago, friedrichlab wrote:

Using bedtools2, I extracted the CDS FASTA from a GFF3 file and its reference FASTA. I then attempted to create a local database using the BLAST command-line tools and the 'makeblastdb' command. This failed due to a large number of duplicated sequences in the file. I then ran the FASTA through the Sequence Cleaner Python script found here:

http://biopython.org/wiki/Sequence_Cleaner

Now, with my newly cleaned FASTA, I am again attempting to create a custom database. In the Windows command prompt, running the command:

makeblastdb -in clear_transcriptome.fa -out clear_transcriptome -dbtype nucl -parse_seqids

Causes a window to open saying "makeblastdb.exe has stopped working".

I have attempted a fresh install of the BLAST command-line tools, and have successfully built a database with a different FASTA file (one that has worked in the past), but this error is still occurring.

Additionally, running the exact same command WITHOUT the '-parse_seqids' option successfully builds a database, but not one that I can blast against (tblastn returns "No alias or index file found for nucleotide database").

For reference, here is the format of the first few sequences in my clear_transcriptome.fa file (the ellipses are only to save space, as this post is long enough already):


>exon::Scaffold2376:18278-18883

NNNNNNNNNNNNNNNN ...

>gene::Scaffold1190:58965-85903_mRNA::Scaffold1190:58965-85903

AGAAGGTGCAGGGCTG ...

>exon::Scaffold2694:84739-84921_CDS::Scaffold2694:84739-84921

ATGAAGTTGAACGTTATA ...

>exon::Scaffold50:750576-750666_CDS::Scaffold50:750576-750666_exon::Scaffold50:750576-750666_CDS::Scaffold50:750576-750666_exon::Scaffold50:750576-750666_CDS::Scaffold50:750576-750666_exon::Scaffold50:750576-750666_CDS::Scaffold50:750576-750666

GAGCAGCACTCAGTAGAA ...

I will admit it is not the cleanest, but that is a result of the sequence cleaner python script.

Any idea as to what about the sequence ids is causing the crashes? Could it be the excessive length of some of them? If so, why would that cause this issue? If any more information is needed let me know.

Thanks in advance.

18 months ago, genomax wrote:

"Could it be the excessive length of some of them?"

That could be one of the issues. You could shorten the headers by using something here: Fasta header trimming
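As an illustration of the header-shortening idea (this is not the script from the linked thread), here is a minimal Python sketch. It assumes that makeblastdb with -parse_seqids treats the first whitespace-delimited token of each header as the sequence ID, so each long header is replaced by a short sequential ID and the original header is kept after a space as the description. The function name `shorten_fasta_headers` and the `seq` prefix are hypothetical, chosen only for this example.

```python
def shorten_fasta_headers(lines, prefix="seq"):
    """Yield FASTA lines with each header given a short, unique ID.

    The original header text is kept after a space, so it becomes the
    description instead of part of the sequence ID. Blank lines are dropped.
    """
    count = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            count += 1
            # e.g. ">seq0000001 exon::Scaffold2376:18278-18883"
            yield f">{prefix}{count:07d} {line[1:].strip()}"
        elif line.strip():
            # sequence line: pass through unchanged
            yield line


def rewrite_fasta(src_path, dst_path, prefix="seq"):
    """Read src_path and write a copy with shortened headers to dst_path."""
    with open(src_path) as fin, open(dst_path, "w") as fout:
        for out_line in shorten_fasta_headers(fin, prefix=prefix):
            fout.write(out_line + "\n")
```

Calling `rewrite_fasta("clear_transcriptome.fa", "short_ids.fa")` (the output name is made up) would produce a file to pass to makeblastdb in place of the original. Because the IDs are sequential, they are also unique, which sidesteps the duplicate-ID problem mentioned in the question.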


That worked perfectly! Thank you very much!

written 18 months ago by friedrichlab