Using bedtools2, I extracted the CDS FASTA from a GFF3 file and its reference FASTA. I then attempted to create a local database using the BLAST command-line tools and the 'makeblastdb' command. This failed due to a large number of duplicated sequences in the file. I then ran the FASTA through the sequence cleaner Python script found here:
Now, with my newly cleaned FASTA, I am again attempting to create a custom database. In the Windows command prompt, running the command:
makeblastdb -in clear_transcriptome.fa -out clear_transcriptome -dbtype nucl -parse_seqids
This causes a window to open with the message "makeblastdb.exe has stopped working".
I have attempted a fresh install of the BLAST command-line tools, and have successfully built a database from a different FASTA file (one that has worked in the past), but the error still occurs with this file.
Additionally, running the exact same command WITHOUT the '-parse_seqids' option successfully builds a database, but not one that I can blast against (tblastn returns "No alias or index file found for nucleotide database").
For reference, here is the format of the first few sequences in my clear_transcriptome.fa file (the ellipses are only to save space, as this post is long enough already):
>exon::Scaffold2376:18278-18883
NNNNNNNNNNNNNNNN ...
>gene::Scaffold1190:58965-85903_mRNA::Scaffold1190:58965-85903
AGAAGGTGCAGGGCTG ...
>exon::Scaffold2694:84739-84921_CDS::Scaffold2694:84739-84921
ATGAAGTTGAACGTTATA ...
>exon::Scaffold50:750576-750666_CDS::Scaffold50:750576-750666_exon::Scaffold50:750576-750666_CDS::Scaffold50:750576-750666_exon::Scaffold50:750576-750666_CDS::Scaffold50:750576-750666_exon::Scaffold50:750576-750666_CDS::Scaffold50:750576-750666
GAGCAGCACTCAGTAGAA ...
I will admit it is not the cleanest, but that is a result of the sequence cleaner python script.
Any idea as to what about the sequence ids is causing the crashes? Could it be the excessive length of some of them? If so, why would that cause this issue? If any more information is needed let me know.
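To make the length question testable, here is a minimal Python sketch of a check I could run on the file. It reports how long the sequence IDs are and whether any are still duplicated, and it can rewrite the headers with short numeric IDs so that makeblastdb can be retried with '-parse_seqids'. The function names, the "seq" prefix, and the 50-character threshold are all my own assumptions for illustration; I have not confirmed that this BLAST version enforces any particular ID length limit.

```python
from collections import Counter


def read_ids(handle):
    """Yield each record's ID: the header text after '>' up to the first whitespace."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def summarize_ids(handle, max_len=50):
    """Summarize ID lengths and duplicates.

    max_len is an assumed threshold for "suspiciously long", not a documented limit.
    """
    ids = list(read_ids(handle))
    counts = Counter(ids)
    return {
        "total": len(ids),
        "longest": max((len(i) for i in ids), default=0),
        "too_long": [i for i in ids if len(i) > max_len],
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
    }


def shorten_ids(in_handle, out_handle, prefix="seq"):
    """Rewrite headers as '>seqN original_header' and return {new_id: original_header}.

    The original header is kept after a space, so it becomes the description
    rather than part of the ID. The 'seq' prefix is a hypothetical choice.
    """
    mapping = {}
    count = 0
    for line in in_handle:
        if line.startswith(">"):
            count += 1
            new_id = f"{prefix}{count}"
            mapping[new_id] = line[1:].strip()
            out_handle.write(f">{new_id} {mapping[new_id]}\n")
        else:
            out_handle.write(line)
    return mapping
```

The idea would be to open clear_transcriptome.fa, pass the file handle to summarize_ids to see how many IDs exceed the threshold, and then write a renamed copy with shorten_ids. If makeblastdb succeeds on the renamed copy with '-parse_seqids', that would point at the IDs rather than the sequences as the cause of the crash.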
Thanks in advance.