How to convert a database from protein to nucleotide
4.9 years ago

Hi! I'm fairly new to UNIX and bioinformatics work. I am taking a class that focuses on using UNIX and BLAST on nucleotide sequences, and we have been using blastn constantly with the databases we made there. I wanted to apply some of what I learned to my own research and realized that my databases are now protein databases, even though the fasta file is ALL nucleotides. Is there a way to convert the database from protein to nucleotide?

blastx blastn bioinformatics unix blast

Because most amino acids are encoded by more than one codon, a protein sequence does not uniquely identify a nucleotide sequence.
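To put a number on that redundancy, here is a minimal sketch that counts how many nucleotide sequences could encode a given peptide under the standard genetic code. The function name count_backtranslations and the example peptide MLS are made up for illustration, and the stop codon is ignored.

```shell
# Count how many distinct nucleotide sequences could encode a peptide,
# assuming the standard genetic code (stop codon not included).
count_backtranslations() {
    printf '%s\n' "$1" | awk '
    BEGIN {
        # number of codons per amino acid in the standard genetic code
        split("A4 R6 N2 D2 C2 Q2 E2 G4 H2 I3 L6 K2 M1 F2 P4 S6 T4 W1 Y2 V4", pairs, " ")
        for (k in pairs) deg[substr(pairs[k], 1, 1)] = substr(pairs[k], 2) + 0
    }
    {
        total = 1
        n = length($0)
        for (i = 1; i <= n; i++) {
            aa = toupper(substr($0, i, 1))
            if (!(aa in deg)) {
                print "unknown residue: " aa | "cat 1>&2"
                exit 1
            }
            total *= deg[aa]   # multiply the codon choices per position
        }
        print total
    }'
}

count_backtranslations MLS   # 1 * 6 * 6 = 36
```

Even a three-residue peptide has dozens of possible encodings, so back-translation can only ever give one plausible candidate, not the original DNA.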


If you only have the index files (i.e. no protein fasta file), then you would first need to use the blastdbcmd utility to recover the fasta sequences.
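A minimal sketch of that recovery step, assuming BLAST+ is installed and on your PATH. The wrapper name dump_blast_db and the output file name are placeholders; the blastdbcmd options used are -db, -dbtype, -entry, -outfmt and -out.

```shell
# Dump every sequence of an existing BLAST database back to FASTA.
# $1 = database name/prefix, $2 = output FASTA path (both placeholders).
dump_blast_db() {
    db=$1
    out=$2
    # -entry all selects every sequence; -outfmt %f requests FASTA output;
    # -dbtype guess lets blastdbcmd work out whether the db is prot or nucl.
    blastdbcmd -db "$db" -dbtype guess -entry all -outfmt %f -out "$out"
}

# Example call (requires BLAST+ on PATH):
# dump_blast_db LM_R8_5081_contig10 recovered.fasta
```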

Then you can use back-translation tools like backtranseq from EMBOSS (http://www.ebi.ac.uk/Tools/st/). Ideally, if you know which genome those proteins are from, you could get the DNA sequence from the source instead.


The genome is from Listeria monocytogenes, and I have the fasta file for a contig of this genome, but for some reason it recognizes it as a protein and not nucleotide file, even though the first couple of lines are:

AGATTCCTTGCGTCAAATTGACTTCGCTAGCAATTAAATTACTAGTTTGTTTTGTTGAAAACAGCTTTCT
GTTTTCTGCCCTGCGATTACCAGTGAGACTTTACGTCTCATTGCTTTTCGTCTTCTTCTTTGTTCAGTTT
TCAAAGGTCAGTTGCTTTGTTAACGCAACTTTTAAATCTTACCATAAAGTTGAAATCACGTCAACAACTA


That is definitely not protein, and if those lines are from the top of the file, it is not in fasta format either (a fasta record starts with a header line beginning with ">").
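Both points can be checked quickly from the command line. The following is a rough sketch using only awk; the function name check_fasta and the 10% cutoff for non-nucleotide letters are arbitrary assumptions, not anything BLAST itself uses.

```shell
# Report whether a file starts with a FASTA header and whether its
# sequence lines look like nucleotides or protein.
check_fasta() {
    awk '
    # remember whether the first non-empty line is a ">" header
    !seen && NF { seen = 1; header = ($0 ~ /^>/) }
    /^>/ { next }                      # skip header lines
    {
        seq = toupper($0)
        gsub(/[^A-Z]/, "", seq)        # keep letters only
        total += length(seq)
        gsub(/[ACGTUN]/, "", seq)      # drop nucleotide letters
        other += length(seq)           # what is left is non-nucleotide
    }
    END {
        printf "header=%s ", (header ? "yes" : "no")
        # the 10% threshold is an arbitrary assumption for this sketch
        if (total == 0)               print "type=empty"
        else if (other / total < 0.1) print "type=nucleotide"
        else                          print "type=protein"
    }' "$1"
}

# Example call:
# check_fasta LM_R8_5081_contig10.fasta
```

If the file reports header=no, adding a ">" header line before the sequence is needed before makeblastdb will treat it as a proper fasta record.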

but for some reason it recognizes it as a protein and not nucleotide file.

What is doing that? Blast? Have you created indexes for this dataset already?


I made the database with:

makeblastdb -dbtype prot -in LM_R8_5081_contig10.fasta -out LM_R8_5081_contig10 -parse_seqids

The report afterwards says that it is a protein database:

New DB title: LM_R8_5081_contig10.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /home/ajt3/Listeria_Work/LM_R8_5081_contig10_test
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B


Nope, I'm dumb, I realized what it is...


Glad you figured the problem out yourself :)


realized that my databases are now protein databases, even though the fasta file is ALL nucleotides.

It sounds like the file you have is a nucleotide fasta, but the db was built from it as a protein database, so the indexed files have the wrong extensions.

Do your db files end with the following?

.phr, .pin, .psq
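One way to answer that question without listing the directory by eye is sketched below. It assumes the usual BLAST+ naming, where protein index files start with "p" (.pin, or a .pal alias file for multi-volume databases) and nucleotide ones start with "n" (.nin, .nal). The function name blast_db_type is made up, and if both kinds exist under the same prefix it reports protein first.

```shell
# Guess the type of a BLAST database from its index file extensions.
# $1 = database path prefix, i.e. the value that was passed to -out.
blast_db_type() {
    prefix=$1
    if [ -e "$prefix.pin" ] || [ -e "$prefix.pal" ]; then
        echo protein
    elif [ -e "$prefix.nin" ] || [ -e "$prefix.nal" ]; then
        echo nucleotide
    else
        echo unknown
        return 1
    fi
}

# Example call:
# blast_db_type /home/ajt3/Listeria_Work/LM_R8_5081_contig10
```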

Remake your database with this command:

makeblastdb -in yourfastafile.fasta -dbtype nucl


and your db will be made with the correct extensions for blastn. Your files will now end with

.nhr, .nin, .nsq
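Putting the rebuild and the extension check together, a minimal sketch might look like this. It assumes BLAST+ is on your PATH and that the database is small enough to be written as a single volume (true for one contig). The function name rebuild_nucl_db and its arguments are placeholders.

```shell
# Rebuild a BLAST database as nucleotide and confirm the index files exist.
# $1 = nucleotide FASTA file, $2 = output database prefix (both placeholders).
rebuild_nucl_db() {
    fasta=$1
    db=$2
    makeblastdb -in "$fasta" -dbtype nucl -out "$db" || return 1
    # a single-volume nucleotide database should now have these three files
    for ext in nhr nin nsq; do
        [ -e "$db.$ext" ] || { echo "missing $db.$ext" >&2; return 1; }
    done
    echo "nucleotide database ready: $db"
}

# Example call (requires BLAST+ on PATH):
# rebuild_nucl_db LM_R8_5081_contig10.fasta LM_R8_5081_contig10
```

The old .phr, .pin and .psq files from the protein build may still be on disk afterwards; they are not needed for blastn and can be deleted once the nucleotide database works.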