Question

Getting untranslated nucleotide sequences on tblastn standalone

1

Entering edit mode

10.1 years ago

vicks ▴ 10

Short version:

Does anybody know how to get untranslated DNA sequences from tblastn in local mode?

Long version:

This is trivial to do online at NCBI, you just select several target sequences and download them, and what you get are the nucleotide sequences.

However, when running tblastn in standalone mode the output is always translated, and I can't find a way of getting the nucleotide sequences directly.

Somebody asked a similar question here, but I don't think it ever got answered (one of the answers claimed it was a problem with the XML output, but indeed I find it with every output I've tried: they either have protein sequences, or no sequences at all).

Naturally, I could extract the nucleotide sequences one by one from the identifier, start, end, and frame, but one would think there should be an easier (automatic) way.

blast tblastn • 4.6k views

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.1 years ago by vicks ▴ 10

Ram · Answer 1 · 2015-10-06

I quickly wrote a tool to retrieve the DNA sequence from a BLAST+XML output: https://github.com/lindenb/jvarkit/wiki/Biostar160470 (note that I didn't pay attention to the correct position for the junctions where there is a gap )

e.g.:

cat roxan.fa | ${bin.dir}/tblastn -db blastdb -outfmt 5 | java -jar biostar160470.jar -p ${bin.dir} -d blastdb| xmllint --format -

Output:

(...)
              <Hsp>
                <Hsp_num>5</Hsp_num>
                <Hsp_bit-score>31.9574</Hsp_bit-score>
                <Hsp_score>71</Hsp_score>
                <Hsp_evalue>0.000226217</Hsp_evalue>
                <Hsp_query-from>520</Hsp_query-from>
                <Hsp_query-to>575</Hsp_query-to>
                <Hsp_hit-from>1711</Hsp_hit-from>
                <Hsp_hit-to>1860</Hsp_hit-to>
                <Hsp_query-frame>0</Hsp_query-frame>
                <Hsp_hit-frame>1</Hsp_hit-frame>
                <Hsp_identity>16</Hsp_identity>
                <Hsp_positive>27</Hsp_positive>
                <Hsp_gaps>6</Hsp_gaps>
                <Hsp_align-len>56</Hsp_align-len>
                <Hsp_qseq>MGEFRLCDRLQKGKACPDGDKCRCAHGQEELNEWLDRREVLKQKLAKARKDMLLCP</Hsp_qseq>
                <Hsp_hseq>VGSYYLCKDMINKQDCKYGDNCTFAYHQEEIDVWTEERK------GTLNRDLLFDP</Hsp_hseq>
                <Hsp_midline>+G + LC  +   + C  GD C  A+ QEE++ W + R+          +D+L  P</Hsp_midline>
                <Hsp_hit-DNA>GTGGGCTCCTACTACCTGTGCAAAGACATGATTAACAAGCAGGACTGTAAGTACGGGGATAACTGCACCTTCGCCTACCATCAGGAGGAGATCGACGTGTGGACCGAGGAGCGGAAG------------------CTGCTCTTCGACCCG</Hsp_hit-DNA>
              </Hsp>
              <Hsp>
                <Hsp_num>6</Hsp_num>
                <Hsp_bit-score>22.3274</Hsp_bit-score>
                <Hsp_score>46</Hsp_score>
                <Hsp_evalue>0.215374</Hsp_evalue>
                <Hsp_query-from>22</Hsp_query-from>
                <Hsp_query-to>62</Hsp_query-to>
                <Hsp_hit-from>3316</Hsp_hit-from>
                <Hsp_hit-to>3435</Hsp_hit-to>
                <Hsp_query-frame>0</Hsp_query-frame>
                <Hsp_hit-frame>1</Hsp_hit-frame>
                <Hsp_identity>15</Hsp_identity>
                <Hsp_positive>19</Hsp_positive>
                <Hsp_gaps>1</Hsp_gaps>
                <Hsp_align-len>41</Hsp_align-len>
                <Hsp_qseq>HEAPWTNLTPSWRRPTHRTTVPLAVLRNQPPRQSPACPTLP</Hsp_qseq>
                <Hsp_hseq>HQAAPSPLRPCPSSPHHRPGVRTQAHVLQPP-EAPLKPGLP</Hsp_hseq>
                <Hsp_midline>H+A  + L P    P HR  V       QPP ++P  P LP</Hsp_midline>
                <Hsp_hit-DNA>CATCAGGCAGCCCCCAGCCCCCTGAGGCCCTGTCCATCTTCTCCCCACCACCGCCCCGGTGTGCGTACCCAGGCGCACGTGCTGCAGCCCCCG---GCCCCGCTGAAACCTGGGCTGCCC</Hsp_hit-DNA>
              </Hsp>

Ram · Answer 2 · 2015-10-06

You could parse identifier, range and strand info into a file and then use blastdbcmd?

 -entry_batch <File_In>
   Input file for batch processing (Format: one entry per line, seq id
   followed by optional space-delimited specifier(s)
   [range|strand|mask_algo_id]

 -outfmt <String>
   Output format, where the available format specifiers are:
           %f means sequence in FASTA format
           %s means sequence data (without defline)
           %a means accession
           %g means gi
           %o means ordinal id (OID)
           %i means sequence id
           %t means sequence title
           %l means sequence length
           %h means sequence hash value
           %T means taxid
           %e means membership integer
           %L means common taxonomic name
           %S means scientific name
           %P means PIG
           %m means sequence masking data.
              Masking data will be displayed as a series of 'N-M' values
              separated by ';' or the word 'none' if none are available.
       If '%f' is specified, all other format specifiers are ignored.
       For every format except '%f', each line of output will correspond
       to a sequence.
   Default = `%f'