Getting untranslated nucleotide sequences on tblastn standalone
2
1
Entering edit mode
8.6 years ago
vicks ▴ 10

Short version:

Does anybody know how to get untranslated DNA sequences from tblastn in local mode?

Long version:

This is trivial to do online at NCBI, you just select several target sequences and download them, and what you get are the nucleotide sequences.

However, when running tblastn in standalone mode the output is always translated, and I can't find a way of getting the nucleotide sequences directly.

Somebody asked a similar question here, but I don't think it ever got answered (one of the answers claimed it was a problem with the XML output, but indeed I find it with every output I've tried: they either have protein sequences, or no sequences at all).

Naturally, I could extract the nucleotide sequences one by one from the identifier, start, end, and frame, but one would think there should be an easier (automatic) way.

blast tblastn • 3.9k views
ADD COMMENT
1
Entering edit mode
8.6 years ago

I quickly wrote a tool to retrieve the DNA sequence from a BLAST+XML output: https://github.com/lindenb/jvarkit/wiki/Biostar160470 (note that I didn't pay attention to the correct position for the junctions where there is a gap )

e.g.:

cat roxan.fa | ${bin.dir}/tblastn -db blastdb -outfmt 5 | java -jar biostar160470.jar -p ${bin.dir} -d blastdb| xmllint --format -

Output:

(...)
              <Hsp>
                <Hsp_num>5</Hsp_num>
                <Hsp_bit-score>31.9574</Hsp_bit-score>
                <Hsp_score>71</Hsp_score>
                <Hsp_evalue>0.000226217</Hsp_evalue>
                <Hsp_query-from>520</Hsp_query-from>
                <Hsp_query-to>575</Hsp_query-to>
                <Hsp_hit-from>1711</Hsp_hit-from>
                <Hsp_hit-to>1860</Hsp_hit-to>
                <Hsp_query-frame>0</Hsp_query-frame>
                <Hsp_hit-frame>1</Hsp_hit-frame>
                <Hsp_identity>16</Hsp_identity>
                <Hsp_positive>27</Hsp_positive>
                <Hsp_gaps>6</Hsp_gaps>
                <Hsp_align-len>56</Hsp_align-len>
                <Hsp_qseq>MGEFRLCDRLQKGKACPDGDKCRCAHGQEELNEWLDRREVLKQKLAKARKDMLLCP</Hsp_qseq>
                <Hsp_hseq>VGSYYLCKDMINKQDCKYGDNCTFAYHQEEIDVWTEERK------GTLNRDLLFDP</Hsp_hseq>
                <Hsp_midline>+G + LC  +   + C  GD C  A+ QEE++ W + R+          +D+L  P</Hsp_midline>
                <Hsp_hit-DNA>GTGGGCTCCTACTACCTGTGCAAAGACATGATTAACAAGCAGGACTGTAAGTACGGGGATAACTGCACCTTCGCCTACCATCAGGAGGAGATCGACGTGTGGACCGAGGAGCGGAAG------------------CTGCTCTTCGACCCG</Hsp_hit-DNA>
              </Hsp>
              <Hsp>
                <Hsp_num>6</Hsp_num>
                <Hsp_bit-score>22.3274</Hsp_bit-score>
                <Hsp_score>46</Hsp_score>
                <Hsp_evalue>0.215374</Hsp_evalue>
                <Hsp_query-from>22</Hsp_query-from>
                <Hsp_query-to>62</Hsp_query-to>
                <Hsp_hit-from>3316</Hsp_hit-from>
                <Hsp_hit-to>3435</Hsp_hit-to>
                <Hsp_query-frame>0</Hsp_query-frame>
                <Hsp_hit-frame>1</Hsp_hit-frame>
                <Hsp_identity>15</Hsp_identity>
                <Hsp_positive>19</Hsp_positive>
                <Hsp_gaps>1</Hsp_gaps>
                <Hsp_align-len>41</Hsp_align-len>
                <Hsp_qseq>HEAPWTNLTPSWRRPTHRTTVPLAVLRNQPPRQSPACPTLP</Hsp_qseq>
                <Hsp_hseq>HQAAPSPLRPCPSSPHHRPGVRTQAHVLQPP-EAPLKPGLP</Hsp_hseq>
                <Hsp_midline>H+A  + L P    P HR  V       QPP ++P  P LP</Hsp_midline>
                <Hsp_hit-DNA>CATCAGGCAGCCCCCAGCCCCCTGAGGCCCTGTCCATCTTCTCCCCACCACCGCCCCGGTGTGCGTACCCAGGCGCACGTGCTGCAGCCCCCG---GCCCCGCTGAAACCTGGGCTGCCC</Hsp_hit-DNA>
              </Hsp>
ADD COMMENT
0
Entering edit mode
8.6 years ago
5heikki 11k

You could parse identifier, range and strand info into a file and then use blastdbcmd?

 -entry_batch <File_In>
   Input file for batch processing (Format: one entry per line, seq id
   followed by optional space-delimited specifier(s)
   [range|strand|mask_algo_id]

 -outfmt <String>
   Output format, where the available format specifiers are:
           %f means sequence in FASTA format
           %s means sequence data (without defline)
           %a means accession
           %g means gi
           %o means ordinal id (OID)
           %i means sequence id
           %t means sequence title
           %l means sequence length
           %h means sequence hash value
           %T means taxid
           %e means membership integer
           %L means common taxonomic name
           %S means scientific name
           %P means PIG
           %m means sequence masking data.
              Masking data will be displayed as a series of 'N-M' values
              separated by ';' or the word 'none' if none are available.
       If '%f' is specified, all other format specifiers are ignored.
       For every format except '%f', each line of output will correspond
       to a sequence.
   Default = `%f'
ADD COMMENT

Login before adding your answer.

Traffic: 1809 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6