How To Check Sequence Type While Downloading
2
0
Entering edit mode
13.0 years ago
Mateusz ▴ 70

Hi,

Sometimes I need to download, let's say around 3k sequences, using for this purpose Gi list. I know how to download them, the problem lies elsewhere. This gi list is generated by another script and sometimes among gi's related to protein sequences it contains also gi's of DNA (or perhaps RNA) sequences. Since I am interested only in protein sequences - is there a way to download only protein sequences?

I am using biopython but I haven't seen in SeqIO any variable which could describe type of sequence. Another option could be filtering fasta after download. Ofc I could iterate through whole sequence and check if it contain letters different than ATGCU, but this is not so elegant, and I think it will work slow. Do you know any better solutions?

Thanks in advance!

Cheers, Mateusz

sequence scripting biopython python • 2.5k views
ADD COMMENT
3
Entering edit mode
13.0 years ago

Instead of downloading the FASTA file, you could first download the XML (TinySeq) descriptions from the NCBI. The TSeq_seqtype tag contains the "sequence type" attribute.


 http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
 <TSeqSet>
    <TSeq>
        <TSeq_seqtype value="nucleotide"/>
        <TSeq_gi>25</TSeq_gi>
        <TSeq_accver>X53813.1</TSeq_accver>
        <TSeq_taxid>9771</TSeq_taxid>
        <TSeq_orgname>Balaenoptera musculus</TSeq_orgname>
        <TSeq_defline>Blue Whale heavy satellite DNA</TSeq_defline>
        <TSeq_length>422</TSeq_length>
        <TSeq_sequence>TAGTTATT(...)
    </TSeq>
</TSeqSet>

you can then transform your sequences back to fasta using XSLT:


<xsl:stylesheet xmlns:xsl="&lt;a href=" http:="" www.w3.org="" 1999="" XSL="" Transform"="" rel="nofollow">http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//TSeq[TSeq_seqtype/@value='nucleotide']">
      <xsl:value-of select="concat('&gt;gi|',TSeq_gi,'|',TSeq_accver,'|',TSeq_defline)"/>
      <xsl:text>
</xsl:text>
      <xsl:value-of select="TSeq_sequence"/>
      <xsl:text>
</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

$ xsltproc stylesheet.xsl sequences.xml
>gi|25|X53813.1|Blue Whale heavy satellite DNA
TAGTTATTCAACCTATCCCACTCTCTAGATACCCCTTAGCACGTAAAGGAATATTATTTGGGGGTCCAGCCATGGAGAATAGTTTAGACACTAGGATGAGATAAGGAACACACCCATTCTAAAGAAATCACATTAGGATTCTCTTTTTAAGCTGTTCCTTAAAACACTAGAGTCTTAGAAATCTATTGGAGGCAGAAGCAGTCAAGGGTAGCCTAGGGTTAGGGTTAGGCTTAGGGTTAGGGTTAGGGTACGGCTTAGGGTACTGTTTCGGGGAGGGGTTCAGGTACGGCGTAGGGTATGGGTTAGGGTTAGGGTTAGGGTTAGTGTTAGGGTTAGGGCTCGGTTTAGGGTACGGGTTAGGATTAGGGTACGTGTTAGGGTTAGGGTAGGGCTTAGGGTTAGGGTACGTGTTAGGGTTAGGG
ADD COMMENT
0
Entering edit mode

Seems to be pretty good solution. I assume that I could download it (without saving on hdd), check type, and if protein transform to fasta in the fly and save on hdd.

Thanks. :)

ADD REPLY
3
Entering edit mode
13.0 years ago
Leszek 4.2k

You could fetch XML instead of fasta (contain detailed sequence description and sequence) using Biopython, but xml is a few magnitudes of size bigger! Have a look:

[?]

So I think, it's reasonable to get fasta and write some tiny function to recognize if you deal with dna or protein, let's say by looking at first 20-30 positions.

It's going to be faster than getting 100x more data for sure;)

ADD COMMENT
0
Entering edit mode

I don't know if that is to risky, but it should be efficient to check only first letter, since I am downloading whole sequences it should start with "M".

ADD REPLY
0
Entering edit mode

Checking only first letter is risky, as some proteins are post-processed and leading M is removed.

ADD REPLY
0
Entering edit mode

I've implemented check for first 20 residues (faster than playing around with XML) and it seems to work fine. Thanks!

ADD REPLY
0
Entering edit mode

No problem. In fact, this is how Multiple Sequence Aligners guess whether input is dna or protein sequence (i.e. Muscle). btw: are you using Entrez.EPost? it's much faster. have a look: A: Speed Of Efetch In Biopython

ADD REPLY

Login before adding your answer.

Traffic: 1295 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6