Question: Tair Seems To Show Incorrect Annotation For A Spliced Gene
Ritvik wrote, 5.3 years ago:

Hi,

I can't understand why, for the single gene ATMG00060.1, TAIR shows different CDS and cDNA sequences, even though both have the same length and the gene apparently contains no 5' UTR.

Corresponding links are as follows:

Gene ATMG00060.1: http://arabidopsis.org/servlets/TairObject?id=1000647816&type=gene

CDS : http://arabidopsis.org/servlets/TairObject?type=sequence&id=1002472305

cDNA : http://arabidopsis.org/servlets/TairObject?type=sequence&id=2002989388

Can anyone explain what's happening here?

Another very general question about splicing order:

Suppose my gene has two exons:

Exon 1's position is complement[21691:22086], as I have the DNA sequence of the opposite strand.

Exon 2's position is complement[20570:20717], as I have the DNA sequence of the opposite strand.

So which splicing order is correct :

Final spliced mRNA = Reverse complement of (Exon1 + Exon2) or Final spliced mRNA = Reverse complement of (Exon2 + Exon1)

Also, can anyone expand on the reason why in removing alternative splice variants, the one bearing longest CDS is selected for? Is there any relationship between CDS length and mRNA stability?

Tags: mrna, cds, splicing
Istvan Albert (University Park, USA) wrote, 5.3 years ago:

You should put the link to your gene rather than the sequences; it is too hard to see whether they are right or not. Edit your post, remove the sequences, and add links to the gene.

In general one has to be cautious with these terms, as they are not always used properly, even by data sources. One can consult the Sequence Ontology for reference, where the definition of CDS is:

A contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.

http://www.sequenceontology.org/browser/current_svn/term/SO:0000316

The definition of mRNA is:

Messenger RNA is the intermediate molecule between DNA and protein. It includes UTR and coding sequences. It does not contain introns.

http://www.sequenceontology.org/browser/current_svn/term/SO:0000234
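
The Sequence Ontology definition of CDS quoted above can be turned into a quick sanity check on a downloaded sequence. The following is only a minimal sketch: the function name `looks_like_cds` is made up for illustration, treating ATG as the only start codon is a simplifying assumption, and the length-divisible-by-three test is an extra check that is not part of the quoted definition. Organellar genes in particular may fail it for legitimate reasons (for example non-ATG starts or RNA editing), so a failure is a hint to look closer and not proof of a wrong annotation.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption for this sketch: only ATG counts as a start codon.
START_CODONS = {"ATG"}


def looks_like_cds(seq: str) -> bool:
    """Return True if seq begins with a start codon, ends with a stop
    codon, and has a length that is a multiple of three."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    return seq[:3] in START_CODONS and seq[-3:] in STOP_CODONS


print(looks_like_cds("ATGGCCTAA"))    # True
print(looks_like_cds("GGATGGCCTAA"))  # False: leading bases break the frame
```

Running the same check on the sequence labelled CDS and the one labelled cDNA can show which of the two is actually reported in coding orientation.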


Thanks for taking the time to explain! I thought that if I gave all the information in the post itself, my question would be much easier to understand. OK, I will now try to reframe the question.

Reply by Ritvik, 5.3 years ago
Istvan Albert wrote, 5.3 years ago:

This is not about the correct splicing order but about what the terms CDS and mRNA actually mean.

As you note, of course, this can be greatly misleading and confusing, and most likely millions of research dollars go to waste due to the errors it causes.

Usually bioinformatics representations that work off of coordinates will produce outputs that match the forward strand: for example, the start coordinate is always the smaller number, even though the actual start in the biological sense may be the higher coordinate. Representations that are sequence oriented (like mRNA) obey the correct directionality.

In this case it appears that the TAIR system will produce CDS in the coordinate representation whereas the mRNA represents the actual product.
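
The difference between the coordinate view and the sequence view can be sketched on a toy example. The genome string, exon coordinates, and function names below are invented for illustration; coordinates are assumed to be 1-based, inclusive, and given on the forward strand. For a minus-strand gene, joining the forward-strand exon segments in ascending coordinate order and reverse complementing the result gives the same sequence as reverse complementing each exon and joining them in biological order (highest coordinates first).

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string (A, C, G, T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def spliced_transcript(genome: str, exons, strand: str) -> str:
    """Join exons given as (start, end) pairs, 1-based inclusive,
    forward-strand coordinates, and orient the result by strand."""
    joined = "".join(genome[s - 1:e] for s, e in sorted(exons))
    return reverse_complement(joined) if strand == "-" else joined


# Toy data, not real Arabidopsis coordinates.
genome = "TTTTGGGGAAAACCCC"
exons = [(11, 14), (3, 6)]  # biological exon 1 has the higher coordinates

transcript = spliced_transcript(genome, exons, "-")
print(transcript)  # GGTTCCAA

# Same result when each exon is reverse complemented separately
# and the pieces are joined in biological (5' to 3') order.
per_exon = "".join(reverse_complement(genome[s - 1:e]) for s, e in exons)
print(per_exon == transcript)  # True
```

A sequence that was joined in coordinate order but never reverse complemented has the same length as the true transcript while reading as a different sequence, which is one way two same-length records for a minus-strand gene can disagree.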

In general I try to avoid working with CDS as it is almost never fully clear what someone means by that.


Thanks again for replying! If mRNAs are deemed to be more accurately represented, then what is the best method to extract CDS information from a gene on a genomic level?

Actually, I was trying to extract CDS and mRNA information from a chromosome GenBank file, but for fewer than 5% of the genes the sequences didn't match, like the one in this question. Is 5% an acceptable error rate, or am I doing something fundamentally wrong here?

Reply by Ritvik, 5.3 years ago

With biology we always have to be careful with the terminology, so this all comes down to what the word CDS actually means. The problem is usually (as above) that a site like TAIR gives you the CDS but does not tell you what CDS means in their interpretation.

Then, as is almost always the case, there are clearly cases where the data does not seem to match. These could be errors, or some type of conflicting information (that is not shown) forced a decision that ends up diverging.

Often it is easier to operate on coordinates rather than sequences, as in those cases you can better see what each file is supposed to represent. So I would recommend finding the coordinates and using either bedtools getfasta if you have a BED12 file or the gffread program if you have a GFF file to extract the sequences that you need.
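
To make the coordinate-based approach concrete, here is a small hand-rolled sketch of what such an extraction involves. It is not how bedtools or gffread are implemented, and for real data those tools are the better choice. The function name `cds_from_gff`, the toy genome, and the IDs `tx1` and `tx2` are all hypothetical. It assumes tab-separated GFF3-style lines with a `Parent` attribute on each CDS feature and 1-based inclusive coordinates.

```python
from collections import defaultdict


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def cds_from_gff(gff_lines, genome):
    """Collect CDS segments per Parent and return {parent: spliced CDS}.

    genome is a dict mapping sequence id to its forward-strand sequence.
    """
    segments = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        segments[(attrs.get("Parent", "unknown"), seqid, strand)].append((start, end))

    result = {}
    for (parent, seqid, strand), segs in segments.items():
        # Join in ascending forward-strand order, then orient by strand.
        joined = "".join(genome[seqid][s - 1:e] for s, e in sorted(segs))
        result[parent] = reverse_complement(joined) if strand == "-" else joined
    return result


# Toy input; identifiers and coordinates are made up.
GENOME = {"chrM": "TTTTGGGGAAAACCCC"}
GFF = [
    "##gff-version 3",
    "\t".join(["chrM", "toy", "CDS", "11", "14", ".", "-", "0", "ID=c1;Parent=tx1"]),
    "\t".join(["chrM", "toy", "CDS", "3", "6", ".", "-", "0", "ID=c2;Parent=tx1"]),
    "\t".join(["chrM", "toy", "CDS", "1", "4", ".", "+", "0", "ID=c3;Parent=tx2"]),
]

print(cds_from_gff(GFF, GENOME))  # {'tx1': 'GGTTCCAA', 'tx2': 'TTTT'}
```

Working from the coordinates this way makes the strand handling explicit, so a mismatch between two downloaded sequences can be traced back to either the coordinates or the orientation step.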

Reply by Istvan Albert, 5.3 years ago

OK, I will try what you have suggested. Once again, thanks for your help!

Reply by Ritvik, 5.3 years ago

Hello,

I apologise for reviving a somewhat old topic.

If I use gffread for genomic features labelled as existing on the negative strand, will gffread find the reverse complement automatically when extracting the sequence from the fasta file, or will I have to implement an additional step to get that?

Reply by Thomas Bradley, 2.9 years ago