Question: 'CDS' but not 'exon' in GFF
gravatar for NielQC
4 weeks ago by
NielQC10 wrote:


I'm working with RNAseq data from A. thaliana. Reads were mapped with STAR 2.6 using the TAIR10.1 genome from NCBI and its annotation in GFF3 format (link).

I have noticed some of the genes have not any 'exon' feature line and go directly from the 'gene' line to the 'CDS' line, and this is a problem since I would like to count the 'exon' features. I have always thought that a 'CDS' has to have a related 'exon'.

In lines below, gene38184 is a tRNA and has 'exon' features, however gene38185 and gene38186 (protein codificant) have not.

NC_000932.1 RefSeq  gene    1717    4347    .   -   .   ID=gene38184;Dbxref=GeneID:1466272;Name=trnK;gbkey=Gene;gene=trnK;gene_biotype=tRNA;locus_tag=ArthCt089
NC_000932.1 RefSeq  tRNA    1717    4347    .   -   .   ID=rna60172;Parent=gene38184;Dbxref=GeneID:1466272;gbkey=tRNA;gene=trnK;product=tRNA-Lys
NC_000932.1 RefSeq  exon    4311    4347    .   -   .   ID=id325622;Parent=rna60172;Dbxref=GeneID:1466272;gbkey=tRNA;gene=trnK;product=tRNA-Lys
NC_000932.1 RefSeq  exon    1717    1751    .   -   .   ID=id325623;Parent=rna60172;Dbxref=GeneID:1466272;gbkey=tRNA;gene=trnK;product=tRNA-Lys
NC_000932.1 RefSeq  gene    2056    3570    .   -   .   ID=gene38185;Dbxref=GeneID:844797;Name=matK;gbkey=Gene;gene=matK;gene_biotype=protein_coding;locus_tag=ArthCp003
NC_000932.1 RefSeq  CDS 2056    3570    .   -   0   ID=cds48182;Parent=gene38185;Dbxref=Genbank:NP_051040.2,GeneID:844797;Name=NP_051040.2;gbkey=CDS;gene=matK;product=maturase K;protein_id=NP_051040.2;transl_table=11
NC_000932.1 RefSeq  gene    5084    6188    .   -   .   ID=gene38186;Dbxref=GeneID:844798;Name=rps16;gbkey=Gene;gene=rps16;gene_biotype=protein_coding;locus_tag=ArthCp004
NC_000932.1 RefSeq  CDS 6149    6188    .   -   0   ID=cds48183;Parent=gene38186;Dbxref=Genbank:NP_051041.1,GeneID:844798;Name=NP_051041.1;gbkey=CDS;gene=rps16;product=ribosomal protein S16;protein_id=NP_051041.1;transl_table=11
NC_000932.1 RefSeq  CDS 5084    5283    .   -   2   ID=cds48183;Parent=gene38186;Dbxref=Genbank:NP_051041.1,GeneID:844798;Name=NP_051041.1;gbkey=CDS;gene=rps16;product=ribosomal protein S16;protein_id=NP_051041.1;transl_table=11

This is a problem since I would like to use 'exon' as a feature to count.

Is this normal? Should I use other feature to obtain the counts?

Thanks in advance.

rna-seq exon gff cds • 176 views
ADD COMMENTlink modified 17 days ago by tdmurphy160 • written 4 weeks ago by NielQC10

I am not a plant guy, but checking these two examples, both are chloroplast genes, so not on the nuclear genome, e.g. gene38186.

ADD REPLYlink written 4 weeks ago by ATpoint12k

You are right, NC_000932.1 is the chloroplast chromosome. The case I describe also happens in the mitochondrial chromosome, NC_037304.1. With the usual proceeding of counting 'exon' features these genes would be lost. In fact I'm not sure if people usually include these chromosomes in the analysis.

ADD REPLYlink written 4 weeks ago by NielQC10

chrM is typically not included in gene expression analysis (at least I do not do it for human and mouse). The reason that you do not have "exon" is probably the prokaryotic origin of these organelles which have operon rather than intron/exon structures.

ADD REPLYlink written 4 weeks ago by ATpoint12k
gravatar for tdmurphy
17 days ago by
tdmurphy160 wrote:

Organelles and prokaryote genomes in both GenBank and RefSeq are typically annotated with only CDS and no corresponding mRNA features, so the GFF3 has no exon features. This is mentioned briefly under "ANNOTATION DATA MODEL" in:

ADD COMMENTlink written 17 days ago by tdmurphy160
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1134 users visited in the last hour