Question: 'CDS' but not 'exon' in GFF
gravatar for NielQC
22 months ago by
NielQC10 wrote:


I'm working with RNAseq data from A. thaliana. Reads were mapped with STAR 2.6 using the TAIR10.1 genome from NCBI and its annotation in GFF3 format (link).

I have noticed some of the genes have not any 'exon' feature line and go directly from the 'gene' line to the 'CDS' line, and this is a problem since I would like to count the 'exon' features. I have always thought that a 'CDS' has to have a related 'exon'.

In lines below, gene38184 is a tRNA and has 'exon' features, however gene38185 and gene38186 (protein codificant) have not.

NC_000932.1 RefSeq  gene    1717    4347    .   -   .   ID=gene38184;Dbxref=GeneID:1466272;Name=trnK;gbkey=Gene;gene=trnK;gene_biotype=tRNA;locus_tag=ArthCt089
NC_000932.1 RefSeq  tRNA    1717    4347    .   -   .   ID=rna60172;Parent=gene38184;Dbxref=GeneID:1466272;gbkey=tRNA;gene=trnK;product=tRNA-Lys
NC_000932.1 RefSeq  exon    4311    4347    .   -   .   ID=id325622;Parent=rna60172;Dbxref=GeneID:1466272;gbkey=tRNA;gene=trnK;product=tRNA-Lys
NC_000932.1 RefSeq  exon    1717    1751    .   -   .   ID=id325623;Parent=rna60172;Dbxref=GeneID:1466272;gbkey=tRNA;gene=trnK;product=tRNA-Lys
NC_000932.1 RefSeq  gene    2056    3570    .   -   .   ID=gene38185;Dbxref=GeneID:844797;Name=matK;gbkey=Gene;gene=matK;gene_biotype=protein_coding;locus_tag=ArthCp003
NC_000932.1 RefSeq  CDS 2056    3570    .   -   0   ID=cds48182;Parent=gene38185;Dbxref=Genbank:NP_051040.2,GeneID:844797;Name=NP_051040.2;gbkey=CDS;gene=matK;product=maturase K;protein_id=NP_051040.2;transl_table=11
NC_000932.1 RefSeq  gene    5084    6188    .   -   .   ID=gene38186;Dbxref=GeneID:844798;Name=rps16;gbkey=Gene;gene=rps16;gene_biotype=protein_coding;locus_tag=ArthCp004
NC_000932.1 RefSeq  CDS 6149    6188    .   -   0   ID=cds48183;Parent=gene38186;Dbxref=Genbank:NP_051041.1,GeneID:844798;Name=NP_051041.1;gbkey=CDS;gene=rps16;product=ribosomal protein S16;protein_id=NP_051041.1;transl_table=11
NC_000932.1 RefSeq  CDS 5084    5283    .   -   2   ID=cds48183;Parent=gene38186;Dbxref=Genbank:NP_051041.1,GeneID:844798;Name=NP_051041.1;gbkey=CDS;gene=rps16;product=ribosomal protein S16;protein_id=NP_051041.1;transl_table=11

This is a problem since I would like to use 'exon' as a feature to count.

Is this normal? Should I use other feature to obtain the counts?

Thanks in advance.

rna-seq exon gff cds • 1.5k views
ADD COMMENTlink modified 22 months ago by tdmurphy190 • written 22 months ago by NielQC10

I am not a plant guy, but checking these two examples, both are chloroplast genes, so not on the nuclear genome, e.g. gene38186.

ADD REPLYlink written 22 months ago by ATpoint40k

You are right, NC_000932.1 is the chloroplast chromosome. The case I describe also happens in the mitochondrial chromosome, NC_037304.1. With the usual proceeding of counting 'exon' features these genes would be lost. In fact I'm not sure if people usually include these chromosomes in the analysis.

ADD REPLYlink written 22 months ago by NielQC10

chrM is typically not included in gene expression analysis (at least I do not do it for human and mouse). The reason that you do not have "exon" is probably the prokaryotic origin of these organelles which have operon rather than intron/exon structures.

ADD REPLYlink written 22 months ago by ATpoint40k
gravatar for tdmurphy
22 months ago by
tdmurphy190 wrote:

Organelles and prokaryote genomes in both GenBank and RefSeq are typically annotated with only CDS and no corresponding mRNA features, so the GFF3 has no exon features. This is mentioned briefly under "ANNOTATION DATA MODEL" in:

ADD COMMENTlink written 22 months ago by tdmurphy190
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1028 users visited in the last hour