How are duplicated genes named in a GTF file?
8 months ago
Petesview ▴ 10

Hi,

Some genes in the genome are known to be duplicated, so there are multiple copies of the same protein-coding sequence at different loci. My question is: how are these duplicated genes named in the GENCODE or RefSeq annotation (GTF or GFF file)? Do they have exactly the same gene name but with numerical suffixes (e.g. Gene_A.1 and Gene_A.2)? Do they have different gene names (e.g. Gene_A1 and Gene_A2)? Or are all duplicates of a gene merged under one gene name and quantified together (e.g. Gene_A)?

Also, does each splice variant of a gene have its own gene name, and are the variants quantified separately when the GTF or GFF file is used in a standard RNA-seq analysis?

RNA-seq • 900 views

do different splice variants of a gene have its own gene name and are quantified separately under the gtf or gff file of a standard RNA-seq analysis?

They don't have different gene IDs, but they do have different transcript IDs. Depending on the tool you use for quantification and on the downstream analyses, they can be quantified separately, but most people quantify at the gene level.
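
As a rough illustration of gene-level quantification, transcript-level estimates can be summed per gene_id. This is only a sketch: the transcript IDs, counts, and the `tx2gene` mapping below are made up, and in practice the mapping would come from the transcript_id and gene_id attributes of the GTF file.

```python
from collections import defaultdict

# Hypothetical transcript-level counts (transcript_id -> estimated reads).
transcript_counts = {
    "tx1": 120.0,
    "tx2": 30.0,
    "tx3": 55.0,
}

# Hypothetical transcript -> gene mapping, as it would be read from the
# transcript_id and gene_id attributes of a GTF file.
tx2gene = {
    "tx1": "GeneA",
    "tx2": "GeneA",
    "tx3": "GeneB",
}


def summarise_to_gene(counts, mapping):
    """Sum transcript-level counts into one value per gene_id."""
    gene_counts = defaultdict(float)
    for tx_id, count in counts.items():
        gene_counts[mapping[tx_id]] += count
    return dict(gene_counts)


gene_counts = summarise_to_gene(transcript_counts, tx2gene)
print(gene_counts)
```

Here the two splice variants of GeneA (tx1 and tx2) end up in a single gene-level value, while their separate transcript-level values remain available if you want to analyse them individually.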

8 months ago
vkkodali_ncbi ★ 3.7k

Note that the following applies to genes that are _annotated_ more than once on the same assembly. Duplicated genes are typically just annotated twice with distinct GeneIDs; the gene symbol may be the same for these, but the GeneID is not.

In the case of RefSeq annotation files, if GeneA is annotated twice then the gene_id attribute in the GTF file will be GeneA for the first instance and GeneA_1 for the second instance. In both cases, the gene attribute will have the value GeneA, which is the official symbol for that gene. For example, look at the annotation of the mouse gene Erdr1x. In the GFF3 file, the following gene rows are present:

NC_000086.8 BestRefSeq  pseudogene  168793522   168801793   .   +   .   ID=gene-Erdr1x;Dbxref=GeneID:170942,MGI:MGI:2384747;Name=Erdr1x;description=erythroid differentiation regulator 1 x;end_range=168801793,.;gbkey=Gene;gene=Erdr1x;gene_biotype=transcribed_pseudogene;gene_synonym=edr,Erdr1,Gm21887,Gm55594;partial=true;pseudo=true
NC_000087.8 BestRefSeq  pseudogene  90796711    90827734    .   +   .   ID=gene-Erdr1x-2;Dbxref=GeneID:170942,MGI:MGI:2384747;Name=Erdr1x;description=erythroid differentiation regulator 1 x;gbkey=Gene;gene=Erdr1x;gene_biotype=transcribed_pseudogene;gene_synonym=edr,Erdr1,Gm21887,Gm55594;pseudo=true

That same gene has the following two rows in GTF:

NC_000086.8 BestRefSeq  gene    168793522   168801793   .   +   .   gene_id "Erdr1x"; transcript_id ""; db_xref "GeneID:170942"; db_xref "MGI:MGI:2384747"; description "erythroid differentiation regulator 1 x"; gbkey "Gene"; gene "Erdr1x"; gene_biotype "transcribed_pseudogene"; gene_synonym "edr"; gene_synonym "Erdr1"; gene_synonym "Gm21887"; gene_synonym "Gm55594"; partial "true"; pseudo "true"; 
NC_000087.8 BestRefSeq  gene    90796711    90827734    .   +   .   gene_id "Erdr1x_1"; transcript_id ""; db_xref "GeneID:170942"; db_xref "MGI:MGI:2384747"; description "erythroid differentiation regulator 1 x"; gbkey "Gene"; gene "Erdr1x"; gene_biotype "transcribed_pseudogene"; gene_synonym "edr"; gene_synonym "Erdr1"; gene_synonym "Gm21887"; gene_synonym "Gm55594"; pseudo "true";
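
To find such cases in a GTF file, one option is to group gene_id values by the symbol in the gene attribute. The sketch below uses the two Erdr1x rows shown above, shortened to a few attributes; the function names are my own and the parser assumes the simple `key "value";` attribute layout seen in these rows.

```python
import re
from collections import defaultdict

# Two gene rows abbreviated from the RefSeq GTF excerpt above
# (only a subset of the attributes is kept to keep the example short).
GTF_LINES = [
    'NC_000086.8\tBestRefSeq\tgene\t168793522\t168801793\t.\t+\t.\t'
    'gene_id "Erdr1x"; transcript_id ""; db_xref "GeneID:170942"; gene "Erdr1x"; pseudo "true";',
    'NC_000087.8\tBestRefSeq\tgene\t90796711\t90827734\t.\t+\t.\t'
    'gene_id "Erdr1x_1"; transcript_id ""; db_xref "GeneID:170942"; gene "Erdr1x"; pseudo "true";',
]

# Matches attributes written as: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(attr_field):
    """Return a dict of GTF attributes; repeated keys keep their first value."""
    attrs = {}
    for key, value in ATTR_RE.findall(attr_field):
        attrs.setdefault(key, value)
    return attrs


def gene_ids_by_symbol(lines):
    """Group (gene_id, sequence) pairs by the symbol in the 'gene' attribute."""
    groups = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # only look at gene rows
        attrs = parse_attributes(fields[8])
        # Fall back to gene_id when no separate symbol attribute is present.
        symbol = attrs.get("gene", attrs["gene_id"])
        groups[symbol].append((attrs["gene_id"], fields[0]))
    return dict(groups)


groups = gene_ids_by_symbol(GTF_LINES)
duplicated = {symbol: ids for symbol, ids in groups.items() if len(ids) > 1}
print(duplicated)
```

For these two rows, the symbol Erdr1x maps to the gene_id values Erdr1x (on NC_000086.8) and Erdr1x_1 (on NC_000087.8), which is the suffixing pattern described above.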

Thanks for the update. If that's the case, is there any biological meaning in performing differential expression analysis on the duplicates of a gene? Or do I have to find a way to combine their quantification before differential analysis?


It really depends on what you want. If you are really interested in those genes, and if they have diverged a bit in sequence, quantification algorithms that apply Expectation Maximisation might be able to pick up the differences in expression between them. If they have not diverged enough and you are using a standard pipeline (e.g. featureCounts), reads will be classified as multi-mappers and, by default, not counted, so you will not be able to quantify those genes.
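
To make the Expectation Maximisation idea concrete, here is a toy sketch. It is not the algorithm of any particular tool: it ignores gene length, alignment scores, and fragment models, and the function name, gene names, and read numbers are all invented for illustration.

```python
def em_apportion(read_compat, genes, n_iter=100):
    """Toy EM: estimate expected read counts per gene.

    read_compat: list of sets, each holding the genes one read aligns to.
    Gene length and alignment quality are ignored, unlike in real quantifiers.
    """
    counts = {g: 0.0 for g in genes}
    if not read_compat:
        return counts
    theta = {g: 1.0 / len(genes) for g in genes}  # start from equal abundances
    for _ in range(n_iter):
        counts = {g: 0.0 for g in genes}
        # E-step: split each read across its compatible genes in proportion to theta.
        for compat in read_compat:
            total = sum(theta[g] for g in compat)
            for g in compat:
                counts[g] += theta[g] / total
        # M-step: re-estimate relative abundances from the fractional counts.
        n_reads = sum(counts.values())
        theta = {g: counts[g] / n_reads for g in genes}
    return counts


genes = ["copy1", "copy2"]

# Diverged copies: some reads are unique to one copy, the rest are ambiguous.
reads = [{"copy1"}] * 30 + [{"copy2"}] * 10 + [{"copy1", "copy2"}] * 60
result = em_apportion(reads, genes)
print({g: round(c, 2) for g, c in result.items()})

# Identical copies: every read is compatible with both.
identical = em_apportion([{"copy1", "copy2"}] * 100, genes)
print({g: round(c, 2) for g, c in identical.items()})
```

In the first case the 60 ambiguous reads are shared out according to the unique evidence, giving roughly 75 and 25 reads. In the second case there is no unique evidence at all, so the 50/50 split only reflects the starting guess and says nothing about the real expression of either copy.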


I think alignment algorithms like STAR output and preserve multi-mapping reads, so does that mean each duplicated copy of the gene may be quantified?

Also, if the sequences of the duplicates are nearly identical, will the majority of reads that map only to the coding regions be discarded, or will they be retained but apportioned using Expectation Maximisation because they are multi-mappers? Reads that overlap the non-coding sequences flanking the gene, upstream or downstream, I assume might have a chance of being unique mappers that map to only one of the duplicated copies. Are my assumptions correct? Sorry for the lengthy reply.
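
One way this could be checked on real data is to count unique versus multi-mapped reads from the NH tag (number of reported alignments for the read) in the SAM records. The sketch below assumes the aligner writes NH tags and uses made-up single-end SAM lines; the function name is hypothetical.

```python
def count_unique_and_multi(sam_lines):
    """Count primary alignments with NH == 1 (unique) and NH > 1 (multi-mapped)."""
    unique = 0
    multi = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records
        # so that each mapped read is counted once.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        nh = None
        for tag in fields[11:]:  # optional fields start at column 12
            if tag.startswith("NH:i:"):
                nh = int(tag[5:])
                break
        if nh is None:
            continue  # no NH tag, cannot classify this record
        if nh == 1:
            unique += 1
        else:
            multi += 1
    return unique, multi


# Made-up records: read1 is unique, read2 has two alignments, read3 is unmapped.
SAM_LINES = [
    "@HD\tVN:1.6",
    "read1\t0\tchrX\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "read2\t0\tchrX\t200\t3\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2\tHI:i:1",
    "read2\t256\tchrY\t300\t3\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2\tHI:i:2",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]

unique, multi = count_unique_and_multi(SAM_LINES)
print(unique, multi)
```

Restricting the input to alignments that overlap the two gene copies would show how many reads in those regions are unique and therefore usable by a counter that ignores multi-mappers.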


If I recall correctly, STAR also uses an Expectation Maximisation algorithm, so as long as your sequences are divergent enough, you might be able to quantify them. If they are identical, I think the algorithm might distribute reads between both copies.
