Question: How to deal with clone genes?
BPors wrote (5 months ago):

Hi,

I am currently analysing RNA-seq data, and I noticed that the GTF file (Ensembl v96) I use for mapping contains ~19000 clone-based (Ensembl) genes.

Some of them share exons, in terms of genomic location, with their 'parent' protein-coding genes.

I am considering removing these clone-based genes, as they will affect the genome alignment statistics (especially the multi-mappers) and the transcript quantification that I plan to do later on.

The total number of genes in the Ensembl hg38.p12 GTF file is ~58000. So if I remove the clone-based ones, I am left with ~37000 genes.

Would it be a good call to remove these clone-based genes from the GTF file? Or would that lower the power of the analyses?

Examples:

  • where non-coding Z83844.1 (Clone-based (Ensembl) gene) exons overlap with NOL12:

https://ibb.co/7tQM63Z

  • where coding AC008403.1 exons overlap with the CYTH2 gene:

https://ibb.co/3vwhZg4

I would appreciate your suggestions. Thank you in advance.


Can you post a screenshot from a genome browser showing such a gene? The term is non-standard. Are you referring to different isoforms? If so, then no, absolutely do not remove them. The cell uses isoforms for a reason, so they might be biologically meaningful. It is also important to correct for isoform usage during the analysis.

Say cell A only expresses isoA with length L1 and cell B expresses isoB with length 1.5 × L1 (it is 1.5 times longer). At the same expression level you will get 1.5 times more reads from isoB, so the gene would come out as differentially expressed purely due to length bias. A recommended way of correcting for this is the tximport package, which calculates offsets for the linear models of the DEG analysis based on transcript length. A recommended pipeline is salmon-tximport-DEG, with DEG being e.g. DESeq2, edgeR or limma.

written 5 months ago by ATpoint
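
The length-bias arithmetic in the comment above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the number of molecules, the lengths, and the `READS_PER_KB_PER_MOLECULE` constant are made up for illustration, and the simple division by length is not what tximport does internally (it derives offsets from estimated transcript abundances). The point is only that equal expression with unequal isoform length yields unequal raw counts.

```python
# Toy model of length bias (all numbers are hypothetical).
# Assumption: reads sampled from a transcript scale with its length.
READS_PER_KB_PER_MOLECULE = 5  # made-up sequencing yield


def expected_reads(molecules, length_kb):
    """Expected raw read count for one isoform under the toy model."""
    return molecules * length_kb * READS_PER_KB_PER_MOLECULE


def per_kb(reads, length_kb):
    """Length-normalised signal: reads per kilobase of transcript."""
    return reads / length_kb


molecules = 100          # same expression level in both cells
len_iso_a = 1.0          # isoA length L1 (kb)
len_iso_b = 1.5          # isoB length 1.5 x L1 (kb)

reads_a = expected_reads(molecules, len_iso_a)
reads_b = expected_reads(molecules, len_iso_b)

# Raw counts differ although the expression level is identical.
raw_ratio = reads_b / reads_a

# After dividing by length, the apparent difference disappears.
normalized_ratio = per_kb(reads_b, len_iso_b) / per_kb(reads_a, len_iso_a)

print("raw ratio:", raw_ratio)
print("length-normalised ratio:", normalized_ratio)
```

In a real analysis the correction would come from the quantifier and tximport rather than from a manual division like this.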

Sure. I attached a link to a screenshot in the question.

I do not refer to them as isoforms, because they do not share the same gene source: their transcripts are generated separately from their own exons, which do not have a consistent description across Ensembl genome versions.

I could restrict tximport to protein-coding genes and transcripts to get rid of them in the downstream analyses, but wouldn't the reads that map to the exons of these clone-based genes already have skewed the numbers by the time I reach the tximport stage?

written 5 months ago by BPors
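
For readers who want to try the "protein-coding only" restriction mentioned in the reply above, here is a minimal sketch of filtering GTF records by the `gene_biotype` attribute that Ensembl GTF files carry in column 9. The function name `filter_gtf_lines` and the toy records (gene IDs, names, coordinates) are hypothetical. Note that a biotype filter would not remove clone-based genes that are themselves annotated as coding, such as the AC008403.1 example in the question, and this sketch says nothing about whether filtering is advisable.

```python
import re

# Matches the gene_biotype attribute in column 9 of an Ensembl-style GTF line.
BIOTYPE_RE = re.compile(r'gene_biotype "([^"]+)"')


def filter_gtf_lines(lines, keep_biotypes):
    """Yield header lines and records whose gene_biotype is in keep_biotypes."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed records
        match = BIOTYPE_RE.search(fields[8])
        if match and match.group(1) in keep_biotypes:
            yield line


# Hypothetical toy records, not taken from any real annotation.
toy_gtf = [
    "#!genome-build toy\n",
    'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A"; gene_biotype "protein_coding";\n',
    'chr1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A"; gene_biotype "protein_coding";\n',
    'chr1\ttoy\tgene\t200\t800\t.\t+\t.\tgene_id "G2"; gene_name "GENE_B"; gene_biotype "lncRNA";\n',
    'chr1\ttoy\texon\t200\t300\t.\t+\t.\tgene_id "G2"; gene_name "GENE_B"; gene_biotype "lncRNA";\n',
]

kept = list(filter_gtf_lines(toy_gtf, {"protein_coding"}))
for record in kept:
    print(record, end="")
```

With a real file, the same generator could be fed an open file handle and its output written to a new GTF.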