Question: Unexpected gene polymorphism using Salmon-tximeta-DESeq2
12 days ago, raywong.chn (10) wrote:

We're analyzing RNAseq data with a pipeline consisting of Salmon, tximeta, and DESeq2.

We have a multi-factorial experimental design, and the experiment was performed on cell lines.

One thing that surprised us is that in the results output we observe many gene polymorphisms.

For example, for gene NLRP2 we observed multiple entries, each associated with a unique Ensembl ID: ENSG00000022556, ENSG00000275082, ENSG00000275843, etc.

ensembl_id       baseMean     log2FoldChange  pvalue       padj         gene   CTRL_1     CTRL_2     A_1        A_2        B_1        B_2        A+B_1      A+B_2
ENSG00000022556  559.2711127  -1.709470173    5.51E-09     2.16E-07     NLRP2  33.063154  17.498608  23.790824  28.562371  6.421092   6.755627   29.858583  23.977158
ENSG00000275082  349.6580809  2.406888875     0.592471935  0.817837758  NLRP2  0          7.920205   10.814798  0          18.640884  18.543885  0          3.545411

My question is: how do we interpret data like this, and how should we deal with this kind of situation? Can we add or average the different entries associated with the same gene?


I think the problem is that you simply conducted a transcript-level DGE analysis. What organism are you using? What reference did you use? How did you annotate your transcripts? Maybe you should use tximport to conduct gene-level DGE analyses as recommended in this paper. tximport basically needs a two-column data frame with transcript ID and gene ID. It then summarises read counts per gene prior to DGE analysis in DESeq2. I would not recommend summarising counts manually.

Reply written 12 days ago by ponganta (50)
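To make the transcript-to-gene summarisation described in the reply above concrete, here is a minimal conceptual sketch in Python. It is not tximport's implementation (tximport is an R package and also handles abundances and transcript lengths); it only shows the idea of summing per-transcript counts using a two-column transcript-to-gene mapping. The identifiers `tx1`, `geneA`, etc. and the counts are made up for illustration.

```python
from collections import defaultdict

# Hypothetical two-column mapping: transcript ID -> gene ID.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

# Hypothetical per-transcript estimated counts for a single sample.
tx_counts = {"tx1": 10.0, "tx2": 5.5, "tx3": 7.0}


def summarise_to_gene(tx_counts, tx2gene):
    """Sum transcript-level counts into gene-level counts."""
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene_counts[tx2gene[tx]] += count
    return dict(gene_counts)


gene_counts = summarise_to_gene(tx_counts, tx2gene)
print(gene_counts)
```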

This is unrelated to transcripts; OP is already aggregating to gene level via tximeta. There is ambiguity in the Ensembl annotations between gene_id (the Ensembl identifier) and gene_name (the "trivial" gene name, HGNC). Several Ensembl IDs are mapped to two HGNC names and some to no HGNC name at all. There is no universal rule for handling this. Sometimes people simply select one of the two (or many) at random, or choose the one with the higher average expression, or simply keep all. How many of those ambiguous calls do you have?

Reply written 12 days ago by ATpoint (44k), modified 12 days ago
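Two of the points in the reply above (counting ambiguous gene names, and keeping the entry with the higher average expression) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two rows are copied from the table in the question, baseMean is used as the measure of average expression, and the function names are hypothetical.

```python
from collections import defaultdict

# (Ensembl ID, baseMean, gene name), copied from the results table in the question.
rows = [
    ("ENSG00000022556", 559.2711127, "NLRP2"),
    ("ENSG00000275082", 349.6580809, "NLRP2"),
]


def group_by_gene_name(rows):
    """Collect (Ensembl ID, baseMean) pairs under each gene name."""
    groups = defaultdict(list)
    for ensembl_id, base_mean, gene in rows:
        groups[gene].append((ensembl_id, base_mean))
    return dict(groups)


def ambiguous_names(groups):
    """Gene names that are attached to more than one Ensembl ID."""
    return {gene: entries for gene, entries in groups.items() if len(entries) > 1}


def keep_highest_expressed(groups):
    """For each gene name, keep the Ensembl ID with the largest baseMean."""
    return {gene: max(entries, key=lambda e: e[1])[0] for gene, entries in groups.items()}


groups = group_by_gene_name(rows)
ambiguous = ambiguous_names(groups)
representative = keep_highest_expressed(groups)
print(len(ambiguous), representative)
```

Keeping every entry and reporting results by Ensembl ID needs no collapsing at all; the sketch only applies if one row per gene name is required downstream.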

@ponganta thanks for the input, and @ATpoint thanks a lot for clearing things up.

There are 490 genes with calls to multiple genomic loci. The number of ambiguous calls per gene ranges from 2 to 7.

Reply written 12 days ago by raywong.chn (10)

12 days ago, swbarnes2 (United States, 9.4k) wrote:

The annotation is what it is. Your first example is located on a real chromosome, the second is on a scaffold, FWIW.

Just keep the Ensembl IDs as the primary identifier all the way through. They are unique.


@swbarnes2 Yes, you're right. I guess what I'm really concerned about is how to interpret this at the biological level. If we believe these gene polymorphisms to be bona fide mappings, how did this happen in a cultured cell line?

Reply written 11 days ago by raywong.chn (10)

4 days ago, raywong.chn (10) wrote:

Problem solved.

It turns out that this was due to building the Salmon index from the Ensembl genome FASTA, which contains plenty of duplicated genes on haplotype chromosomes.

Switching to GENCODE should resolve the issue, as suggested in this thread:

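As a rough way to check a reference FASTA for the kind of duplication described in this answer, the Python sketch below separates records on primary chromosomes from records on haplotype or scaffold sequences. It rests on an assumption that should be verified against the actual file: that each header carries a location field of the form `chromosome:ASSEMBLY:NAME:start:end:strand` or `scaffold:ASSEMBLY:NAME:...`. The example headers, IDs, and sequences are invented placeholders, and the function names are hypothetical.

```python
import re

# Assumed primary sequence names for human: autosomes, sex chromosomes, mitochondrion.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}

# Assumed header field: "<chromosome|scaffold>:<assembly>:<sequence name>:..."
REGION = re.compile(r"\b(?:chromosome|scaffold):[^:\s]+:([^:\s]+):")


def region_name(header):
    """Return the sequence name from a FASTA header, or None if it is not found."""
    match = REGION.search(header)
    return match.group(1) if match else None


def is_primary(header):
    """True if the record is located on a primary chromosome."""
    return region_name(header) in PRIMARY


def filter_primary(lines):
    """Yield only the FASTA records located on primary chromosomes."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            keep = is_primary(line)
        if keep:
            yield line


# Invented placeholder records, not real annotation entries.
example = [
    ">TX_A cdna chromosome:GRCh38:19:100:200:1 gene:GENE_A",
    "ACGT",
    ">TX_B cdna chromosome:GRCh38:CHR_HSCHR19_4_CTG3_1:100:200:1 gene:GENE_B",
    "ACGA",
    ">TX_C cdna scaffold:GRCh38:KI270728.1:100:200:1 gene:GENE_C",
    "ACGG",
]

kept = list(filter_primary(example))
print(kept)
```

Records whose header lacks the assumed location field are dropped by this sketch, so it is worth inspecting a few headers by eye before relying on it.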