Question: Gene level RNA seq expression measure
2
gravatar for Bnf83
3.0 years ago by
Bnf83130
Bnf83130 wrote:

Hi guys, I'm a newbie in the RNA Seq world. I'm analyzing some RNA seq data (HiSeq) and doing this, I'm following the best practice guideline available online form different sources. Briefly I performed the alignment to the reference genome, the merge, I removed duplicates, then I performed the sorting, indexing and finally using bedtools multicov I ended up having the reads count per RefSeq. My point is that I would like to have a gene level expression measure but after the counts I have reads counts per transcript variants. in other words I have multiple transcripts per gene with corresponding reads count but I would like to have one reads counts per gene. How do you deal with this issue?

Thanks in advance

B.

rna-seq gene expression • 1.2k views
ADD COMMENTlink modified 3.0 years ago by victor.ao.carmelo50 • written 3.0 years ago by Bnf83130

What do you want to use the counts for? Normally you'd use featureCounts for that (and not remove duplicates, which is generally a bad idea in RNAseq).

ADD REPLYlink written 3.0 years ago by Devon Ryan88k

I only would like to know if the genes are differentially expressed. I don't want to know if a specific splicing variant is differentially expressed. Only the gene.

ADD REPLYlink written 3.0 years ago by Bnf83130

As Devon wrote, use featureCounts (or htseq-count or ...), which you can use to count per gene.

ADD REPLYlink written 3.0 years ago by WouterDeCoster37k
4
gravatar for victor.ao.carmelo
3.0 years ago by
Denmark
victor.ao.carmelo50 wrote:

Firstly I would agree that removing duplicates is generally only something that should done if there are reasons, and not as a default.

It sounds like you used a reference set of refseq transcripts when counting with multicov. As far as I can see multicov performs a naive count of the reads in each interval given, which means the counts you observe on transcript level are not really meaningful as transcripts have many overlaps.

To get what you want you simply need to count features on a gene level. Both featureCounts as mentioned before and for example htseq-count can be used for this, as they can count reads on a gene metalevel which will count reads from all possible exons in each gene and summarize the results. I am unsure about the annotation quality of other species, but in human I have usually used the lastest version of gencode annotation in gtf or gff format (which includes gene and exon information) together with htseq-count/featureCounts. Note that your annotation version and reference genome used in alignment must match. You just have to make sure the meta feature you are counting is gene, and the feature is exons.

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by victor.ao.carmelo50

Hi Victor, I also use the gtf file from GENECODE, but I was thinking of counting only the reads mapping to exons (features) which have the biotype "protein coding". Would this be a mistake? Should I consider all exons? I also use featureCounts.

ADD REPLYlink written 2.8 years ago by mr.udvr20

Having looked recently at some human RiboSeq data, I'm not sure how accurate the biotypes tend to be. I've seen a good number of non-protein coding getting translated.

ADD REPLYlink written 2.8 years ago by Devon Ryan88k

Oh, I see. Thanks Devon! However, I have mouse data not human.

ADD REPLYlink written 2.8 years ago by mr.udvr20

The organism won't matter.

ADD REPLYlink written 2.8 years ago by Devon Ryan88k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1781 users visited in the last hour