Question

Gene level RNA seq expression measure

2

Entering edit mode

8.1 years ago

Bnf83 ▴ 150

Hi guys, I'm a newbie in the RNA Seq world. I'm analyzing some RNA seq data (HiSeq) and doing this, I'm following the best practice guideline available online form different sources. Briefly I performed the alignment to the reference genome, the merge, I removed duplicates, then I performed the sorting, indexing and finally using bedtools multicov I ended up having the reads count per RefSeq. My point is that I would like to have a gene level expression measure but after the counts I have reads counts per transcript variants. in other words I have multiple transcripts per gene with corresponding reads count but I would like to have one reads counts per gene. How do you deal with this issue?

Thanks in advance

B.

RNA-Seq gene expression • 2.3k views

ADD COMMENT • link updated 8.1 years ago by victor.ao.carmelo ▴ 50 • written 8.1 years ago by Bnf83 ▴ 150

0

Entering edit mode

What do you want to use the counts for? Normally you'd use featureCounts for that (and not remove duplicates, which is generally a bad idea in RNAseq).

ADD REPLY • link 8.1 years ago by Devon Ryan 104k

0

Entering edit mode

I only would like to know if the genes are differentially expressed. I don't want to know if a specific splicing variant is differentially expressed. Only the gene.

ADD REPLY • link 8.1 years ago by Bnf83 ▴ 150

0

Entering edit mode

As Devon wrote, use featureCounts (or htseq-count or ...), which you can use to count per gene.

ADD REPLY • link 8.1 years ago by WouterDeCoster 47k

score 4 · Answer 1 · 2016-04-05

4

Entering edit mode

8.1 years ago

victor.ao.carmelo ▴ 50

Firstly I would agree that removing duplicates is generally only something that should done if there are reasons, and not as a default.

It sounds like you used a reference set of refseq transcripts when counting with multicov. As far as I can see multicov performs a naive count of the reads in each interval given, which means the counts you observe on transcript level are not really meaningful as transcripts have many overlaps.

To get what you want you simply need to count features on a gene level. Both featureCounts as mentioned before and for example htseq-count can be used for this, as they can count reads on a gene metalevel which will count reads from all possible exons in each gene and summarize the results. I am unsure about the annotation quality of other species, but in human I have usually used the lastest version of gencode annotation in gtf or gff format (which includes gene and exon information) together with htseq-count/featureCounts. Note that your annotation version and reference genome used in alignment must match. You just have to make sure the meta feature you are counting is gene, and the feature is exons.

ADD COMMENT • link 8.1 years ago by victor.ao.carmelo ▴ 50

0

Entering edit mode

Hi Victor, I also use the gtf file from GENECODE, but I was thinking of counting only the reads mapping to exons (features) which have the biotype "protein coding". Would this be a mistake? Should I consider all exons? I also use featureCounts.

ADD REPLY • link 7.9 years ago by mr.udvr ▴ 20

0

Entering edit mode

Having looked recently at some human RiboSeq data, I'm not sure how accurate the biotypes tend to be. I've seen a good number of non-protein coding getting translated.

ADD REPLY • link 7.9 years ago by Devon Ryan 104k

0

Entering edit mode

Oh, I see. Thanks Devon! However, I have mouse data not human.

ADD REPLY • link 7.9 years ago by mr.udvr ▴ 20

0

Entering edit mode

The organism won't matter.

ADD REPLY • link 7.9 years ago by Devon Ryan 104k