Question: Mapping low coverage RNAseq at gene-level rather than exon level
gravatar for Daniel
5.5 years ago by
Cardiff University
Daniel3.8k wrote:

I am mapping an RNAseq dataset (illumina 100bp PE) to a robust genome, but have only ~10 million per sample to cover 217,000 exons. This was by design as I'm primarily interested in differential expression, but my default mapping stage is with Tophat and a full gtf annotation file, and this salami-slices the abundance counts down quite significantly (a large proportion of exons finish with 0 reads).

My first thought was to investigate the genes as single units, ignoring splice variation. I thought to grep out 'gene' units from a gff file, convert to a gtf and provide to tophat, but it appears that gffread from cufflinks will only convert exons and CDS (unless I'm missing a flag). Also, I'm aware that Tophat is specifically based around splice variants and whether the fudge would break it. Alternatively, I imagine using the exon-mapped reads and combining the read counts per-gene would be more appropriate. Is there a common tool for the process that anyone can point at?

Otherwise, would a different mapper than Tophat be better (one not focussed on splice variants).



rna-seq • 1.5k views
ADD COMMENTlink modified 5.5 years ago by jshall40 • written 5.5 years ago by Daniel3.8k

how did you produce the exon counts in the first place? I can't understand your problem very well. are you worried that you're losing many counts by some splicing specific procedure by tophat? actually if you provide tophat with a gtf file what the program does is to compute the genes transcripts as continuous sequences and align the reads to them. so in that case you would not be losing reads due to splicing. in general if you want to do a gene level analysis tophat + htseq-count + deseq2 is a very suitable pipeline, even with 10M reads.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Martombo2.7k
gravatar for jshall
5.5 years ago by
United States
jshall40 wrote:

You're probably best letting your mapper (ie Tophat or something else like STAR) have all the splice junction information found in the original GFF, it'll be better able to align reads along known junctions. If you take the BAM files output by your mapper and run them through htseq-count, by default it will sum up all the hits to various exons to provide gene-level counts. 

Basically, for DE its most common to map at the exon level and quantify at the gene level. 

ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by jshall40

Thanks, I've got HTSeq running now. On my first pass I used cufflinks and cummerbund and I got some weird results which prompted this question. The last time I transcriptome assembly was 7 years ago and things have changed a fair bit! Also, for my sanity, what does 'DE' mean in your last sentence? 

ADD REPLYlink written 5.5 years ago by Daniel3.8k

Oh, DE =  Differential Expression. 

ADD REPLYlink written 5.5 years ago by jshall40

Ah, of course! thanks!

ADD REPLYlink written 5.5 years ago by Daniel3.8k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1109 users visited in the last hour