Question: Best practices for counting genes with nanopore RNAseq data
gravatar for Florian Bernard
8 months ago by
Florian Bernard60 wrote:


I searched the previous posts and the nanopore community forum but couldn't find a definitive answer so here I am looking for your advice. I know working with long-reads RNA-seq data is still complicated because it's hard to find a general consensus on how to analyse them but I'd be interested in having your opinion.

I'm interested in measuring the number of genes detected in various RNA-seq libraries that I have generated with various nanopore library kits (DNA ligation of RT-PCR products, direct-cDNA and direct-RNA).

So far I've been using HTseq with the mode intersection-nonempty but it seems that now the union mode is the most recommended one. And i've also seen posts stating that featureCounts is much better and there's even an option for longreads.

I tried to compare the output of the three 'possibilities' but I cannot say it really helped me in the end:

  • HTseq + intersection-nonempty mode: 238K assigned + 0 ambiguous + 18K no_features + 24K too low quality

  • HTseq + union mode: 210K assigned + 44K ambiguous + 2K no_features + 24K too low quality

  • featureCounts + long-reads mode: 219K assigned + 59K ambiguous + 2K no_features + 0 too low quality

Based on those observations, would you be able to recommend me to use one versus the two others ? And if yes, why ? I'm trying to understand what makes one more suited to my need and/or generally more suited to the analyse of long reads RNA-seq data.

Thanks a lot.

nanopore rna-seq alignment • 501 views
ADD COMMENTlink modified 8 months ago by h.mon27k • written 8 months ago by Florian Bernard60

Based on these observations alone, I really believe there is no way of deciding between them. Please provide more information, e.g. command-lines used and if the libraries are stranded or not.

ADD REPLYlink modified 8 months ago • written 8 months ago by h.mon27k

Then if it's all the same, since I'm more interested by reads that have been assigned to a gene, can we say that HTseq + intersection-nonempty produce the more useful result ? At the same time I'm worried that I might be working with reads wrongly assigned.. Having very little experience in bioinformatics I don't think I have enough hindsight to to say if i should go for the lower number of assigned reads or the bigger..

ADD REPLYlink written 8 months ago by Florian Bernard60

I didn't say it is all the same, I said there is not enough information for us to provide a good suggestion.

ADD REPLYlink written 8 months ago by h.mon27k

I also think it's quite hard to decide between the methods. I'd like to suggest to use the -o option from HTSeq (the input format has to be SAM in this case). You can annotate the reads with the feature (gene id) and HTSeq's meta features. For genes where the read count differs, you can trace back the reads which made the difference.

Additionally, please keep in mind, that that the 5' site of the transcripts might be not 100% exactly annotated. Allowing for a read to jut out might not be the worst idea.

ADD REPLYlink written 8 months ago by michael.ante3.4k

I've been using the -o option already (I need it for a downstream analysis) but even if i manage to distinguish reads that are being counted from reads that are not, how will I be able to say why ?

ADD REPLYlink written 8 months ago by Florian Bernard60
gravatar for h.mon
8 months ago by
h.mon27k wrote:

Having thinking a little more about the problem, I think you may try Salmon with its alignment-based mode of quantification. The expectation-maximization algorithm will probably do a better job at sorting ambiguous reads than HTSeq / featureCounts. Be sure to read the manual:

ADD COMMENTlink written 8 months ago by h.mon27k

It might be a naive question but I read that in Salmon's documentation :

Salmon expects that the reads have been aligned directly to the transcriptome (like RSEM, eXpress, etc.) rather than to the genome (as does, e.g. Cufflinks).

I don't get the point of first maping against the transcriptome and then .. doing the same thing ? I mean if I align my reads to the transcriptome, won't I already have the answer I'm looking for (ie. numbers of reads aligning to one gene/feature)?

ADD REPLYlink written 8 months ago by Florian Bernard60

I have no experience with Nanopore reads, but I remember reading somewhere (sorry, can't find again where) that Salmon in quant mode has a low assignment rate for Nanopere reads, due to their high error rate. However, if you map independently to the transcriptome (e.g. with minimap2, or STARlong), and then quantify, you get a good assignment rate.

ADD REPLYlink written 8 months ago by h.mon27k

Ok that's interesting and definitely worth a try ! I'll came back to report how it went, thanks !

ADD REPLYlink written 8 months ago by Florian Bernard60

Salmon now being recommended by ONT and used in several papers, it seems it was a good call ! Thanks for your help.

ADD REPLYlink written 7 months ago by Florian Bernard60
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1440 users visited in the last hour