Question: Best practices for counting genes with nanopore RNAseq data
1
gravatar for Florian Bernard
18 months ago by
Florian Bernard70 wrote:

Hi,

I searched the previous posts and the nanopore community forum but couldn't find a definitive answer so here I am looking for your advice. I know working with long-reads RNA-seq data is still complicated because it's hard to find a general consensus on how to analyse them but I'd be interested in having your opinion.

I'm interested in measuring the number of genes detected in various RNA-seq libraries that I have generated with various nanopore library kits (DNA ligation of RT-PCR products, direct-cDNA and direct-RNA).

So far I've been using HTseq with the mode intersection-nonempty but it seems that now the union mode is the most recommended one. And i've also seen posts stating that featureCounts is much better and there's even an option for longreads.

I tried to compare the output of the three 'possibilities' but I cannot say it really helped me in the end:

  • HTseq + intersection-nonempty mode: 238K assigned + 0 ambiguous + 18K no_features + 24K too low quality

  • HTseq + union mode: 210K assigned + 44K ambiguous + 2K no_features + 24K too low quality

  • featureCounts + long-reads mode: 219K assigned + 59K ambiguous + 2K no_features + 0 too low quality

Based on those observations, would you be able to recommend me to use one versus the two others ? And if yes, why ? I'm trying to understand what makes one more suited to my need and/or generally more suited to the analyse of long reads RNA-seq data.

Thanks a lot.

nanopore rna-seq alignment • 1.0k views
ADD COMMENTlink modified 8 months ago by v.shapovalova110 • written 18 months ago by Florian Bernard70

Based on these observations alone, I really believe there is no way of deciding between them. Please provide more information, e.g. command-lines used and if the libraries are stranded or not.

ADD REPLYlink modified 18 months ago • written 18 months ago by h.mon30k

Then if it's all the same, since I'm more interested by reads that have been assigned to a gene, can we say that HTseq + intersection-nonempty produce the more useful result ? At the same time I'm worried that I might be working with reads wrongly assigned.. Having very little experience in bioinformatics I don't think I have enough hindsight to to say if i should go for the lower number of assigned reads or the bigger..

ADD REPLYlink written 18 months ago by Florian Bernard70

I didn't say it is all the same, I said there is not enough information for us to provide a good suggestion.

ADD REPLYlink written 18 months ago by h.mon30k

I also think it's quite hard to decide between the methods. I'd like to suggest to use the -o option from HTSeq (the input format has to be SAM in this case). You can annotate the reads with the feature (gene id) and HTSeq's meta features. For genes where the read count differs, you can trace back the reads which made the difference.

Additionally, please keep in mind, that that the 5' site of the transcripts might be not 100% exactly annotated. Allowing for a read to jut out might not be the worst idea.

ADD REPLYlink written 18 months ago by michael.ante3.6k

I've been using the -o option already (I need it for a downstream analysis) but even if i manage to distinguish reads that are being counted from reads that are not, how will I be able to say why ?

ADD REPLYlink written 18 months ago by Florian Bernard70

Hello, Florian,

I have the same question and I'm very interested in your results. What was the best practice? Firstly minimap2 then salmon?

ADD REPLYlink written 8 months ago by v.shapovalova110
2
gravatar for h.mon
18 months ago by
h.mon30k
Brazil
h.mon30k wrote:

Having thinking a little more about the problem, I think you may try Salmon with its alignment-based mode of quantification. The expectation-maximization algorithm will probably do a better job at sorting ambiguous reads than HTSeq / featureCounts. Be sure to read the manual: https://salmon.readthedocs.io/en/latest/salmon.html

ADD COMMENTlink written 18 months ago by h.mon30k

It might be a naive question but I read that in Salmon's documentation :

Salmon expects that the reads have been aligned directly to the transcriptome (like RSEM, eXpress, etc.) rather than to the genome (as does, e.g. Cufflinks).

I don't get the point of first maping against the transcriptome and then .. doing the same thing ? I mean if I align my reads to the transcriptome, won't I already have the answer I'm looking for (ie. numbers of reads aligning to one gene/feature)?

ADD REPLYlink written 18 months ago by Florian Bernard70

I have no experience with Nanopore reads, but I remember reading somewhere (sorry, can't find again where) that Salmon in quant mode has a low assignment rate for Nanopere reads, due to their high error rate. However, if you map independently to the transcriptome (e.g. with minimap2, or STARlong), and then quantify, you get a good assignment rate.

ADD REPLYlink written 18 months ago by h.mon30k

Ok that's interesting and definitely worth a try ! I'll came back to report how it went, thanks !

ADD REPLYlink written 18 months ago by Florian Bernard70

Salmon now being recommended by ONT and used in several papers, it seems it was a good call ! Thanks for your help.

ADD REPLYlink written 17 months ago by Florian Bernard70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 750 users visited in the last hour