Question

Best practices for counting genes with nanopore RNAseq data

2

Entering edit mode

5.7 years ago

Florian Bernard ▴ 90

Hi,

I searched the previous posts and the nanopore community forum but couldn't find a definitive answer so here I am looking for your advice. I know working with long-reads RNA-seq data is still complicated because it's hard to find a general consensus on how to analyse them but I'd be interested in having your opinion.

I'm interested in measuring the number of genes detected in various RNA-seq libraries that I have generated with various nanopore library kits (DNA ligation of RT-PCR products, direct-cDNA and direct-RNA).

So far I've been using HTseq with the mode intersection-nonempty but it seems that now the union mode is the most recommended one. And i've also seen posts stating that featureCounts is much better and there's even an option for longreads.

I tried to compare the output of the three 'possibilities' but I cannot say it really helped me in the end:

HTseq + intersection-nonempty mode: 238K assigned + 0 ambiguous + 18K no_features + 24K too low quality
HTseq + union mode: 210K assigned + 44K ambiguous + 2K no_features + 24K too low quality
featureCounts + long-reads mode: 219K assigned + 59K ambiguous + 2K no_features + 0 too low quality

Based on those observations, would you be able to recommend me to use one versus the two others ? And if yes, why ? I'm trying to understand what makes one more suited to my need and/or generally more suited to the analyse of long reads RNA-seq data.

Thanks a lot.

rna-seq nanopore alignment • 3.9k views

ADD COMMENT • link updated 4.9 years ago by v.shapovalova1 ▴ 10 • written 5.7 years ago by Florian Bernard ▴ 90

0

Entering edit mode

Based on these observations alone, I really believe there is no way of deciding between them. Please provide more information, e.g. command-lines used and if the libraries are stranded or not.

ADD REPLY • link 5.7 years ago by h.mon 35k

0

Entering edit mode

Then if it's all the same, since I'm more interested by reads that have been assigned to a gene, can we say that HTseq + intersection-nonempty produce the more useful result ? At the same time I'm worried that I might be working with reads wrongly assigned.. Having very little experience in bioinformatics I don't think I have enough hindsight to to say if i should go for the lower number of assigned reads or the bigger..

ADD REPLY • link 5.7 years ago by Florian Bernard ▴ 90

0

Entering edit mode

I didn't say it is all the same, I said there is not enough information for us to provide a good suggestion.

ADD REPLY • link 5.7 years ago by h.mon 35k

0

Entering edit mode

I also think it's quite hard to decide between the methods. I'd like to suggest to use the -o option from HTSeq (the input format has to be SAM in this case). You can annotate the reads with the feature (gene id) and HTSeq's meta features. For genes where the read count differs, you can trace back the reads which made the difference.

Additionally, please keep in mind, that that the 5' site of the transcripts might be not 100% exactly annotated. Allowing for a read to jut out might not be the worst idea.

ADD REPLY • link 5.7 years ago by michael.ante ★ 3.8k

0

Entering edit mode

I've been using the -o option already (I need it for a downstream analysis) but even if i manage to distinguish reads that are being counted from reads that are not, how will I be able to say why ?

ADD REPLY • link 5.7 years ago by Florian Bernard ▴ 90

0

Entering edit mode

Hello, Florian,

I have the same question and I'm very interested in your results. What was the best practice? Firstly minimap2 then salmon?

ADD REPLY • link 4.9 years ago by v.shapovalova1 ▴ 10

score 2 · Accepted Answer · 2019-02-04

2

Entering edit mode

5.7 years ago

h.mon 35k

Having thinking a little more about the problem, I think you may try Salmon with its alignment-based mode of quantification. The expectation-maximization algorithm will probably do a better job at sorting ambiguous reads than HTSeq / featureCounts. Be sure to read the manual: https://salmon.readthedocs.io/en/latest/salmon.html

ADD COMMENT • link 5.7 years ago by h.mon 35k

0

Entering edit mode

It might be a naive question but I read that in Salmon's documentation :

Salmon expects that the reads have been aligned directly to the transcriptome (like RSEM, eXpress, etc.) rather than to the genome (as does, e.g. Cufflinks).

I don't get the point of first maping against the transcriptome and then .. doing the same thing ? I mean if I align my reads to the transcriptome, won't I already have the answer I'm looking for (ie. numbers of reads aligning to one gene/feature)?

ADD REPLY • link 5.7 years ago by Florian Bernard ▴ 90

0

Entering edit mode

I have no experience with Nanopore reads, but I remember reading somewhere (sorry, can't find again where) that Salmon in quant mode has a low assignment rate for Nanopere reads, due to their high error rate. However, if you map independently to the transcriptome (e.g. with minimap2, or STARlong), and then quantify, you get a good assignment rate.

ADD REPLY • link 5.7 years ago by h.mon 35k

0

Entering edit mode

Ok that's interesting and definitely worth a try ! I'll came back to report how it went, thanks !

ADD REPLY • link 5.7 years ago by Florian Bernard ▴ 90

0

Entering edit mode

Salmon now being recommended by ONT and used in several papers, it seems it was a good call ! Thanks for your help.

ADD REPLY • link 5.6 years ago by Florian Bernard ▴ 90