Hi,
I searched the previous posts and the nanopore community forum but couldn't find a definitive answer so here I am looking for your advice. I know working with long-reads RNA-seq data is still complicated because it's hard to find a general consensus on how to analyse them but I'd be interested in having your opinion.
I'm interested in measuring the number of genes detected in various RNA-seq libraries that I have generated with various nanopore library kits (DNA ligation of RT-PCR products, direct-cDNA and direct-RNA).
So far I've been using HTseq with the mode intersection-nonempty
but it seems that now the union
mode is the most recommended one. And i've also seen posts stating that featureCounts is much better and there's even an option for longreads
.
I tried to compare the output of the three 'possibilities' but I cannot say it really helped me in the end:
HTseq + intersection-nonempty mode: 238K assigned + 0 ambiguous + 18K no_features + 24K too low quality
HTseq + union mode: 210K assigned + 44K ambiguous + 2K no_features + 24K too low quality
featureCounts + long-reads mode: 219K assigned + 59K ambiguous + 2K no_features + 0 too low quality
Based on those observations, would you be able to recommend me to use one versus the two others ? And if yes, why ? I'm trying to understand what makes one more suited to my need and/or generally more suited to the analyse of long reads RNA-seq data.
Thanks a lot.
Based on these observations alone, I really believe there is no way of deciding between them. Please provide more information, e.g. command-lines used and if the libraries are stranded or not.
Then if it's all the same, since I'm more interested by reads that have been assigned to a gene, can we say that
HTseq + intersection-nonempty
produce the more useful result ? At the same time I'm worried that I might be working with reads wrongly assigned.. Having very little experience in bioinformatics I don't think I have enough hindsight to to say if i should go for the lower number of assigned reads or the bigger..I didn't say it is all the same, I said there is not enough information for us to provide a good suggestion.
I also think it's quite hard to decide between the methods. I'd like to suggest to use the -o option from HTSeq (the input format has to be SAM in this case). You can annotate the reads with the feature (gene id) and HTSeq's meta features. For genes where the read count differs, you can trace back the reads which made the difference.
Additionally, please keep in mind, that that the 5' site of the transcripts might be not 100% exactly annotated. Allowing for a read to jut out might not be the worst idea.
I've been using the -o option already (I need it for a downstream analysis) but even if i manage to distinguish reads that are being counted from reads that are not, how will I be able to say why ?
Hello, Florian,
I have the same question and I'm very interested in your results. What was the best practice? Firstly minimap2 then salmon?