I am working with chicken data and we would like to improve the gene annotation by combining short-read (Illumina) and long-read (Nanopore) data. We therefore decided to build a de novo transcriptome assembly, guided by the available chicken genome.
I tried different approaches:
- StringTie2, which gives a lot of artifacts (we end up with 60,000 genes!)
- Scallop-LR, which does not work (it only works with PacBio data)
- Scallop, which works fine but also gives a lot of artifacts
In each case, I tried a) running the 2 datasets together (in one run) and b) running the 2 datasets separately and then merging the results. The caveat of a) is that the parameters used for long reads are very different from those used for short reads, so I have to choose something "in-between" that is not optimized for either. The caveat of b) is that it increases the number of genes detected, because we keep lots of artifactual transcripts.
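One way to trim the artifacts left by approach b) would be a post-merge filter on the GTF. Below is a minimal sketch, assuming the GTF carries a StringTie-style `cov` attribute on its `transcript` lines. The function name `filter_gtf` and both thresholds are hypothetical and would need tuning on real data.

```python
import re

# Matches GTF attribute pairs such as: transcript_id "STRG.1.1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def filter_gtf(lines, min_cov=2.0, min_single_exon_cov=10.0):
    """Return the GTF lines of transcripts that pass a coverage filter.

    Single-exon transcripts must reach a stricter threshold, since they are
    a common source of artifacts. Both thresholds are illustrative guesses.
    """
    records = {}  # transcript_id -> coverage, exon count, original lines
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        rec = records.setdefault(tid, {"cov": 0.0, "exons": 0, "lines": []})
        rec["lines"].append(line)
        if cols[2] == "transcript":
            # Assumption: `cov` is present; a missing value counts as 0.
            rec["cov"] = float(attrs.get("cov", 0.0))
        elif cols[2] == "exon":
            rec["exons"] += 1

    kept = []
    for rec in records.values():
        threshold = min_single_exon_cov if rec["exons"] <= 1 else min_cov
        if rec["cov"] >= threshold:
            kept.extend(rec["lines"])
    return kept
```

Calling `filter_gtf(open("merged.gtf"))` (the file name is a placeholder) returns the surviving lines in their original order, which can then be written back out and compared against the reference annotation.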
Of course, I could use more stringent parameters for the merging, but I am wondering whether any of you have experience integrating short and long reads. How would you reduce the number of false positives? I know I could also use Mikado, Trinity, or IDP-denovo for this kind of issue. Any feedback on using these tools (or any others) in this context would be welcome!
Have you used the available chicken transcripts?
Yes, but we found a lot of signal outside the annotated genes and wanted to investigate further.
Do you get a lot of full-length RNAs in the Nanopore data? Try mapping the reads to the known transcriptome. I would assemble with the long reads and correct with the short reads. It's black magic though; there is no one protocol that works for all.
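To get a rough idea of how many Nanopore reads are full-length, one option is to align them to the known transcripts and check how much of each transcript a read spans. This is only a sketch, assuming the alignments are in PAF format (as written by minimap2). The function name and the 90% cutoff are hypothetical choices, not an established standard.

```python
def full_length_fraction(paf_lines, min_target_cov=0.9):
    """Fraction of aligned reads that span >= min_target_cov of a transcript.

    PAF columns used (0-based): 0 = read name, 6 = target length,
    7 = target start, 8 = target end. For a read with several alignments,
    the one covering the largest share of its target is used.
    """
    best = {}  # read name -> best fraction of a transcript covered
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # not a complete PAF record
        read = cols[0]
        tlen, tstart, tend = int(cols[6]), int(cols[7]), int(cols[8])
        if tlen == 0:
            continue
        covered = (tend - tstart) / tlen
        best[read] = max(best.get(read, 0.0), covered)

    if not best:
        return 0.0
    n_full = sum(1 for c in best.values() if c >= min_target_cov)
    return n_full / len(best)
```

A low fraction would suggest that most reads are fragments. In that case assembling from the long reads alone is less attractive, and the short reads carry more of the weight.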
My Nanopore coverage is very low, so I'm not sure how relevant this would be, but I could try.