I am working with chicken data, and we would like to improve the gene annotation by combining short-read (Illumina) and long-read (Nanopore) data. We therefore decided to build a de novo transcriptome assembly, guided by the available chicken genome.
I tried different approaches:
- StringTie2, which produces a lot of artifacts (we end up with 60,000 genes!)
- Scallop-LR, which does not work on our data (it only works with PacBio data)
- Scallop, which works fine but also produces a lot of artifacts
In each case, I tried a) running the two datasets together in a single run, and b) running the two datasets separately and then merging the results. The caveat of a) is that the parameters suited to long reads are very different from those suited to short reads, so I have to choose something "in-between" that is not optimal for either. The caveat of b) is that it increases the number of genes detected, because many artifactual transcripts from each run are kept in the merge.
Of course, I could use more stringent parameters for the merging, but I am wondering whether any of you have experience with integrating short and long reads. How would you reduce the number of false positives? I know I could also use Mikado, Trinity, or IDP-denovo for this kind of issue. Any feedback on using these tools (or any others) in this context would be welcome!
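To make the "more stringent" idea concrete, here is a minimal sketch of the kind of filter I could apply to each per-technology assembly before merging. It assumes the GTF has `transcript` lines carrying a `cov` attribute (as StringTie output does; other assemblers may name it differently). The function name `filter_transcripts` and the thresholds `min_cov=2.0` and `min_exons=2` are placeholders of mine, not recommended values, and dropping single-exon transcripts would also discard genuine single-exon genes.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the 9th GTF column into a {key: value} dict."""
    return dict(ATTR_RE.findall(field))


def filter_transcripts(gtf_lines, min_cov=2.0, min_exons=2):
    """Keep transcripts with coverage >= min_cov and at least min_exons exons.

    Thresholds are illustrative assumptions, not tuned values.
    Returns the GTF lines (transcript + exon) of the transcripts kept.
    """
    records = defaultdict(list)    # transcript_id -> its GTF lines
    coverage = {}                  # transcript_id -> cov value
    exon_count = defaultdict(int)  # transcript_id -> number of exons

    for line in gtf_lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue
        attrs = parse_attributes(cols[8])
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        records[tid].append(line)
        if cols[2] == "transcript":
            # Transcripts without a cov attribute are treated as cov 0.
            coverage[tid] = float(attrs.get("cov", 0.0))
        elif cols[2] == "exon":
            exon_count[tid] += 1

    kept = []
    for tid, lines in records.items():
        if coverage.get(tid, 0.0) >= min_cov and exon_count[tid] >= min_exons:
            kept.extend(lines)
    return kept
```

The filtered lines could then be written back to a GTF file and passed to the merging step, so that low-support transcripts from either technology never reach the merged annotation.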