Hi everyone. Need your advice on an issue I'm having.
So, I've used Trinity to create transcriptome assemblies from ~30 RNA-Seq data sets downloaded from SRA. All data sets come from the same species (tomato), but from different varieties, conditions, and tissues. My aim is to produce some kind of 'pan-transcriptome', so I'm trying to get very diverse data. As a result, I can't just throw all the reads at Trinity at once - that's a huge amount of data, and the data sets differ considerably in read length, coverage, library type, etc.
Now I have ~30 FASTA files, each derived from a single data set. Next, I'd like to merge them into a single, unified set of transcripts. I intend to use it to annotate multiple other genomes of tomato and its wild relatives that I have assembled de novo. Here are my questions:
1) Do you know of a tool that can do the merging I'm looking for? By "merge" I mean doing it in a smart way: for example, using overlaps to elongate transcripts, collapsing partial transcripts into full-length ones where they exist in the data, and so on. So far I've looked into StringTie's merge function, but it requires GFF files rather than FASTA (as produced by Trinity), and DRAP's runMeta module, but that one requires DRAP assembly outputs, which I don't have.
2) I've read an old post suggesting that an OLC (overlap-layout-consensus) assembler might be useful here. Do you think that could work? Can you recommend a good, modern assembler aimed at transcriptomic data?
3) Do you think the merge step is even necessary? I'm planning to use the MAKER annotation pipeline, and I'm not sure how it would perform with a messy transcript set containing duplicates and partial transcripts. But maybe that's not a problem, and I'm wasting time on an unnecessary step?
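To make question 1 more concrete, here is a toy Python sketch of the simplest part of what I mean by "collapse": dropping any transcript that is an exact substring of a longer one. A real tool would of course also use inexact overlaps to extend transcripts and tolerate mismatches; this is only to illustrate the goal, and the example sequences are made up.

```python
def collapse_contained(seqs):
    """Keep only sequences that are not exact substrings of a longer kept one.

    Toy illustration only: O(n^2), exact matches, no overlap extension.
    """
    kept = []
    # Longest first, so containment checks only look at longer survivors.
    for s in sorted(seqs, key=len, reverse=True):
        if not any(s in k for k in kept):
            kept.append(s)
    return kept

# Hypothetical mini transcript set: one full transcript, one partial
# copy of it, one unrelated transcript, and one exact duplicate.
transcripts = ["ATGCCGTA", "GCCGT", "TTTT", "ATGCCGTA"]
print(collapse_contained(transcripts))  # ['ATGCCGTA', 'TTTT']
```

The partial transcript and the duplicate both get collapsed into the full-length one, which is the behavior I'd want (plus overlap-based elongation) from a proper merging tool.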
Of course, any other advice would be appreciated.
Thanks a lot!