I've been running de novo transcriptome assembly on a large set of RNA-seq data (n=48 samples).
I'd like to combine the per-sample assemblies into a meta-assembly (representing potential novel transcripts present in the entire sample set) that I can use for downstream analyses.
Due to the general noisiness of RNAseq and assembly, I would like to filter out transcripts that are present in < 5 samples. From what I've found, Bedtools Compare Multiple Bed Files? suggests a solution using bedtools multiinter that would identify intervals present in N samples. However, the toy example given there shows intervals being split up based on the #samples that have the interval - ideally, my solution would keep the longest interval within which the samples with matches fall (this would correspond to keeping the longest transcript covering the region in question).
My question is: are there other approaches possible? Is there a way to do this using e.g. GRanges in R?
Right now my plan would be to
1. remove transcripts corresponding to the reference transcriptome
1. use bedtools multiinter on my full set of assemblies
2. take the intervals where >=5 samples overlap
3. for each interval, extract the relevant transcripts from the samples which have an overlap
4. take the longest transcript from each of these
5. use cuffcompare or cuffmerge with the reference transcriptome and the unannotated transcripts from step 4 -> gives a final, filtered high-confidence transcriptome.
However, if there's something out there which could do the whole workflow automagically, that would be grand.
Any thoughts and suggestions greatly appreciated, and thanks in advance!