I've been running de novo transcriptome assembly on a large set of RNA-seq data (n=48 samples).
I'd like to combine the per-sample assemblies into a meta-assembly (representing potential novel transcripts present in the entire sample set) that I can use for downstream analyses.
Due to the general noisiness of RNAseq and assembly, I would like to filter out transcripts that are present in < 5 samples. From what I've found, Bedtools Compare Multiple Bed Files? suggests a solution using bedtools multiinter that would identify intervals present in N samples. However, the toy example given there shows intervals being split up based on the #samples that have the interval - ideally, my solution would keep the longest interval within which the samples with matches fall (this would correspond to keeping the longest transcript covering the region in question).
My question is: are there other approaches possible? Is there a way to do this using e.g. GRanges in R?
Right now my plan would be to
- Remove transcripts corresponding to the reference transcriptome
- Use bedtools multiinter on my full set of assemblies
- Take the intervals where >=5 samples overlap
- For each interval, extract the relevant transcripts from the samples which have an overlap
- Take the longest transcript from each of these
- Use cuffcompare or cuffmerge with the reference transcriptome and the unannotated transcripts from step 4 -> gives a final, filtered high-confidence transcriptome.
However, if there's something out there which could do the whole workflow automagically, that would be grand.
Any thoughts and suggestions greatly appreciated, and thanks in advance!