Question

Filtering de novo transcriptomes - finding transcripts present in N samples

8.7 years ago

m.fletcher ▴ 20

Gidday,

I've been running de novo transcriptome assembly on a large set of RNA-seq data (n=48 samples).

I'd like to combine the per-sample assemblies into a meta-assembly (representing potential novel transcripts present in the entire sample set) that I can use for downstream analyses.

Because RNA-seq data and assembly are generally noisy, I would like to filter out transcripts that are present in fewer than 5 samples. From what I've found, the thread "Bedtools Compare Multiple Bed Files?" suggests a solution using bedtools multiinter that would identify intervals present in N samples. However, the toy example given there shows intervals being split up according to the number of samples that have each interval. Ideally, my solution would keep the longest interval within which the matching samples fall (this would correspond to keeping the longest transcript covering the region in question).

My question is: are there other approaches possible? Is there a way to do this using e.g. GRanges in R?

Right now my plan would be to

1. Remove transcripts corresponding to the reference transcriptome
2. Use bedtools multiinter on my full set of assemblies
3. Take the intervals where >=5 samples overlap
4. For each interval, extract the relevant transcripts from the samples which have an overlap
5. Take the longest transcript from each of these
6. Use cuffcompare or cuffmerge with the reference transcriptome and the unannotated transcripts from step 4 -> gives a final, filtered high-confidence transcriptome.
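The core of steps 2 to 5 can be sketched in plain Python as an alternative to splitting intervals with bedtools multiinter. This is only a minimal illustration under stated assumptions: the `Transcript` tuple, the function names and the default of 5 samples are hypothetical, coordinates are assumed to be half-open (BED style), and only the outer transcript span is compared, so exon structure is ignored. Transcripts that chain together through overlaps end up in the same locus.

```python
from collections import namedtuple

# Hypothetical record: one assembled transcript span from one sample.
Transcript = namedtuple("Transcript", "chrom start end strand sample name")


def _flush(locus, min_samples, kept):
    """Keep the longest transcript of a locus if enough samples support it."""
    if locus and len({t.sample for t in locus}) >= min_samples:
        kept.append(max(locus, key=lambda t: t.end - t.start))


def supported_loci(transcripts, min_samples=5):
    """Group overlapping same-strand transcripts into loci and return the
    longest transcript of every locus seen in >= min_samples samples."""
    kept = []
    ordered = sorted(transcripts, key=lambda t: (t.chrom, t.strand, t.start, t.end))
    locus, locus_end = [], None
    for t in ordered:
        overlaps = (
            locus
            and t.chrom == locus[0].chrom
            and t.strand == locus[0].strand
            and t.start < locus_end  # half-open coordinates assumed
        )
        if overlaps:
            locus.append(t)
            locus_end = max(locus_end, t.end)
        else:
            _flush(locus, min_samples, kept)
            locus, locus_end = [t], t.end
    _flush(locus, min_samples, kept)
    return kept
```

The returned transcripts would then be the unannotated candidates passed on to cuffcompare or cuffmerge in the last step.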

However, if there's something out there which could do the whole workflow automagically, that would be grand.

Any thoughts and suggestions greatly appreciated, and thanks in advance!

RNA-Seq assembly • 2.8k views

updated 18 months ago by Ram 43k • written 8.7 years ago by m.fletcher ▴ 20


Try renaming the headers in each assembly. For example, the sample 1 assembly should have transcripts named >sample1_contig_id1, >sample1_contig_id2, and so on. Similarly, sample 2 should have >sample2_contig_id1, >sample2_contig_id2, and so on. (A simple sed command should do this.)
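If sed is not convenient, the same renaming can be sketched in Python. The function name and the file names in the usage comment are hypothetical; the only assumption about the input is that header lines start with ">".

```python
def prefix_fasta_headers(lines, sample):
    """Yield FASTA lines with '<sample>_' prepended to every sequence ID."""
    for line in lines:
        if line.startswith(">"):
            # Keep the rest of the header (ID and description) unchanged.
            yield f">{sample}_{line[1:]}"
        else:
            yield line


# Usage (hypothetical file names):
# with open("sample1.fasta") as src, open("sample1.renamed.fasta", "w") as dst:
#     dst.writelines(prefix_fasta_headers(src, "sample1"))
```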

Then cluster them using cd-hit-est, with identity and coverage thresholds chosen so that shorter transcripts collapse into the longest contig. Contigs that are present in multiple assemblies will form bigger clusters. You can then inspect the clusters and pick those that contain contigs from at least N different samples (N = 5 in your case).

Because cd-hit-est selects a representative sequence for each cluster (the longest one), you should be able to use the resulting non-redundant FASTA (depending on the parameters) for further analysis.
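Picking the clusters by sample count can be sketched as below, reading the .clstr file that cd-hit-est writes next to its FASTA output. This assumes the usual .clstr layout (a ">Cluster N" line followed by member lines such as "0	900nt, >sample1_contig_id1... *", with "*" marking the representative) and assumes that the sample name is everything before the first underscore in the renamed header. The function name and the default of 5 samples are hypothetical. Note that sequence names in the .clstr file can be truncated depending on the description-length setting, so check that the sample prefix survives.

```python
import re

# Member line: index, length, ">name...", then "*" (representative) or "at ...".
MEMBER = re.compile(r">(.+?)\.\.\.\s*(\*|at\b)")


def well_supported_representatives(clstr_lines, min_samples=5):
    """Return representative IDs of clusters whose members come from at
    least min_samples distinct samples."""
    reps = []
    rep, samples = None, set()
    # A sentinel header makes the loop flush the final cluster as well.
    for line in list(clstr_lines) + [">Cluster END"]:
        if line.startswith(">Cluster"):
            if rep is not None and len(samples) >= min_samples:
                reps.append(rep)
            rep, samples = None, set()
            continue
        match = MEMBER.search(line)
        if not match:
            continue
        name = match.group(1)
        samples.add(name.split("_", 1)[0])  # assumed "<sample>_<contig>" naming
        if match.group(2) == "*":
            rep = name
    return reps
```

The returned IDs can then be used to subset the non-redundant FASTA.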

updated 18 months ago by Ram 43k • written 8.7 years ago by Prakki Rama ★ 2.7k


Thanks, I will definitely try that approach out also!

8.7 years ago by m.fletcher ▴ 20

score 0 · Answer 1 · 2016-04-22

M.fletcher,

I'd use cd-hit-est followed by cap3 to provide a condensed overlap layout re-assembly after de novo assembly...then...

Are you familiar with R? Surely you can do an RSEM run, build an expression matrix, filter the matrix based on your criteria, and remove the transcripts that fail? I'm sure this can also be done using Perl or Python, although I'm not too familiar with those languages.
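The matrix filter could look roughly like this in Python. It is a sketch under assumptions not stated in the thread: the matrix is held as a dict mapping transcript ID to one expression value per sample (for example TPM values collected from the per-sample RSEM results), and a transcript counts as "present" in a sample when its value reaches a hypothetical `min_tpm` cutoff.

```python
def filter_by_sample_support(matrix, min_samples=5, min_tpm=1.0):
    """Keep transcripts whose expression reaches min_tpm in at least
    min_samples samples. `matrix` maps transcript ID -> list of values."""
    return {
        transcript_id: values
        for transcript_id, values in matrix.items()
        if sum(value >= min_tpm for value in values) >= min_samples
    }
```

Both thresholds are placeholders and should be chosen to suit the data.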