Question: Filtering de novo transcriptomes - finding transcripts present in N samples
5.6 years ago, m.fletcher wrote:


I've been running de novo transcriptome assembly on a large set of RNA-seq data (n=48 samples).

I'd like to combine the per-sample assemblies into a meta-assembly (representing potential novel transcripts present in the entire sample set) that I can use for downstream analyses.

Because RNA-seq data and assemblies are generally noisy, I would like to filter out transcripts that are present in fewer than 5 samples. From what I've found, the thread "Bedtools Compare Multiple Bed Files?" suggests a solution using bedtools multiinter, which would identify intervals present in N samples. However, the toy example given there shows intervals being split up according to the number of samples that share each sub-interval. Ideally, my solution would keep the longest interval within which the matching samples fall (this would correspond to keeping the longest transcript covering the region in question).

My question is: are there other approaches possible? Is there a way to do this using e.g. GRanges in R?

Right now my plan would be to

1. remove transcripts corresponding to the reference transcriptome

2. use bedtools multiinter on my full set of assemblies

3. take the intervals where >=5 samples overlap

4. for each interval, extract the relevant transcripts from the samples which have an overlap

5. take the longest transcript from each of these

6. use cuffcompare or cuffmerge with the reference transcriptome and the unannotated transcripts from step 5 -> gives a final, filtered high-confidence transcriptome.
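One way to get "longest transcript per supported region" without the interval splitting that multiinter produces is to merge overlapping transcripts into loci and count samples per locus. The following is only a minimal sketch of that idea in plain Python, not a bedtools or GRanges workflow. The `Transcript` tuple and the function name are hypothetical, coordinates are assumed to be BED-style (0-based, half-open), and strand and exon structure are ignored, so overlapping transcripts can chain into one large locus.

```python
from collections import namedtuple

# Hypothetical record: one assembled transcript from one sample (BED-style coordinates).
Transcript = namedtuple("Transcript", "chrom start end sample name")


def longest_per_supported_locus(transcripts, min_samples=5):
    """Group overlapping transcripts into loci and return the longest transcript
    of every locus that is supported by at least `min_samples` distinct samples."""
    kept = []
    ordered = sorted(transcripts, key=lambda t: (t.chrom, t.start, t.end))
    locus = []         # transcripts in the locus currently being built
    locus_end = None   # right-most end coordinate seen in that locus

    def flush():
        # Keep the longest member if enough distinct samples contribute.
        if locus and len({t.sample for t in locus}) >= min_samples:
            kept.append(max(locus, key=lambda t: t.end - t.start))

    for t in ordered:
        if locus and t.chrom == locus[0].chrom and t.start < locus_end:
            # Overlaps the current locus: extend it.
            locus.append(t)
            locus_end = max(locus_end, t.end)
        else:
            # No overlap: close the previous locus and start a new one.
            flush()
            locus = [t]
            locus_end = t.end
    flush()
    return kept
```

The transcripts would come from parsing each sample's BED or GTF file after the reference-matching transcripts have been removed; the returned records could then be written out and passed on to cuffcompare or cuffmerge.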

However, if there's something out there which could do the whole workflow automagically, that would be grand.

Any thoughts and suggestions greatly appreciated, and thanks in advance!

rna-seq assembly • 2.1k views

Try renaming the headers in each assembly. For example, the sample 1 assembly should have transcripts named ">sample1_contig_id1", ">sample1_contig_id2", and similarly for sample 2: ">sample2_contig_id1", ">sample2_contig_id2", and so on (a simple sed command should do this).
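If sed is not convenient, the same renaming can be sketched in a few lines of Python. The function name and the file names in the usage note are hypothetical; the only assumption about the input is that it is ordinary FASTA with header lines starting with ">".

```python
def prefix_fasta_headers(lines, sample):
    """Yield FASTA lines with '<sample>_' prepended to every header name.

    Sequence lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            yield ">" + sample + "_" + line[1:]
        else:
            yield line
```

For example, `with open("sample1.fasta") as src, open("sample1.renamed.fasta", "w") as dst: dst.writelines(prefix_fasta_headers(src, "sample1"))` would write a renamed copy; the renamed files from all samples can then be concatenated before clustering.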

Then cluster them using cd-hit-est in such a way that shorter transcripts meeting a given % identity and % coverage collapse into the longest contig. Contigs that are present in multiple assemblies will form bigger clusters. You can then inspect the clusters and pick those that contain contigs from at least the required number of different samples.

Because cd-hit-est selects the representative of each cluster, you should be able to use the resulting non-redundant FASTA (depending on parameters) for further analysis.
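Picking the clusters with enough samples can be scripted from the .clstr file that cd-hit-est writes next to the FASTA output. This is a rough sketch under some assumptions: headers follow the "sampleX_contig..." pattern described above, sample names themselves contain no underscore, and the names in the .clstr file are not truncated in a way that cuts off the sample prefix (cd-hit limits the description length there by default, which I believe the -d option controls, so check your output). The function name is hypothetical.

```python
import re

# Member lines look like: "0<TAB>2799nt, >sample1_contig_id1... *"
MEMBER = re.compile(r">(\S+?)\.\.\.")


def supported_representatives(clstr_lines, min_samples=5):
    """Return the representative names of clusters that contain contigs
    from at least `min_samples` distinct samples."""
    kept = []
    samples, representative = set(), None

    def flush():
        # Record the finished cluster if enough samples contributed to it.
        if representative is not None and len(samples) >= min_samples:
            kept.append(representative)

    for line in clstr_lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            flush()
            samples, representative = set(), None
        elif line:
            match = MEMBER.search(line)
            if match is None:
                continue
            name = match.group(1)
            samples.add(name.split("_", 1)[0])  # sample prefix before first "_"
            if line.endswith("*"):               # "*" marks the representative
                representative = name
    flush()
    return kept
```

The returned names can then be used to subset the non-redundant FASTA to the representatives that have support from enough samples.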

modified 5.6 years ago • written 5.6 years ago by Prakki Rama

Thanks, I will definitely try that approach out also!

written 5.6 years ago by m.fletcher
4.9 years ago, Biogeek wrote:


I'd use cd-hit-est followed by cap3 to provide a condensed overlap layout re-assembly after de novo assembly...then...

Are you familiar with R? Surely you can do an RSEM run, build a matrix, filter the matrix based on your criteria, and remove such transcripts? I'm sure this can also be done using Perl or Python, although I'm not too familiar with those languages.
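As a rough illustration of the matrix filtering in Python: assume every sample has been quantified against the same merged transcript set and the abundances (for example TPM) have been collected into a mapping from transcript ID to one value per sample. The function name and the cutoff of 1.0 for calling a transcript "present" are assumptions for the sketch, not recommended values.

```python
def transcripts_in_enough_samples(matrix, min_samples=5, min_abundance=1.0):
    """Return IDs of transcripts whose abundance reaches `min_abundance`
    in at least `min_samples` samples.

    `matrix` maps transcript ID -> list of per-sample abundance values.
    """
    return [
        transcript_id
        for transcript_id, values in matrix.items()
        if sum(value >= min_abundance for value in values) >= min_samples
    ]
```

Transcripts not in the returned list would be the ones to drop from the assembly.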
