Filtering de novo transcriptomes - finding transcripts present in N samples
Entering edit mode
9.0 years ago
m.fletcher ▴ 20


I've been running de novo transcriptome assembly on a large set of RNA-seq data (n=48 samples).

I'd like to combine the per-sample assemblies into a meta-assembly (representing potential novel transcripts present in the entire sample set) that I can use for downstream analyses.

Due to the general noisiness of RNAseq and assembly, I would like to filter out transcripts that are present in < 5 samples. From what I've found, Bedtools Compare Multiple Bed Files? suggests a solution using bedtools multiinter that would identify intervals present in N samples. However, the toy example given there shows intervals being split up based on the #samples that have the interval - ideally, my solution would keep the longest interval within which the samples with matches fall (this would correspond to keeping the longest transcript covering the region in question).

My question is: are there other approaches possible? Is there a way to do this using e.g. GRanges in R?

Right now my plan would be to

  1. Remove transcripts corresponding to the reference transcriptome
  2. Use bedtools multiinter on my full set of assemblies
  3. Take the intervals where >=5 samples overlap
  4. For each interval, extract the relevant transcripts from the samples which have an overlap
  5. Take the longest transcript from each of these
  6. Use cuffcompare or cuffmerge with the reference transcriptome and the unannotated transcripts from step 4 -> gives a final, filtered high-confidence transcriptome.

However, if there's something out there which could do the whole workflow automagically, that would be grand.

Any thoughts and suggestions greatly appreciated, and thanks in advance!

RNA-Seq assembly • 2.9k views
Entering edit mode

Try renaming your headers in each of the assembly. For example, sample1 assembly should have transcripts with name >sample1_contig_id1, >sample1_contig_id2. Similarly for sample 2, with >sample2_contig_id1, >sample2_contig_id2, .. so on. (A simple sed command should do this)

Then cluster them using cd-hit-est in such a way that the shorter transcripts with certain %identity and with %coverage would collapse into the longest contig. Those contigs which are present in multiple assemblies will form a bigger cluster. You can look into it and pick the clusters which have contigs from at least different samples.

Because cd-hit-est selects the representative you should be able to use that non-redundant fasta (depends on parameters) for further analysis.

Entering edit mode

Thanks, I will definitely try that approach out also!

Entering edit mode
8.2 years ago
Biogeek ▴ 470


I'd use cd-hit-est followed by cap3 to provide a condensed overlap layout re-assembly after de novo assembly...then...

Are you familiar with R? Surely you can do an RSEM run, build a matrix. filter the matrix based on your criteria, and remove such transcripts? I'm sure this can also be done using perl and python, although i'm not familiar with them languages too much.


Login before adding your answer.

Traffic: 2809 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6