Question

Cufflinks For Rare Transcript Detection

12.1 years ago

Darked89 4.6k

Hello,

I have 20 RNA-Seq samples from one tissue, taken from different animals of the same species. I mapped each sample to the genome separately using tophat, then ran cufflinks and cuffmerge.

My feeling is that some transcripts expressed at very low levels are discarded by cufflinks, and that the subsequent merging cannot rescue them. Hence my questions:

Can I do any better than this?
Apart from merging the 20 individual BAM files into one giant one, what are the options?
Does anyone have experience with cufflinks alternatives that may be better from a genome annotation point of view?

Thanks a lot for your help

EDIT

@malachig:

re multiple fastq solution: this depends on how tophat does the mapping. If mapping reads in spliced mode depends heavily on the positions of already mapped reads, then sure, combining all the FASTQ files in one run is better than merging BAMs. On the other hand, if tophat cannot map 2x 96 bp RNA-Seq reads reliably without prior coverage of exons by unspliced reads or entries in a GTF file, then one should check other mappers.

re .gtf: Yes, I have the ENSEMBL annotation, which I used for mapping. Here is the relevant part of the tophat command:

--min-intron-length 21 --max-intron-length 200000   --segment-mismatches 1 --butterfly-search --GTF my_ensembl.gtf

re other approaches:

I also used our in-house pipeline based on GEM for mapping. I will try Trinity, and possibly also Trans-ABySS.

cufflinks transcriptome genome tophat • 3.0k views

updated 12.1 years ago by Malachi Griffith 19k • written 12.1 years ago by Darked89 4.6k

score 1 · Answer 1 · 2012-03-10

It seems like you would not actually need to merge the 20 BAMs to run Tophat with all the data at once. Simply supply Tophat with your input fastq files from 20 individual samples as if they were multiple lanes for the same sample.
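A minimal sketch of what that pooled run could look like, assuming paired-end reads. The sample names, FASTQ naming scheme, index name, and output directory are all hypothetical placeholders; the only thing relied on is that TopHat accepts comma-separated lists of read files. The script only builds and prints the command, it does not run TopHat.

```python
import shlex


def pooled_tophat_cmd(index_base, samples, gtf=None, out_dir="tophat_pooled"):
    """Build one TopHat command that aligns the reads of all samples together."""
    # TopHat takes comma-separated lists of FASTQ files, one list per mate.
    # The <sample>_1.fastq / <sample>_2.fastq naming is an assumption.
    left = ",".join(f"{s}_1.fastq" for s in samples)
    right = ",".join(f"{s}_2.fastq" for s in samples)
    cmd = ["tophat", "-o", out_dir]
    if gtf is not None:
        # Optional annotation to seed the junction search.
        cmd += ["-G", gtf]
    cmd += [index_base, left, right]
    return cmd


# Hypothetical names for the 20 samples: sample01 ... sample20.
samples = [f"sample{i:02d}" for i in range(1, 21)]
cmd = pooled_tophat_cmd("genome_index", samples, gtf="my_ensembl.gtf")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell once the placeholder names are replaced with real paths.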

You might also consider supplying a .gtf file to Tophat using the -G option to help Tophat find splice junctions. Make this file as comprehensive as possible. For example, you might merge transcripts from RefSeq, Ensembl, UCSC, Vega, CCDS, MGC, etc. Note that the '-G' option for Tophat is not related to the '-G' option for Cufflinks. In particular, it does not limit you to known junctions only; instead, Tophat builds a junction database from the .gtf and checks against it before attempting to find novel exon-exon junctions.
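To make the difference between the two '-G' options concrete, here is a rough side-by-side sketch. All file and directory names are hypothetical. The flags shown (TopHat `-G` and `--no-novel-juncs`, Cufflinks `-G` and `-g`) are the documented ones as far as I know, but check them against the version you have installed. Nothing is executed; the commands are only printed.

```python
import shlex

REF_GTF = "merged_reference.gtf"  # hypothetical combined annotation
READS = ["genome_index", "reads_1.fastq", "reads_2.fastq"]  # hypothetical index and FASTQ names
BAM = "tophat_out/accepted_hits.bam"  # TopHat's alignment output

commands = {
    # TopHat -G: known junctions are tried first, novel junctions are still searched.
    "tophat_annotation_plus_novel":
        ["tophat", "-G", REF_GTF, "-o", "tophat_out"] + READS,
    # TopHat -G with --no-novel-juncs: only junctions present in the annotation.
    "tophat_known_junctions_only":
        ["tophat", "-G", REF_GTF, "--no-novel-juncs", "-o", "tophat_out"] + READS,
    # Cufflinks -G: quantify the annotated transcripts only, no novel assembly.
    "cufflinks_reference_only":
        ["cufflinks", "-G", REF_GTF, "-o", "cufflinks_ref", BAM],
    # Cufflinks -g: use the annotation as a guide but still report novel transcripts.
    "cufflinks_reference_guided":
        ["cufflinks", "-g", REF_GTF, "-o", "cufflinks_guided", BAM],
}

for label, cmd in commands.items():
    print(f"{label}: {shlex.join(cmd)}")
```

For finding rare or novel transcripts, the first and last variants are the ones that keep novel discovery switched on.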

With the resulting BAM file you could run Cufflinks and predict a more 'complete' set of transcripts. You could then take the resulting .gtf file, merge it with your original reference .gtf file and start the whole process over again using one sample at a time.
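The second pass described above could be laid out roughly as follows. This is a sketch under several assumptions: directory and sample names are hypothetical, `cuffmerge -g` is used as one possible way to merge the pooled assembly with the reference .gtf, and `assemblies.txt` is a manifest you would have to create yourself. The function only returns the planned commands and runs nothing.

```python
import shlex


def second_pass_plan(index_base, ref_gtf, samples,
                     pooled_bam="tophat_pooled/accepted_hits.bam"):
    """Return the commands for: pooled assembly, merge with reference, per-sample realignment."""
    steps = []
    # 1. Assemble transcripts from the pooled alignments.
    #    Cufflinks writes transcripts.gtf into the output directory.
    steps.append(["cufflinks", "-o", "cufflinks_pooled", pooled_bam])
    # 2. Merge the pooled assembly with the reference annotation.
    #    assemblies.txt (hypothetical name) lists one GTF path per line,
    #    here just cufflinks_pooled/transcripts.gtf.
    steps.append(["cuffmerge", "-g", ref_gtf, "-o", "merged_pooled", "assemblies.txt"])
    # 3. Realign each sample separately, guided by the merged annotation.
    merged_gtf = "merged_pooled/merged.gtf"
    for s in samples:
        steps.append(["tophat", "-G", merged_gtf, "-o", f"tophat_{s}",
                      index_base, f"{s}_1.fastq", f"{s}_2.fastq"])
    return steps


# Two hypothetical samples shown for brevity; the real run would list all 20.
steps = second_pass_plan("genome_index", "my_ensembl.gtf", ["sample01", "sample02"])
for cmd in steps:
    print(shlex.join(cmd))
```

After the per-sample realignment, Cufflinks would be run again on each sample's BAM, as in the original one-sample-at-a-time workflow.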

If you really want to avoid the Tophat/Cufflinks paradigm, you could try RUM, Trinity, or Trans-ABySS.