Question

Cufflinks with large BAM file

0

Entering edit mode

6.6 years ago

qudrat ▴ 100

What if I split the BAM file by chromosome using following command samtools view -b input.bam chr1 chr2 chr3 chr4 > output.bam and then Cufflinks is done. Would it produce false positive? Can somebody help me understand or give me some reference article? Actually it has been hampering my work for long and I would be extremely grateful.

RNA-Seq Assembly • 2.1k views

ADD COMMENT • link updated 6.6 years ago by andrew.j.skelton73 6.5k • written 6.6 years ago by qudrat ▴ 100

1

Entering edit mode

Your post has already been answered, and I agree with Andrew (a sentiment I echo in our other thread), but I believe that it also depends on the type of experiment that you're doing: The one part where you should not split by chromosome is if you are looking to use the normalised FPKM (or geometric) values from Cufflinks. Most people don't use these values from Cufflinks, and instead use Cufflinks for just producing an individual transcriptome reference GTF per sample. In this case, I cannot see how spliting the BAM and then anlysing separately is an issue. I would alsosplit any reference/guide GTF that you're supplying, too.

This guy agrees: http://seqanswers.com/forums/showthread.php?t=8866

This person has evidence that splitting is not ok: https://groups.google.com/forum/#!topic/tuxedo-tools-users/59Tk3qZlbp8

I would encourage you to run an experiment with a BAM file separely for chr1 and chr2, and then a third time using chr1+chr2. Your system should be capable to perform this. Let us know what results you find.

ADD REPLY • link 6.6 years ago by Kevin Blighe 87k

0

Entering edit mode

I run Cufflinks with BAM file for chr1 and chr2 separately and then a third time with BAM file of chr1+chr2 and the number of processed loci for chr1 is 46184 and for chr2 is 44979. For chr1+chr2, it 46184+44979=91163

ADD REPLY • link 6.6 years ago by qudrat ▴ 100

0

Entering edit mode

Thanks for the test! Did you use the -G parameter and provide a reference GTF?

Are the FPKM values in the results files the same?

ADD REPLY • link 6.6 years ago by Kevin Blighe 87k

0

Entering edit mode

I did not use the -G parameter, my command was: Cufflink -p 4 -o out.put BAM. FPKM values were different

ADD REPLY • link 6.6 years ago by qudrat ▴ 100

1

Entering edit mode

Yes, that makes sense. The FPKM values are different because the read counts are normalised, in part, depending on the distribution of counts across all chromosomes.

So, the ultimate question for you is this: Why did you originally want to use TopHat/Cufflinks?

If you just want to identify novel transcripts and then use another program to determine raw counts over these, then you can split the BAM and identify transcripts independently per chromosome (and merge them into a single transcriptome reference GTF at the end)
If you intend to use Cuffdiff, Cuffnorm, etc., then you cannot feasibly split the BAM by chromosome

As an example for you: I recently conducted an experiment where I just wanted to identify novel transcript regions in my FASTQ files. I ran TopHat2/Cufflinks independently on each, and produced a new consensus reference GTF. From this, overlaying these new identified transcripts with my reference FASTA genome, I then produced a new reference transcriptome in FASTA (using the gffread function in cufflinks). I then used Kalliso to determine raw counts in my samples over this new reference transcriptome.

In my example, I could have split the files by chromosome.

ADD REPLY • link 6.6 years ago by Kevin Blighe 87k

1

Entering edit mode

Thank you Kevin! My aim is to find novel transcript so I can split the BAM by chromosome.

ADD REPLY • link 6.6 years ago by qudrat ▴ 100

score 1 · Answer 1 · 2017-09-26

1

Entering edit mode

6.6 years ago

andrew.j.skelton73 6.5k

As far as I know Cufflinks is in a deprecated state and you should strongly consider the alternative of Stringtie - Better performance overall. Cufflinks works on RABT assembly so splitting by chromosome shouldn't cause too many significant issues. Your FP rate will depend more on your method of differential expression. Using Stringtie, you can prepare your abundance estimations for analysis with DESeq2, which is highly reputable in analysing RNA Seq data.

ADD COMMENT • link 6.6 years ago by andrew.j.skelton73 6.5k

2

Entering edit mode

Just to echo @Kevin's thoughtful response, I agree a sanity check of Chr1+2 vs Chr1, Chr2 individually is a great idea. I also second the fact that I only use transcriptional assembly methods to get the GTF file, and clean things up with TACO.

ADD REPLY • link 6.6 years ago by andrew.j.skelton73 6.5k