Question: Cufflinks with large BAM file
0
gravatar for qudrat
2.2 years ago by
qudrat70
NATIONAL INSTITUTE OF IMMUNOLOGY, INDIA
qudrat70 wrote:

What if I split the BAM file by chromosome using following command samtools view -b input.bam chr1 chr2 chr3 chr4 > output.bam and then Cufflinks is done. Would it produce false positive? Can somebody help me understand or give me some reference article? Actually it has been hampering my work for long and I would be extremely grateful.

rna-seq assembly • 845 views
ADD COMMENTlink modified 2.2 years ago by andrew.j.skelton735.9k • written 2.2 years ago by qudrat70
1

Your post has already been answered, and I agree with Andrew (a sentiment I echo in our other thread), but I believe that it also depends on the type of experiment that you're doing: The one part where you should not split by chromosome is if you are looking to use the normalised FPKM (or geometric) values from Cufflinks. Most people don't use these values from Cufflinks, and instead use Cufflinks for just producing an individual transcriptome reference GTF per sample. In this case, I cannot see how spliting the BAM and then anlysing separately is an issue. I would alsosplit any reference/guide GTF that you're supplying, too.

This guy agrees: http://seqanswers.com/forums/showthread.php?t=8866

This person has evidence that splitting is not ok: https://groups.google.com/forum/#!topic/tuxedo-tools-users/59Tk3qZlbp8

I would encourage you to run an experiment with a BAM file separely for chr1 and chr2, and then a third time using chr1+chr2. Your system should be capable to perform this. Let us know what results you find.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Kevin Blighe52k

I run Cufflinks with BAM file for chr1 and chr2 separately and then a third time with BAM file of chr1+chr2 and the number of processed loci for chr1 is 46184 and for chr2 is 44979. For chr1+chr2, it 46184+44979=91163

ADD REPLYlink written 2.2 years ago by qudrat70

Thanks for the test! Did you use the -G parameter and provide a reference GTF?

Are the FPKM values in the results files the same?

ADD REPLYlink written 2.2 years ago by Kevin Blighe52k

I did not use the -G parameter, my command was: Cufflink -p 4 -o out.put BAM. FPKM values were different

ADD REPLYlink written 2.2 years ago by qudrat70
1

Yes, that makes sense. The FPKM values are different because the read counts are normalised, in part, depending on the distribution of counts across all chromosomes.

So, the ultimate question for you is this: Why did you originally want to use TopHat/Cufflinks?

  • If you just want to identify novel transcripts and then use another program to determine raw counts over these, then you can split the BAM and identify transcripts independently per chromosome (and merge them into a single transcriptome reference GTF at the end)

  • If you intend to use Cuffdiff, Cuffnorm, etc., then you cannot feasibly split the BAM by chromosome

As an example for you: I recently conducted an experiment where I just wanted to identify novel transcript regions in my FASTQ files. I ran TopHat2/Cufflinks independently on each, and produced a new consensus reference GTF. From this, overlaying these new identified transcripts with my reference FASTA genome, I then produced a new reference transcriptome in FASTA (using the gffread function in cufflinks). I then used Kalliso to determine raw counts in my samples over this new reference transcriptome.

In my example, I could have split the files by chromosome.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Kevin Blighe52k
1

Thank you Kevin! My aim is to find novel transcript so I can split the BAM by chromosome.

ADD REPLYlink written 2.2 years ago by qudrat70
1
gravatar for andrew.j.skelton73
2.2 years ago by
London
andrew.j.skelton735.9k wrote:

As far as I know Cufflinks is in a deprecated state and you should strongly consider the alternative of Stringtie - Better performance overall. Cufflinks works on RABT assembly so splitting by chromosome shouldn't cause too many significant issues. Your FP rate will depend more on your method of differential expression. Using Stringtie, you can prepare your abundance estimations for analysis with DESeq2, which is highly reputable in analysing RNA Seq data.

ADD COMMENTlink written 2.2 years ago by andrew.j.skelton735.9k
2

Just to echo @Kevin's thoughtful response, I agree a sanity check of Chr1+2 vs Chr1, Chr2 individually is a great idea. I also second the fact that I only use transcriptional assembly methods to get the GTF file, and clean things up with TACO.

ADD REPLYlink written 2.2 years ago by andrew.j.skelton735.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 732 users visited in the last hour