Question: Difference In Cufflinks Transcripts With Merging 3 Cufflink Outputs And By Pooling All 3 Libraries?
Rahul Sharma (Germany) wrote, 5.4 years ago:

Hi,

I first assembled an 80 Mb genome using four Illumina libraries. After masking the repeat elements, I am now trying to identify the genes from RNA-Seq data using Cufflinks. I have three single-end Illumina RNA-Seq libraries (isolated at three different time points, from the pure to the infection phase of the pathogen).

1). In the first run, I ran tophat2 and then Cufflinks on each of the three libraries individually to get the transcripts.

nohup tophat2 --num-threads 35 --b2-very-sensitive -o $I/Tophat_out_MAsked_genome_lib1 $I/SCa_gtr_300_discarded_90_99per_Ns_for_Masked_Tophat.fa.index $D/lib1.fastq &

nohup tophat2 --num-threads 35 --b2-very-sensitive -o $I/Tophat_out_MAsked_genome_lib2 $I/SCa_gtr_300_discarded_90_99per_Ns_for_Masked_Tophat.fa.index $D/lib2.fastq &

nohup tophat2 --num-threads 35 --b2-very-sensitive -o $I/Tophat_out_MAsked_genome_lib3 $I/SCa_gtr_300_discarded_90_99per_Ns_for_Masked_Tophat.fa.index $D/lib3.fastq &
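Since the three commands above differ only in the library name, they could also be written as a loop. This is a minimal sketch, not the exact way the runs were launched: `run_tophat` and `DRY_RUN` are made-up names for illustration, and `I` and `D` are assumed to be the index and read directories used in the commands above. The loop processes one library at a time, so three 35-thread jobs do not compete for the same cores.

```shell
#!/bin/sh
# Sketch only: run_tophat and DRY_RUN are illustrative names, not part of TopHat.
# I and D are assumed to be the index and read directories from the post.
I=${I:-.}
D=${D:-.}
INDEX="$I/SCa_gtr_300_discarded_90_99per_Ns_for_Masked_Tophat.fa.index"

run_tophat() {
    lib=$1
    # Same options as the original commands; only the library name varies.
    set -- tophat2 --num-threads 35 --b2-very-sensitive \
        -o "$I/Tophat_out_MAsked_genome_$lib" "$INDEX" "$D/$lib.fastq"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"    # dry run (the default here): print the command only
    else
        "$@"         # DRY_RUN=0: actually run tophat2
    fi
}

# One library after another, instead of three background jobs at once.
for lib in lib1 lib2 lib3; do
    run_tophat "$lib"
done
```

By default the sketch only prints the three command lines; setting `DRY_RUN=0` would execute them.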

Each run produced an accepted_hits.bam file, which I used as input to Cufflinks like this:

cufflinks -o Cufflinks_all/ -p 30 -L Ph ./Tophat_out_MAsked_genome_All_RNA_Seq/accepted_hits.bam

This way I got around 800, 19,000 and 20,000 transcripts for lib1, lib2 and lib3, respectively. I then merged the transcripts using cuffmerge with this command:

cuffmerge -s $I/SCa_gtr_300_discarded_90_99per_Ns_for_Masked_Tophat.fa $I/assemblies.txt
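For context, the assemblies.txt file passed to cuffmerge is a plain text manifest with one GTF path per line. A minimal sketch of how it could be written follows; `write_manifest` is a hypothetical helper and the three directory names are placeholders, since the per-library Cufflinks output directories are not shown in the post.

```shell
#!/bin/sh
# Sketch only: write_manifest is an illustrative helper, not a Cufflinks tool.
# It writes one "<dir>/transcripts.gtf" line per Cufflinks output directory.
write_manifest() {
    out=$1
    shift
    : > "$out"                       # create or truncate the manifest
    for dir in "$@"; do
        echo "$dir/transcripts.gtf" >> "$out"
    done
}

# Example call (directory names are placeholders, adjust to the real ones):
#   write_manifest assemblies.txt Cufflinks_lib1 Cufflinks_lib2 Cufflinks_lib3
```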

Cuffmerge generated around 17,500 transcripts.
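When comparing counts like these, it helps to count transcripts the same way for every file. The sketch below is one way to do that, assuming standard GTF with a `transcript_id` attribute in column 9; `count_transcripts` is a made-up helper name. It counts distinct `transcript_id` values, so the result does not depend on whether a file contains separate transcript records or only exon records.

```shell
#!/bin/sh
# Sketch only: count_transcripts is an illustrative helper, not a Cufflinks tool.
count_transcripts() {
    # Extract the transcript_id attribute from column 9 and count distinct values.
    awk -F'\t' '
        match($9, /transcript_id "[^"]+"/) {
            ids[substr($9, RSTART, RLENGTH)] = 1
        }
        END { n = 0; for (id in ids) n++; print n }
    ' "$1"
}

# Example call (path is a placeholder):
#   count_transcripts Cufflinks_all/transcripts.gtf
```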

2). In the second run, I pooled all three libraries and ran tophat2 and Cufflinks on the single combined dataset. This generated ~21,000 transcripts.

My question is, which strategy should I follow? I am also interested in finding out differentially expressed genes using Cuffdiff. What should be the input .gtf file for Cuffdiff? The file generated by 1st method or the 2nd one?
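For reference, Cuffdiff takes a single GTF followed by one BAM per condition, so whichever assembly is chosen is passed in the same way. The sketch below only prints such a command line. `cuffdiff_cmd`, the output directory `Cuffdiff_out`, the labels `T1,T2,T3` and the `merged_asm/merged.gtf` path are assumptions for illustration; the BAM paths follow the tophat2 output directories shown earlier.

```shell
#!/bin/sh
# Sketch only: cuffdiff_cmd is an illustrative helper that prints a command line.
# Cuffdiff_out and the labels T1,T2,T3 are placeholder names.
I=${I:-.}

cuffdiff_cmd() {
    gtf=$1    # the chosen assembly (from method 1 or method 2)
    shift     # remaining arguments: one BAM per time point
    echo cuffdiff -o Cuffdiff_out -p 30 -L T1,T2,T3 "$gtf" "$@"
}

# Example: the GTF path here is an assumed location for the cuffmerge output.
cuffdiff_cmd merged_asm/merged.gtf \
    "$I/Tophat_out_MAsked_genome_lib1/accepted_hits.bam" \
    "$I/Tophat_out_MAsked_genome_lib2/accepted_hits.bam" \
    "$I/Tophat_out_MAsked_genome_lib3/accepted_hits.bam"
```

Swapping the first argument is all that changes between using the GTF from the first method and the one from the second.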

I would really appreciate your comments on this. Thank you so much in advance!

Best regards and wishes,

Rahul


Are the three libraries from the same sample or three different samples? That will determine which method is preferred.

written 5.4 years ago by Devon Ryan

Hi, thanks for your reply! They are from three different time points of pathogen infection. If my first goal is to find all genes in this species, should I use the GTF file from method 2? Later, I will also look for differentially expressed genes. Many thanks!

written 5.4 years ago by Rahul Sharma

1) Why cuffmerge and not cuffcompare? See also this thread: "RNA-seq with cuffdiff: use merged.gtf from cuffmerge or combined.gtf from cuffcompare?" 2) Personally, I think both methods are reasonable, but maybe you could remove the first library, which gives only 800 transcripts. That may account for part of the relatively large difference in the total number of transcripts.

written 5.4 years ago by Fabio Marroni