I wondered about the appropriate gene annotation for finding differential expression with Cufflinks in human RNA-Seq data without finding new splice junctions (using hg19).
My test data is from the SRA: http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP000698
As an annotation file, I downloaded Homo_sapiens.GRCh37.63.gtf (as seen in other tutorials) and as genome I use hg19, so no assembly was done (and I skipped the cufflinks step and started directly with cuffdiff). I ask this question because I got no differential expression with cuffdiff because all tests failed with NOTEST. But viewing the bam-files with IGV they are well aligend at exons.
This is my pipeline (ok, I pooled the samples, this seems not to be the best way, but anyway, I should get some results), based on http://vallandingham.me/RNA_seq_differential_expression.html
tophat -G /ngs_process/projects/References/annotations/Homo_sapiens.GRCh37.63.gtf -p 4 -r 160 -o output/SRR0278/top_SRR027863-65_withannotation ../References/hg19/hg19 fastq/SRR0278/SRR027863_1.fastq,fastq/SRR0278/SRR027864_1.fastq,fastq/SRR0278/SRR027865_1.fastq fastq/SRR0278/SRR027863_2.fastq,fastq/SRR0278/SRR027864_2.fastq,fastq/SRR0278/SRR027865_2.fastq tophat -G /ngs_process/projects/References/annotations/Homo_sapiens.GRCh37.63.gtf -p 4 -r 160 -o output/SRR0278/top_SRR027866-67_withannotation ../References/hg19/hg19 fastq/SRR0278/SRR027866_1.fastq,fastq/SRR0278/SRR027867_1.fastq fastq/SRR0278/SRR027866_2.fastq,fastq/SRR0278/SRR027867_2.fastq cuffdiff -p 4 -L actived,resting -o diff_SRR027863-65vs66-67_withannotation /ngs_process/projects/References/annotations/Homo_sapiens.GRCh37.63.gtf top_SRR027863-65_withannotation/accepted_hits.bam top_SRR027866-67_withannotation/accepted_hits.bam
When looking at the file geneexp.diff, I got useless results all the way (head geneexp.diff):
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 ln(fold_change) test_stat p_value q_value significant ENSG00000000003 ENSG00000000003 TSPAN6 X:99883666-99894988 actived resting NOTEST 0 0 0 0 1 1 no ENSG00000000005 ENSG00000000005 TNMD X:99839798-99854882 actived resting NOTEST 0 0 0 0 1 1 no ENSG00000000419 ENSG00000000419 DPM1 20:49551403-49575092 actived resting NOTEST 0 0 0 0 1 1 no ENSG00000000457 ENSG00000000457 SCYL3 1:169631244-169863408 actived resting NOTEST 0 0 0 0 1 1 no ENSG00000000460 ENSG00000000460 C1orf112 1:169631244-169863408 actived resting NOTEST 0 0 0 0 1 1 no ENSG00000000938 ENSG00000000938 FGR 1:27938574-27961788 actived resting NOTEST 0 0 0 0 1 1 no ENSG00000000971 ENSG00000000971 CFH 1:196621007-196716634 actived resting NOTEST 0 0 0 0 1 1 no ENSG00000001036 ENSG00000001036 FUCA2 6:143816613-143832827 actived resting NOTEST 0 0 0 0 1 1 no ENSG00000001084 ENSG00000001084 GCLC 6:53362138-53481969 actived resting NOTEST 0 0 0 0 1 1 no
Using samtools flagstat topSRR027863-65withannotation/accepted_hits.bam:
28324268 in total 0 QC failure 0 duplicates 28324268 mapped (100.00%) 28324268 paired in sequencing 14290946 read1 14033322 read2 20901782 properly paired (73.79%) 23773098 with itself and mate mapped 4551170 singletons (16.07%) 0 with mate mapped to a different chr 0 with mate mapped to a different chr (mapQ>=5)
And samtools flagstat topSRR027866-67withannotation/accepted_hits.bam:
14750160 in total 0 QC failure 0 duplicates 14750160 mapped (100.00%) 14750160 paired in sequencing 7202896 read1 7547264 read2 10245492 properly paired (69.46%) 12095724 with itself and mate mapped 2654436 singletons (18.00%) 0 with mate mapped to a different chr 0 with mate mapped to a different chr (mapQ>=5)
So I wondered if this annotation file is the best to use in this case. Or why do I get no useful results? I provided tophat and cuffdiff the same annotation file, so even if its wrong, it should give me some results.
Are there any ideas what went wrong? Running without any annotation as described in the Cufflinks tutorial ( http://cufflinks.cbcb.umd.edu/tutorial.html ), I get some results with differential expression (my old pipeline: http://biostar.stackexchange.com/questions/11401/meaning-of-fpkm-value-used-by-cufflinks )
I also found http://cufflinks.cbcb.umd.edu/igenomes.html that provided an annotation package: ftp://igenome:G3nom3s4u@ftp.illumina.com/Homo_sapiens/UCSC/hg19/Homo_sapiens_UCSC_hg19.tar.gz but I'm not sure if the gene.gtf file I found in the package is appropriate. Where is the difference to Homo_sapiens.GRCh37.63.gtf ?
I would be happy if anyone has any hints for me :-)
Thanks in advance, Oliver
cuffdiff v1.0.3 (2403) TopHat v1.3.1