Hg19 Gene Annotation With Tophat/Cufflinks
2
3
Entering edit mode
12.7 years ago
Oliver ▴ 240

Hello folks,

I wondered about the appropriate gene annotation for finding differential expression with Cufflinks in human RNA-Seq data without finding new splice junctions (using hg19).

My test data is from the SRA: http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP000698

As an annotation file, I downloaded Homo_sapiens.GRCh37.63.gtf (as seen in other tutorials) and as genome I use hg19, so no assembly was done (and I skipped the cufflinks step and started directly with cuffdiff). I ask this question because I got no differential expression with cuffdiff because all tests failed with NOTEST. But viewing the bam-files with IGV they are well aligend at exons.

This is my pipeline (ok, I pooled the samples, this seems not to be the best way, but anyway, I should get some results), based on http://vallandingham.me/RNA_seq_differential_expression.html

tophat -G /ngs_process/projects/References/annotations/Homo_sapiens.GRCh37.63.gtf -p 4 -r 160 -o output/SRR0278/top_SRR027863-65_withannotation ../References/hg19/hg19 fastq/SRR0278/SRR027863_1.fastq,fastq/SRR0278/SRR027864_1.fastq,fastq/SRR0278/SRR027865_1.fastq fastq/SRR0278/SRR027863_2.fastq,fastq/SRR0278/SRR027864_2.fastq,fastq/SRR0278/SRR027865_2.fastq
tophat -G /ngs_process/projects/References/annotations/Homo_sapiens.GRCh37.63.gtf -p 4 -r 160 -o output/SRR0278/top_SRR027866-67_withannotation ../References/hg19/hg19 fastq/SRR0278/SRR027866_1.fastq,fastq/SRR0278/SRR027867_1.fastq fastq/SRR0278/SRR027866_2.fastq,fastq/SRR0278/SRR027867_2.fastq

cuffdiff -p 4 -L actived,resting -o diff_SRR027863-65vs66-67_withannotation /ngs_process/projects/References/annotations/Homo_sapiens.GRCh37.63.gtf top_SRR027863-65_withannotation/accepted_hits.bam top_SRR027866-67_withannotation/accepted_hits.bam

When looking at the file geneexp.diff, I got useless results all the way (head geneexp.diff):

test_id gene_id gene    locus   sample_1        sample_2        status  value_1 value_2 ln(fold_change) test_stat       p_value q_value significant
ENSG00000000003 ENSG00000000003 TSPAN6  X:99883666-99894988     actived resting NOTEST  0       0       0       0       1       1       no
ENSG00000000005 ENSG00000000005 TNMD    X:99839798-99854882     actived resting NOTEST  0       0       0       0       1       1       no
ENSG00000000419 ENSG00000000419 DPM1    20:49551403-49575092    actived resting NOTEST  0       0       0       0       1       1       no
ENSG00000000457 ENSG00000000457 SCYL3   1:169631244-169863408   actived resting NOTEST  0       0       0       0       1       1       no
ENSG00000000460 ENSG00000000460 C1orf112        1:169631244-169863408   actived resting NOTEST  0       0       0       0       1       1       no
ENSG00000000938 ENSG00000000938 FGR     1:27938574-27961788     actived resting NOTEST  0       0       0       0       1       1       no
ENSG00000000971 ENSG00000000971 CFH     1:196621007-196716634   actived resting NOTEST  0       0       0       0       1       1       no
ENSG00000001036 ENSG00000001036 FUCA2   6:143816613-143832827   actived resting NOTEST  0       0       0       0       1       1       no
ENSG00000001084 ENSG00000001084 GCLC    6:53362138-53481969     actived resting NOTEST  0       0       0       0       1       1       no

Using samtools flagstat topSRR027863-65withannotation/accepted_hits.bam:

28324268 in total
0 QC failure
0 duplicates
28324268 mapped (100.00%)
28324268 paired in sequencing
14290946 read1
14033322 read2
20901782 properly paired (73.79%)
23773098 with itself and mate mapped
4551170 singletons (16.07%)
0 with mate mapped to a different chr
0 with mate mapped to a different chr (mapQ>=5)

And samtools flagstat topSRR027866-67withannotation/accepted_hits.bam:

14750160 in total
0 QC failure
0 duplicates
14750160 mapped (100.00%)
14750160 paired in sequencing
7202896 read1
7547264 read2
10245492 properly paired (69.46%)
12095724 with itself and mate mapped
2654436 singletons (18.00%)
0 with mate mapped to a different chr
0 with mate mapped to a different chr (mapQ>=5)

So I wondered if this annotation file is the best to use in this case. Or why do I get no useful results? I provided tophat and cuffdiff the same annotation file, so even if its wrong, it should give me some results.

Are there any ideas what went wrong? Running without any annotation as described in the Cufflinks tutorial ( http://cufflinks.cbcb.umd.edu/tutorial.html ), I get some results with differential expression (my old pipeline: http://biostar.stackexchange.com/questions/11401/meaning-of-fpkm-value-used-by-cufflinks )

I also found http://cufflinks.cbcb.umd.edu/igenomes.html that provided an annotation package: ftp://igenome:G3nom3s4u@ftp.illumina.com/Homo_sapiens/UCSC/hg19/Homo_sapiens_UCSC_hg19.tar.gz but I'm not sure if the gene.gtf file I found in the package is appropriate. Where is the difference to Homo_sapiens.GRCh37.63.gtf ?

I would be happy if anyone has any hints for me :-)

Thanks in advance, Oliver

Further details:

cuffdiff v1.0.3 (2403)
TopHat v1.3.1
cuffdiff tophat rna gene annotation • 12k views
ADD COMMENT
4
Entering edit mode
12.7 years ago

Did you make sure that the sequence names in your original fasta file and those in the GTF file match? Ensembl typically uses chromosome names without the "chr" in front whereas the fasta files from UCSC, for example, use "chr" in front of the chromosome names. If there is a mismatch, cuffdiff will think that no reads mapped to the transcripts because the chromosome names did not match.

ADD COMMENT
1
Entering edit mode

Ok, actually I tried the gene.gtf file from iGenome and this works in some way since the exons are identified by the regions on their chromosomes and not by an scaffold id. I got useful test results for 66% of the genes in the file, the other 34% fail. Not that satisfying at all but better than before :-)

ADD REPLY
1
Entering edit mode

As for genes "failing", keep in mind that in any given differentiated tissue, not all genes will be expressed AT ALL. So, 65% of genes expressed does not sound all that unusual to me.

ADD REPLY
0
Entering edit mode

Thanks for that hint, Sean. Actually, both files are from UCSC. But indeed, the hg19 sequence names start with 'chr..' (and actually nothing follows despite the chromosome number ;-)) and the gtf-file entries contain in the first column something like 'GL000213.1' ... so this is no chromosome number. Could this be the problem? If so, how do I solve it?

ADD REPLY
0
Entering edit mode

The chromosome names need to match in the reference to which you aligned and the gtf file.

ADD REPLY
0
Entering edit mode
11.4 years ago
yangzou306 • 0

Dear Sean Davis: I have the same problem when I use tophat and cufflinks, do you fix it now? can you give me the annotation.gtf downloaded from ucsc, thanks very much.

ADD COMMENT
1
Entering edit mode

You can get data from UCSC in GTF format using the Table Browser.

ADD REPLY
0
Entering edit mode

Sir Davis, you save my day.

ADD REPLY

Login before adding your answer.

Traffic: 2539 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6