How to calculate gene expression levels from ENCODE RNA-seq experiments for protein-coding genes in GENCODE
10.6 years ago
Michael Z • 0

Currently I am using the protein-coding genes from GENCODE as a stable source of gene annotations. However, when I try to find gene expression data for these genes, I am confused.

There are RNA-seq data for each cell line, such as K562, including a bigWig file called "Transcription of K562 cells from ENCODE", which seems to give the expression level on some scale, but I cannot find detailed information about how it was calculated.

Forgive me if this is a simple question, as I am completely new to RNA-seq: should I start from the BAM alignment files for each replicate of the RNA-seq experiment, count how many reads fall on the gene body regions, and divide by the total number of reads in the replicate to get the RPKM?
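
For reference, RPKM (reads per kilobase of transcript per million mapped reads) also divides by the gene length, not only by the total read count. A minimal sketch of the arithmetic follows; the function name `rpkm` and the example numbers are made up for illustration, and obtaining the read counts from the BAM files is a separate step not shown here.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    The factor 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical numbers: 500 reads on a 2,000 bp gene, 20 million mapped reads.
example_value = rpkm(500, 2000, 20_000_000)
print(example_value)
```

The gene length here would usually be the summed exon length rather than the full genomic span, but that choice is an assumption you should make explicit in your own analysis.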

Or can I simply take the values from the bigWig files, sum those falling on a gene body, and apply some extra normalization?

Thanks!

rna-seq gene-expression • 7.5k views
10.6 years ago
dario.garvan ▴ 520

You can't use bigWig files for counting: there's no way to figure out how many reads generated the pileup at a particular position. You will need to use mapped data. Note, however, that ENCODE's mappings used TopHat 1.0.14, which had some important bugs. One of them was that it would map reads to pseudogenes instead of splice junctions, even when the mapping to the splice junction was better. Map the reads to the genome yourself using the latest version of TopHat or, alternatively, STAR.
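
A toy illustration of the counting problem, assuming reads are represented as hypothetical half-open `(start, end)` intervals: one long read and two shorter adjacent reads produce the same per-base pileup, so the pileup alone does not tell you the read count. This is only a sketch of the idea, not a description of how ENCODE built its signal tracks.

```python
def coverage(reads, region_length):
    """Per-base depth for reads given as half-open (start, end) intervals."""
    depth = [0] * region_length
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


# One 4-base read versus two adjacent 2-base reads over the same bases.
one_read = coverage([(0, 4)], 6)
two_reads = coverage([(0, 2), (2, 4)], 6)

# Both pileups are identical even though the read counts differ (1 vs 2).
print(one_read == two_reads)
```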

10.6 years ago
Manvendra Singh ★ 2.2k

Dario is right. The easiest approach is to download (e.g. with wget) the K562 FASTQ files from ENCODE, fetch the genome sequence and index it with Bowtie 2, then run TopHat2 on the reads. The output will include an accepted_hits.bam file. Next, run Cufflinks on that file, providing a GTF file that contains your protein-coding genes from GENCODE. If you are working with replicates, run Cuffcompare on the results. The final GTF file will then contain FPKM values for each gene in the dataset.
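
One way to prepare such a GTF file is to keep only the records annotated as protein-coding. The sketch below assumes the GENCODE convention of a `gene_type "protein_coding"` entry in the ninth (attribute) column; other annotation sources may use a different key such as `gene_biotype`. The helper name and the sample records are made up for illustration.

```python
def keep_protein_coding(gtf_lines):
    """Return GTF records whose attribute column says gene_type "protein_coding".

    Comment lines (starting with '#') and malformed records are dropped.
    """
    kept = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        if 'gene_type "protein_coding"' in fields[8]:
            kept.append(line)
    return kept


# Made-up example records (not real GENCODE entries).
sample = [
    "##description: toy annotation\n",
    'chr1\tTOY\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A"; gene_type "protein_coding";\n',
    'chr1\tTOY\texon\t100\t300\t.\t+\t.\tgene_id "GENE_A"; gene_type "protein_coding";\n',
    'chr1\tTOY\tgene\t2000\t2500\t.\t-\t.\tgene_id "GENE_B"; gene_type "pseudogene";\n',
]
protein_coding_records = keep_protein_coding(sample)
print(len(protein_coding_records))
```

Keeping the exon records as well as the gene records matters here, since tools that quantify against an annotation generally need the exon structure.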

10.6 years ago
Michael Z • 0

Thanks Dario and Manu, I will try your methods!
