Question: How to calculate gene expression levels from ENCODE RNA-seq experiments for protein-coding genes in GENCODE
Michael Z wrote (5.5 years ago):

Currently I am using the protein-coding genes from GENCODE as a stable source of gene annotations. However, when I try to find gene expression data for these genes, I am confused.

There are RNA-seq data for each cell line, such as K562, and each has a bigWig file called "Transcription of K562 cells from ENCODE", which looks like the expression level on some scale, but I cannot find detailed information about how it was calculated.

Forgive me if this is a simple question, I am completely new to RNA-seq: should I start from the BAM alignment files for each replicate of the RNA-seq experiment, count how many reads fall on the gene body regions, and divide by the total number of reads in the replicate to get the RPKM?

Or can I simply take the values from the bigWig files, sum those falling on a gene body, and apply some extra normalization?

Thanks!

gene-expression rna-seq • 5.9k views
dario.garvan wrote (5.5 years ago):

You can't use bigWig files for read counting: the signal track does not tell you how many reads generated the pileup at a given position. You will need to use the mapped reads. Note, however, that ENCODE's mappings used TopHat 1.0.14, which had some important bugs. One of them was that it would map reads to pseudogenes instead of splice junctions, even when the mapping to the splice junction was better. Map the reads to the genome yourself using the latest version of TopHat or, alternatively, STAR.
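
Once reads are mapped and counted per gene, the RPKM mentioned in the question can be computed along the lines below. This is only a minimal sketch: the function name `rpkm` and the example numbers are made up for illustration, and it assumes you already have a read count for the gene, the gene (or summed exon) length in base pairs, and the total number of mapped reads in the replicate. Note that the conventional definition also divides by gene length in kilobases, a step the question's description leaves out.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Hypothetical gene: 500 reads, 2,000 bp long, 20 million mapped reads
    print(rpkm(500, 2000, 20_000_000))
```

With replicates, one would typically compute this per replicate and then compare or average the values, rather than pooling raw counts across libraries of different depth.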

Manvendra Singh wrote (5.5 years ago):

Dario is right. The easiest approach is to wget the K562 FASTQ files from ENCODE, fetch the genome sequence and index it with Bowtie2, then run TopHat2 on the reads. In the output you will see an accepted_hits.bam file. Run Cufflinks on it, providing a GTF file that contains your protein-coding genes from GENCODE. If you have replicates, run Cuffcompare on the results. The final GTF file will then contain FPKM values for each gene in the dataset.
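
The steps above can be sketched as a small Python helper that assembles the command lines without running them. Treat it as a rough outline, not a tested pipeline: the function name `build_pipeline`, all file names, the output directory layout, and the thread count are placeholders, and it covers a single replicate only. The `-G` option passed to Cufflinks restricts quantification to the supplied annotation, which is assumed here to be a GENCODE GTF filtered to protein-coding genes.

```python
import shlex


def build_pipeline(genome_fasta, index_prefix, fastq, gtf, out_dir, threads=4):
    """Return the index -> align -> quantify commands as argument lists."""
    tophat_dir = f"{out_dir}/tophat"
    cufflinks_dir = f"{out_dir}/cufflinks"
    return [
        # 1. Build a Bowtie2 index from the genome FASTA
        ["bowtie2-build", genome_fasta, index_prefix],
        # 2. Spliced alignment with TopHat2, guided by the annotation
        ["tophat2", "-p", str(threads), "-G", gtf,
         "-o", tophat_dir, index_prefix, fastq],
        # 3. Estimate FPKM for the annotated genes with Cufflinks
        ["cufflinks", "-p", str(threads), "-G", gtf,
         "-o", cufflinks_dir, f"{tophat_dir}/accepted_hits.bam"],
    ]


if __name__ == "__main__":
    # Placeholder file names; substitute your own downloads
    for cmd in build_pipeline("hg19.fa", "hg19", "K562_rep1.fastq",
                              "gencode.protein_coding.gtf", "K562_rep1"):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed. Cufflinks also writes a `genes.fpkm_tracking` table in its output directory, which is often more convenient than the GTF for reading off per-gene FPKM.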

Michael Z wrote (5.5 years ago):

Thanks Dario and Manu, I will try your methods!
