Question

Convert RPKM into Counts

20 months ago

Dhruv ▴ 10

I'm working on a project involving datasets from GEO. My project requires having a matrix with counts vs genes. However, a dataset (GSE57152) I am working with is formatted in a less than useful way.

It has a normalization matrix in RPKM instead of counts, and no gene names alongside the values. The list of measured genes is in separate .txt files, as are the raw data for each sample (these files do not contain gene names either). What is the best way to get a matrix with counts and genes for this dataset?

When you open up the raw data .txt files for each sample this is what you get:

       AC         AG         CA         GT         CT         GA         GC         GG         AA         AT         CC         CG
 0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
 0.000000   0.000000   0.000000   0.000000   0.000000   0.670000   0.330000   0.000000   0.000000   0.350000   2.060000   0.000000
 0.000000   0.000000   0.000000   5.980000   0.180000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000
 0.370000   0.000000   1.040000   0.320000   0.000000   0.000000   0.000000   2.010000   5.370000   0.000000   7.860000   0.000000
 2.730000   1.170000  24.320000   0.000000   6.320000  26.030000   0.000000   0.000000   1.900000   0.000000   0.000000   0.000000
 0.000000  88.730000   0.000000   1.860000   0.000000   0.000000   0.180000   0.000000   0.000000   0.170000   0.000000   0.000000
 7.420000   0.000000   0.000000   0.000000   2.300000   9.670000   2.850000   0.000000   0.000000   4.520000   5.980000  63.160000
 0.000000   0.170000   0.000000   0.000000   7.350000   0.780000   1.630000   0.00000

gene data r expression

updated 20 months ago by abbey ▴ 210 • written 20 months ago by Dhruv ▴ 10

score 1 · Answer 1 · 2022-08-09

GEO datasets often have additional supplementary files along with the main data. At the bottom of the page you linked, "GSE57152_readme.txt" has details explaining that you use the supplementary file "gene_list.txt" to determine what gene each row corresponds to:

Gene_list.txt List of associated RefSeq gene names corresponding to rows in Sample*.txt files (see below).

Sample*.txt Sample*.txt files contain RPKM values of genes for each of the SQUARE matrix fields. Each field represents a transcript with one of the 12 different ending dinucleotides: 'AC' 'AG' 'CA' 'GT' 'CT' 'GA' 'GC' 'GG' 'AA' 'AT' 'CC' 'CG'

Columns correspond to matrix fields. Rows correspond to genes according to Gene_list.txt file. Gene list order is identical to all Sample*.txt files. First row is the 12 SQUARE ending dinucleotides.

NormMatrix.txt Scaling factor based on the mean PCR field yield after patient normalization. These values should be used to normalize RPKM values per patient per field.
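As a rough illustration of the layout the readme describes, here is a minimal Python sketch that pairs each row of a Sample*.txt file with the matching line of the gene list. It assumes the values are whitespace-delimited and that the gene list holds one name per line with no header; neither is confirmed by the readme excerpt, so check the real files. The gene names `GENE_A` and `GENE_B` are hypothetical placeholders, and the two data rows are copied from the sample shown in the question.

```python
import io

# The 12 SQUARE ending dinucleotides, in the order listed in the readme.
FIELDS = ["AC", "AG", "CA", "GT", "CT", "GA", "GC", "GG", "AA", "AT", "CC", "CG"]


def load_sample(sample_file, gene_file):
    """Pair each RPKM row of a Sample*.txt file with its gene name.

    Assumption: whitespace-delimited values, first row is the field header,
    and the gene list has one name per line in the same order as the rows.
    Returns a list of (gene, {field: rpkm}) tuples in file order.
    """
    genes = [line.strip() for line in gene_file if line.strip()]
    rows = [line.split() for line in sample_file if line.strip()]
    header, data = rows[0], rows[1:]
    if header != FIELDS:
        raise ValueError(f"unexpected header: {header}")
    if len(data) != len(genes):
        raise ValueError(f"{len(data)} data rows but {len(genes)} genes")
    table = []
    for gene, values in zip(genes, data):
        if len(values) != len(FIELDS):
            raise ValueError(f"{gene}: expected {len(FIELDS)} values, got {len(values)}")
        table.append((gene, dict(zip(FIELDS, map(float, values)))))
    return table


# Toy stand-ins for a Sample*.txt file and Gene_list.txt (hypothetical gene names).
sample_txt = io.StringIO(
    "AC\tAG\tCA\tGT\tCT\tGA\tGC\tGG\tAA\tAT\tCC\tCG\n"
    "0.000000\t0.000000\t0.000000\t0.000000\t0.000000\t0.670000\t"
    "0.330000\t0.000000\t0.000000\t0.350000\t2.060000\t0.000000\n"
    "0.000000\t0.000000\t0.000000\t5.980000\t0.180000\t0.000000\t"
    "0.000000\t0.000000\t0.000000\t0.000000\t0.000000\t0.000000\n"
)
gene_txt = io.StringIO("GENE_A\nGENE_B\n")

table = load_sample(sample_txt, gene_txt)
for gene, rpkm in table:
    print(gene, rpkm["GT"], rpkm["GA"], rpkm["CC"])
```

With the real files you would open each Sample*.txt and Gene_list.txt and pass the file handles instead of the `io.StringIO` objects. A list of tuples is used rather than a dictionary keyed by gene so that repeated gene names, if any, are not silently overwritten.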

I'm not super familiar with microarray/expression data in this format, but I believe the RPKM data in the Sample*.txt files plus the scaling factors in "NormMatrix.txt" should allow you to back-calculate the original counts for each sample. The readme also gives details on how the analysis was done, which might help:
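To make the back-calculation concrete: by the standard definition, RPKM is the read count divided by the gene length in kilobases and by the total mapped reads in millions. Inverting it therefore needs each gene's length and each sample's total mapped read count. Neither appears in the file descriptions quoted above, so both are assumptions here and would have to come from elsewhere, for example the RefSeq annotation and the original alignments. The sketch below uses made-up values for both and does not apply the NormMatrix.txt scaling, since the excerpt does not say whether the Sample*.txt values have already been scaled.

```python
def rpkm_to_counts(rpkm, gene_length_bp, total_mapped_reads):
    """Estimate a read count from an RPKM value.

    Inverts RPKM = count * 1e9 / (gene_length_bp * total_mapped_reads).
    The result is an estimate, not an integer raw count.
    """
    return rpkm * gene_length_bp * total_mapped_reads / 1e9


def counts_to_rpkm(count, gene_length_bp, total_mapped_reads):
    """Forward direction, useful for checking the inversion round-trips."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical inputs: a 2,000 bp gene and 10 million mapped reads.
# Only the RPKM value (5.98) is taken from the sample shown in the question.
estimate = rpkm_to_counts(5.98, gene_length_bp=2000, total_mapped_reads=10_000_000)
print(round(estimate, 1))
```

With these made-up inputs the estimate is about 119.6 reads. Rounding and any extra normalization mean the recovered values only approximate the original counts. Since the readme mentions that read-count tables were generated with HTSeq, obtaining those tables directly would be more reliable if they are available.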

Details for read alignment: Sequence and quality files, in Lifetech proprietary .xsq binary format, were mapped against the GRCh37/hg19 version of the Homo sapiens genome using the Lifetech Lifescope 2.5.1 whole Transcriptome analysis pipeline. The files produced by this analytical pipeline were coverage; alignment (.bam files); exon junction; gene expression in RPKM with reference to the RefSeq gene structure; read counts with reference to each gene. Quality control metrics were generated both with the Lifetech suite and the Integromics SeqSolve analysis suite on all the samples. With the TopHat 2.0.11 and Cufflinks 2.1.1 suite against GRCh37 ENSEMBL hg19 genome sequence and associated ENSEMBL exon / transcript annotation in .gtf format, hence excluding 'ab initio' assembled transcripts. Tables of read counts per gene were generated from the alignments using the HTSEQ package. Read lengths were 75 nucleotides, fragments, with a percentage of genome alignments over the whole sequence length over 80%. The minimal sequence base quality value selected for further processing was 10 (Phred score). Bases with a quality value below this parameter were replaced with 'N'. Progressive alignment method was selected. The minimum genome alignment quality value for an alignment to be processed was again 10 (Phred value). Only primary alignments were considered for gene counts and quantification. The minimal identity seed for alignment extension was 25 nucleotides. The genome mapping percentage of the libraries was always between 60% and 80% of the initial transcripts.