Question: Convert RPKM data to read count data matrix
1
gravatar for chtrgswm
5 months ago by
chtrgswm10
chtrgswm10 wrote:

I have RNA seq data for each sample in the following format : enter image description here

Each sample has data in this format for 22 chromosomes, in their respective csv files.

I require it count matrix format (gene in rows, samples in columns or vice-versa)

enter image description here

Is there are package in R or python that will allow me to achieve this? Please help.

python rna-seq R rpkm count data • 249 views
ADD COMMENTlink modified 5 months ago • written 5 months ago by chtrgswm10
1

Difficult as you would need the exact code that was used for the RPKM conversion including the effective total library size and gene length to reverse the actions. Even then it is difficult because this table is a non-standard format and you would need a custom approach. Is there any chance you get your hands on the raw data (fastq or at least aligned BAM) so that you can align or quantify from scratch? That would be the best and safest option. Is tag_count the raw counts?

ADD REPLYlink modified 5 months ago • written 5 months ago by ATpoint31k

There are sample wise fastq files. But it would be computation intensive, since there are many samples.

ADD REPLYlink written 5 months ago by chtrgswm10

Take a look at salmon, that should be fairly light.

ADD REPLYlink written 5 months ago by WouterDeCoster43k

Yep, and even if it takes time, it will make your life a lot easier. Salmon is really fast, and barely requires memory. I strongly recommend it over any back-conversion of what you have as table.

ADD REPLYlink written 5 months ago by ATpoint31k

Ok. Thank you. I have similar case from another study in which fastq files are not available - https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE65705. What might I do in this case?

ADD REPLYlink modified 5 months ago • written 5 months ago by chtrgswm10

They are accessable for download. The relevant information is the BioProject PRJNA274751. Use this one to get files directly from the ENA, see my tutorial on how to do that efficiently: Fast download of FASTQ files from the European Nucleotide Archive (ENA)

All relevant files are here: https://www.ebi.ac.uk/ena/data/view/PRJNA274751 You can download them one-by-one or as batch if you use the code in the linked tutorial.

The problem with back-conversion of table-like data is that you need 100% the code which created the table plus exact gene / feature length that were used for the size correction. Otherwise you create flawed raw counts that totally skew your analysis. It really should only be the absolute last option to do something like that.

ADD REPLYlink modified 5 months ago • written 5 months ago by ATpoint31k

Fastq files are available from EBI-ENA. Always use that source when you need fastq files.

ADD REPLYlink written 5 months ago by genomax80k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1887 users visited in the last hour