Convert RPKM data to read count data matrix
1
2
Entering edit mode
18 months ago
chtrgswm ▴ 20

I have RNA seq data for each sample in the following format :

Each sample has data in this format for 22 chromosomes, in their respective csv files.

I require it count matrix format (gene in rows, samples in columns or vice-versa)

Is there are package in R or python that will allow me to achieve this? Please help.

RPKM Count Data RNA-Seq R python • 935 views
1
Entering edit mode

Difficult as you would need the exact code that was used for the RPKM conversion including the effective total library size and gene length to reverse the actions. Even then it is difficult because this table is a non-standard format and you would need a custom approach. Is there any chance you get your hands on the raw data (fastq or at least aligned BAM) so that you can align or quantify from scratch? That would be the best and safest option. Is tag_count the raw counts?

0
Entering edit mode

There are sample wise fastq files. But it would be computation intensive, since there are many samples.

0
Entering edit mode

Take a look at salmon, that should be fairly light.

0
Entering edit mode

Yep, and even if it takes time, it will make your life a lot easier. Salmon is really fast, and barely requires memory. I strongly recommend it over any back-conversion of what you have as table.

0
Entering edit mode

Ok. Thank you. I have similar case from another study in which fastq files are not available - https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE65705. What might I do in this case?

0
Entering edit mode

They are accessable for download. The relevant information is the BioProject PRJNA274751. Use this one to get files directly from the ENA, see my tutorial on how to do that efficiently: Fast download of FASTQ files from the European Nucleotide Archive (ENA)

All relevant files are here: https://www.ebi.ac.uk/ena/data/view/PRJNA274751 You can download them one-by-one or as batch if you use the code in the linked tutorial.

The problem with back-conversion of table-like data is that you need 100% the code which created the table plus exact gene / feature length that were used for the size correction. Otherwise you create flawed raw counts that totally skew your analysis. It really should only be the absolute last option to do something like that.

0
Entering edit mode

Fastq files are available from EBI-ENA. Always use that source when you need fastq files.