I am a complete newbie when it comes to interpreting RNA-Seq information (I'm actually studying computer science), but I have an interest in learning more about scientific research to see if this is a career I'd like to pursue. For this reason, I decided to do a summer internship in a biology lab to learn more about wet-lab techniques and possibly use my computer knowledge to help the lab I'm in.
Currently I'm running into some problems that I'm sure someone more familiar with this data can easily point out. I have been able to download an archive for the disease of interest, and since I’m interested in the RNASeqV2 information to determine expression levels, I retrieved that data. After extracting the archive, I have 28 samples, each with the following types of files.
I have been told that RNA expression is the most important, so I’ve been focusing on interpreting the .rsem.genes.results file and the .rsem.genes.normalized_results file, but I have been having difficulty. The first file is composed of 4 columns labeled gene_id, raw_count, scaled_estimate, and transcript_id.
So I guess my first question is what is meant by the raw_count and scaled_estimate columns?
The second file (i.e., the .rsem.genes.normalized_results file) has only 2 columns, labeled gene_id and normalized_count. What is meant by normalized count?
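For reference, this is roughly how the two files could be loaded in Python. It is only a sketch: it assumes both files are tab-separated with a header row, and the gene IDs, transcript IDs, and numbers below are made up for illustration, not taken from my data.

```python
import csv
import io

# Hypothetical rows for illustration only; the real files have one row per gene.
GENES_RESULTS = (
    "gene_id\traw_count\tscaled_estimate\ttranscript_id\n"
    "GENE_A|1\t100.00\t2.5e-06\ttx1,tx2\n"
    "GENE_B|2\t0.00\t0\ttx3\n"
)
NORMALIZED_RESULTS = (
    "gene_id\tnormalized_count\n"
    "GENE_A|1\t150.5\n"
    "GENE_B|2\t0.0\n"
)


def read_table(handle, numeric_columns):
    """Read a tab-separated table with a header row into a dict keyed by gene_id."""
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # Convert the listed columns from text to floats.
        for column in numeric_columns:
            row[column] = float(row[column])
        table[row.pop("gene_id")] = row
    return table


# Assumption: the files are tab-delimited. io.StringIO stands in for open(path).
genes = read_table(io.StringIO(GENES_RESULTS), ["raw_count", "scaled_estimate"])
normalized = read_table(io.StringIO(NORMALIZED_RESULTS), ["normalized_count"])

# Print the raw and normalized counts side by side for each gene.
for gene_id in genes:
    print(gene_id, genes[gene_id]["raw_count"], normalized[gene_id]["normalized_count"])
```

With real data, `io.StringIO(...)` would be replaced by an open file handle for each sample's file.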
Also, the people I work with have told me that having normal cells to act as a control versus the cancer cells is important. Does the normalized results file include this information?
Any information you guys can give me would be greatly appreciated.