Confusion About Observations vs. Variables in my Data
5.7 years ago
Ark ▴ 90

Hello,

I am very new to bioinformatics (and statistics) and have a basic question that I haven't been able to fully answer on my own.

I have been using R to do some basic analysis of RNA-seq data. I know that "tidy data" conventions say to organize my data by putting observations into rows and variables into columns. However, I am not sure which part of my data would be considered the "observations" and which are the "variables".

Below is R code to generate one row of data formatted similarly to my dataset:

# One example gene (row) with 9 random read counts, one per subject/cell-type column
my_data <- data.frame(t(sample(20, 9)))
names(my_data) <- c("Subj_1_Cell_Type_X", "Subj_1_Cell_Type_Y", "Subj_1_Cell_Type_Z", 
                    "Subj_2_Cell_Type_X", "Subj_2_Cell_Type_Y", "Subj_2_Cell_Type_Z", 
                    "Subj_3_Cell_Type_X", "Subj_3_Cell_Type_Y", "Subj_3_Cell_Type_Z")
rownames(my_data) <- "Gene_ID"

In my data set, I have approximately 14,000 genes as rows and about 1,000 subject_cell_type columns.

I have left the data in this format up to this point; however, I am not sure that this is correct. I have found many resources discussing the differences between observations and variables, but for some reason the format of my data has left me unsure. I believe that I should transpose my data and treat each subject_cell_type as an observational unit and the gene read counts as my variables.

Is this the correct interpretation?

Also, if anyone has any informative links on distinguishing between observations and variables, I would be really grateful!

Thank you!

pca k-means RNA-Seq R
5.7 years ago
Ram 43k

Good on you, OP, for thinking about data representation conventions. It is indeed helpful to follow robust standards.

In your case, you're on the right track: genes should indeed be rows, but it would also help if your data were in "long form". For example, you have 3 subjects and 3 cell types, so you have 9 columns representing the combinations. If you were to add Gene as a column (instead of using row names), Subject as another column (values being one of c(1,2,3)) and Cell.Type as another column (values being one of c("X","Y","Z")), you could get away with just 4 columns (albeit with 9 times as many rows). The 4th column would hold the data (the read count), which is the observed value, and the other 3 columns would be the variables that identify each observation. This will make analysis easier down the line.
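
As a rough illustration of that reshaping, here is a minimal base R sketch built on the toy data from the question. The names long_data, Sample and Count are placeholders chosen for this example, and the pattern used to split the labels assumes every column name follows the Subj_<number>_Cell_Type_<letter> scheme shown above; real column names may need a different pattern.

```r
# Rebuild the toy wide-format data from the question (1 gene x 9 samples).
set.seed(1)  # only so the example is reproducible
my_data <- data.frame(t(sample(20, 9)))
names(my_data) <- paste0("Subj_", rep(1:3, each = 3),
                         "_Cell_Type_", rep(c("X", "Y", "Z"), times = 3))
rownames(my_data) <- "Gene_ID"

# Wide -> long: one row per gene/sample combination.
# as.vector() reads the matrix column by column, so genes cycle fastest.
counts <- as.matrix(my_data)
long_data <- data.frame(
  Gene   = rep(rownames(counts), times = ncol(counts)),
  Sample = rep(colnames(counts), each = nrow(counts)),
  Count  = as.vector(counts),
  stringsAsFactors = FALSE
)

# Split the combined label into separate Subject and Cell.Type columns.
# Assumes labels look like "Subj_1_Cell_Type_X" (hypothetical naming scheme).
long_data$Subject   <- as.integer(sub("^Subj_([0-9]+)_Cell_Type_.*$", "\\1",
                                      long_data$Sample))
long_data$Cell.Type <- sub("^Subj_[0-9]+_Cell_Type_(.*)$", "\\1",
                           long_data$Sample)

# Keep the 4 columns described above.
long_data <- long_data[, c("Gene", "Subject", "Cell.Type", "Count")]
print(long_data)
```

With the full dataset the same steps should apply unchanged, giving roughly 14,000 x 1,000 rows. Packages such as tidyr offer more compact ways to do this, but the base R version avoids extra dependencies.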


Thank you for clearing that up for me! I'm relieved, as I was worried that the (limited) work I had already done was going to need to be scrapped because I forgot to check the structure of my data. I won't make this mistake again!

I will look into creating the columns that you mentioned also. Most of my confusion came from the combined subject/cell labels and splitting the two into distinct columns would definitely help me visualize and think about the data in a more intuitive way.


Hey, no problem! Also, I edited my answer because I counted the columns wrong. There will be 4, not 5 columns.
