I am relatively new to bioinformatics and I am a computer scientist, not a biologist, so I apologize in advance if the question sounds a little silly.
I am working on a model that deals with several kinds of data, including gene expression data.
For the moment, I am focusing on canonical protein-coding genes, using data from experiments performed on the K562 cell line and available on the ENCODE portal. I have already collected various data, but my issue now is to associate an expression level with each gene I am focusing on (canonical, protein-coding, in K562).
I have made some assumptions that I am not sure about, so I would appreciate your feedback.
- I thought gene expression levels have to be collected from RNA-seq experiments. (correct?)
- Since I am interested in expression levels, I should not look at just any kind of RNA-seq experiment, but only at those focusing on mRNA. (correct?)
I went to https://www.encodeproject.org/matrix/?type=Experiment&biosample_term_name=K562 and started wondering which RNA-seq experiments to consider among:
- shRNA RNA-seq
- siRNA RNA-seq
- polyA mRNA RNA-seq
- CRISPR RNA-seq
- total RNA-seq
- small RNA-seq
- polyA depleted RNA-seq
(these can be found by clicking on "+ See more..." under Assay, in the header)
- I thought the only eligible experiments are polyA mRNA RNA-seq and polyA depleted RNA-seq. (correct?)
What is the difference between the two experiments? I read that, in general, the polyA tail is the feature that distinguishes mRNA from the other kinds of RNA found in the cell, and so it is used to select the mRNA during the library preparation step. I could not find what 'polyA depleted RNA-seq' stands for. :/ Might it refer to a procedure aimed at better getting rid of the rRNA?
- In that case, how should I choose between the two?
Going further, after selecting a specific type of experiment, say polyA mRNA RNA-seq (https://www.encodeproject.org/search/?type=Experiment&biosample_term_name=K562&assay_title=polyA+mRNA+RNA-seq&biosample_term_name=K562&assay_title=polyA+mRNA+RNA-seq), I noticed that there are further specifications: while some experiments just report "Homo sapiens K562", others report something like "Homo sapiens K562 insoluble cytoplasmic fraction", "Homo sapiens K562 nuclear fraction" or "Homo sapiens K562 membrane fraction". What do these further specifications refer to? Do they indicate where in the cell the RNA was extracted from? (sorry, I am a n00b :/)
- How should I choose among these experiments? Since I am interested in overall gene expression, am I supposed to choose experiments on the cytoplasm, so that I can take into account the effect of short regulatory RNAs on the transcribed mRNA (which occurs in the cytoplasm)?
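In case it helps to reproduce the searches I am comparing, here is a minimal sketch that builds the search links for the two assay types. It only assumes the query parameters visible in the links above (`type`, `biosample_term_name`, `assay_title`); the helper name `encode_search_url` is my own and hypothetical, and nothing is downloaded.

```python
from urllib.parse import urlencode

# Search page of the ENCODE portal, as in the link above.
BASE = "https://www.encodeproject.org/search/"


def encode_search_url(assay_title, biosample="K562"):
    """Build a search URL for one assay type on one cell line.

    Hypothetical helper: it reuses only the query parameters that
    appear in the links quoted in this question.
    """
    params = [
        ("type", "Experiment"),
        ("biosample_term_name", biosample),
        ("assay_title", assay_title),
    ]
    # urlencode turns spaces into '+', matching the portal's own links.
    return BASE + "?" + urlencode(params)


if __name__ == "__main__":
    # The two assay types I consider eligible (see the list above).
    for title in ("polyA mRNA RNA-seq", "polyA depleted RNA-seq"):
        print(encode_search_url(title))
```

Each parameter appears only once here, whereas the link I pasted repeats them.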
I know this question is quite long, but answers to even just some of these points would be very helpful! Thank you (: