Gene expression values from ENCODE RNA-seq experiments
6.6 years ago

Hi there! I am struggling to figure out the most sensible way to retrieve gene expression levels from RNA-seq experiments.

I am working on a statistical regression model to predict the expression levels of protein-coding genes in the K562 cell line. To construct my dataset, I would like to retrieve such data and use them as targets for my learning algorithm. I started by looking at the polyA mRNA RNA-seq experiments from ENCODE. The K562 data can be found on this page:

https://www.encodeproject.org/search/?type=Experiment&biosample_term_name=K562&assay_title=polyA+mRNA+RNA-seq&biosample_term_name=K562&assay_title=polyA+mRNA+RNA-seq

Since I am not interested in expression levels for particular cell fractions or for treated cells, the only experiments I could use are the following:

ENCSR000CPH https://www.encodeproject.org/experiments/ENCSR000CPH/

ENCSR545DKY https://www.encodeproject.org/experiments/ENCSR545DKY/

ENCSR637VLS https://www.encodeproject.org/experiments/ENCSR637VLS/

ENCSR000AEM https://www.encodeproject.org/experiments/ENCSR000AEM/

ENCSR000AEO https://www.encodeproject.org/experiments/ENCSR000AEO/

ENCSR000AEQ https://www.encodeproject.org/experiments/ENCSR000AEQ/

ENCSR000AEP https://www.encodeproject.org/experiments/ENCSR000AEP/

Each experiment comes with two TSV gene quantification files (one per experimental replicate).

Do I have to choose one experiment among those? How do I choose? Would it be more sensible to average the values from all of them, or from only some of them?

For example, one possibility would be to start by discarding experiments with poor replicate concordance. But how should I proceed after that? Is there a common or best practice for this kind of situation?

The point is that the expression levels of the same gene can vary a lot between experiments. As an example, I've written a short Python script to quickly visualize the values for some genes across the experiments. The output for three selected genes is the following:

(values reported are FPKM)

_ HBG2 gene _

-> ENCSR000AEP  ['51188.44', '53798.70']

-> ENCSR000AEQ  ['49073.15', '57095.34']

-> ENCSR545DKY  ['7472.63', '7525.21']

-> ENCSR000AEO  ['12755.06', '17592.64']

-> ENCSR000AEM  ['14166.68', '12213.92']

-> ENCSR000CPH  ['29708.56', '24388.77']

-> ENCSR637VLS  ['1085.27', '3802.39']


_ EEF1A1 gene _

-> ENCSR000AEP  ['4602.76', '3554.07']

-> ENCSR000AEQ  ['5766.79', '5677.71']

-> ENCSR545DKY  ['7170.40', '7099.77']

-> ENCSR000AEO  ['7707.64', '8192.11']

-> ENCSR000AEM  ['7303.17', '7393.54']

-> ENCSR000CPH  ['12497.78', '14761.97']

-> ENCSR637VLS  ['17373.10', '14533.15']


_ RPS18 gene _

-> ENCSR000AEP  ['12494.79', '12824.74']

-> ENCSR000AEQ  ['9479.93', '10694.58']

-> ENCSR545DKY  ['4377.64', '4339.89']

-> ENCSR000AEO  ['4347.17', '4755.57']

-> ENCSR000AEM  ['7874.33', '7727.26']

-> ENCSR000CPH  ['7877.68', '10201.42']

-> ENCSR637VLS  ['8802.88', '6608.08']

Notice, for instance, that the values for the HBG2 gene range from about 1k to over 50k, even though all the experiments were carried out on untreated K562 cells (or at least they should have been).

How should I deal with this kind of situation?
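To put a number on that spread, here is a minimal sketch that averages the two replicates of each experiment and reports the ratio between the highest and lowest experiment per gene. The FPKM values are copied from the listing above; the names `fpkm`, `replicate_means` and `fold_range` are made up for this illustration.

```python
from statistics import mean

# FPKM values copied from the listing above (two replicates per experiment).
fpkm = {
    "HBG2": {
        "ENCSR000AEP": [51188.44, 53798.70],
        "ENCSR000AEQ": [49073.15, 57095.34],
        "ENCSR545DKY": [7472.63, 7525.21],
        "ENCSR000AEO": [12755.06, 17592.64],
        "ENCSR000AEM": [14166.68, 12213.92],
        "ENCSR000CPH": [29708.56, 24388.77],
        "ENCSR637VLS": [1085.27, 3802.39],
    },
    "EEF1A1": {
        "ENCSR000AEP": [4602.76, 3554.07],
        "ENCSR000AEQ": [5766.79, 5677.71],
        "ENCSR545DKY": [7170.40, 7099.77],
        "ENCSR000AEO": [7707.64, 8192.11],
        "ENCSR000AEM": [7303.17, 7393.54],
        "ENCSR000CPH": [12497.78, 14761.97],
        "ENCSR637VLS": [17373.10, 14533.15],
    },
    "RPS18": {
        "ENCSR000AEP": [12494.79, 12824.74],
        "ENCSR000AEQ": [9479.93, 10694.58],
        "ENCSR545DKY": [4377.64, 4339.89],
        "ENCSR000AEO": [4347.17, 4755.57],
        "ENCSR000AEM": [7874.33, 7727.26],
        "ENCSR000CPH": [7877.68, 10201.42],
        "ENCSR637VLS": [8802.88, 6608.08],
    },
}


def replicate_means(per_experiment):
    """Average the replicates of each experiment for one gene."""
    return {exp: mean(reps) for exp, reps in per_experiment.items()}


def fold_range(per_experiment):
    """Ratio between the highest and lowest replicate-averaged experiment."""
    means = replicate_means(per_experiment).values()
    return max(means) / min(means)


for gene, per_experiment in fpkm.items():
    print(f"{gene}: max/min of replicate means = {fold_range(per_experiment):.1f}x")
```

On these numbers the spread for HBG2 is roughly 20-fold, while EEF1A1 and RPS18 stay within about 3- to 4-fold.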

Thanks in advance,

fabrizio

RNA-Seq mRNA ENCODE K562 • 2.5k views

If the data were generated with different protocols (sequencing instrument, stranded/unstranded, single-end/paired-end), it is quite possible for the expression levels to vary a lot. You need to normalize the data again before using them all together. I am not sure which normalization method is best, but that should be the next step.


As geek mentioned, these differences in expression are expected. I've just looked at two of the experiments you linked, and they used different strand-specific protocols, sequencers, and sample preparation protocols. Certain genes, such as the hemoglobin (HB) subunit genes, are also expected to be highly variable.

I would build a table of these technical parameters and then decide which experiments to use. Then get the FASTQ files for each one and requantify them with a fast abundance-estimation tool such as Kallisto. When you normalise the counts, you will have to adjust for experiment (batch), sequencer, paired-end versus single-end layout, and anything else that is likely to bias the counts.
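As a rough illustration of the "table of technical parameters" idea, the sketch below groups experiments that share the same technical profile, so you can see which ones are directly comparable and which differences would need to be treated as batch covariates. Everything in `metadata` is a placeholder: the experiment names and the sequencer, layout and strandedness values are hypothetical and would have to be filled in by hand from each ENCODE experiment page.

```python
from collections import defaultdict

# Hypothetical metadata table: one row per experiment.
# These are NOT the real parameters of any ENCODE experiment.
metadata = [
    {"experiment": "exp_A", "sequencer": "platform_1", "layout": "paired-end", "stranded": True},
    {"experiment": "exp_B", "sequencer": "platform_1", "layout": "paired-end", "stranded": True},
    {"experiment": "exp_C", "sequencer": "platform_2", "layout": "single-end", "stranded": False},
    {"experiment": "exp_D", "sequencer": "platform_2", "layout": "paired-end", "stranded": True},
]


def group_by_protocol(rows, keys=("sequencer", "layout", "stranded")):
    """Group experiment names by their combination of technical parameters."""
    groups = defaultdict(list)
    for row in rows:
        profile = tuple(row[key] for key in keys)
        groups[profile].append(row["experiment"])
    return dict(groups)


groups = group_by_protocol(metadata)
for profile, experiments in groups.items():
    print(profile, "->", experiments)
```

Experiments that end up in the same group can be pooled with less worry; each distinct profile is a candidate batch label to carry into whatever normalisation you apply afterwards.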

It's perfectly reasonable to average counts over replicates, or to first analyse them separately and see how they line up by PCA. I've done this in the past and saw correlations between replicates of >0.99999999 (Pearson), with a P value of effectively zero.
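A minimal sketch of that replicate check, assuming each replicate has been loaded into a dict mapping gene name to FPKM: it computes the Pearson correlation on log2(FPKM + 1) and then averages the two replicates. The three values per replicate are the ENCSR545DKY numbers from the question, used only to make the example runnable; a real check would use all genes, and three points are far too few to say anything about concordance. The PCA step is left out here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log2p1(values):
    """log2(value + 1), a common transform for expression values."""
    return [math.log2(v + 1.0) for v in values]


def average_replicates(rep1, rep2):
    """Mean of two replicates for every gene present in both."""
    return {gene: (rep1[gene] + rep2[gene]) / 2 for gene in rep1.keys() & rep2.keys()}


# ENCSR545DKY FPKM values from the question (replicate 1 and replicate 2).
rep1 = {"HBG2": 7472.63, "EEF1A1": 7170.40, "RPS18": 4377.64}
rep2 = {"HBG2": 7525.21, "EEF1A1": 7099.77, "RPS18": 4339.89}

genes = sorted(rep1.keys() & rep2.keys())
r = pearson(log2p1([rep1[g] for g in genes]), log2p1([rep2[g] for g in genes]))
averaged = average_replicates(rep1, rep2)

print(f"Pearson r on log2(FPKM + 1): {r:.4f}")
print(averaged)
```

An experiment whose replicates correlate poorly under a check like this would be a natural candidate to drop before averaging.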
