Question: gene expression analysis using RNA seq data
0
gravatar for maryam.akram001
4 weeks ago by
maryam.akram00110 wrote:

I am new with RNA seq. I want to identify Cancerous genes using RNA seq gene expression data but unable to understand what does the values in FPKM normalized data indicate e.g does zero means that gene is not expressed and larger values means gene is highly expressed? Will it be ok to use FPKM normalized RNA seq data for differential gene expression analysis? Plus the data I am using is log2 transformed...

Thanking in anticipation..

sequencing rna-seq gene • 421 views
ADD COMMENTlink modified 4 weeks ago by RamRS18k • written 4 weeks ago by maryam.akram00110
2
gravatar for Kevin Blighe
4 weeks ago by
Kevin Blighe30k
Republic of Ireland
Kevin Blighe30k wrote:

Will it be ok to use FPKM normalized RNA seq data for differential gene expression analysis ????

It will not be okay to use FPKM values for differential expression analyses, even if you have seen it in some major publication. Statisticians / bio-statisticians are not accustom to peer review those publications, but they ought to.

Please read these:

1, What the FPKM? A review of RNA-Seq expression units

In bold, at the top (regarding RPKM, FPKM, TPM, and CPM):

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

2, A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

In their key points:

The Total Count and RPKM normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

[Note - FPKM is essentially the same as RPKM]

--------------------------------------------------------

As sophiespo mentions, try to obtain the raw counts and then feed these into one of the more serious differential expression analysis tools.

The community needs to abandon RPKM and FPKM data.

Kevin

ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by Kevin Blighe30k
1

I very much agree that having p-values calculated from counts can be useful (and it important to be familiar with programs like edgeR / DESeq2), but I strongly disagree with the conclusion that "The community needs to abandon RPKM and FPKM data."

For me, I think having the independently calculated FPKM values can be extremely important in terms of testing the differential expression methods (which should be done for each project). For example, you have a gene that is known to be affected, shows clear differential expression in the FPKM values, but may not be differentially expressed with at least one method for your particular dataset. You can also visually compare heatmaps with genes identified using different methods this way.

I admittedly don't exactly have a formal publication to exactly show this (I am currently trying to help get the separate studies to the point of publication), but I do have a benchmark from a little while ago:

Warden et al. 2013 (click here for PDF)

In that situation, the number of samples is more than you would probably expect in a typical experiment, but the results match what I would typically expect in that log2(FPKM+0.1) values to be relatively more conservative than edgeR/DESeq2, and that matches most of what you see in Figure 6 (although there were some situations where the DESeq1 2-variable model, adjusting for patient, produced the gene list with the highest proportion of genes identified using all 3 methods in that study).

ADD REPLYlink modified 28 days ago • written 28 days ago by Charles Warden5.5k
1

Thanks for the input Charles! I can indeed see the utility of RPKM / FPKM values in certain situations. I probably should have said that 'the community ought to abandon ... when conducting differential expression analyses on large numbers of samples'. 'Large' is undefined here. Equally as there are critics of FPKM, there are critics of DESeq, EdgeR, and Limma. There are even critics of the BWA aligner. It is good to have diverse input / ideas on things.

ADD REPLYlink written 28 days ago by Kevin Blighe30k
1

Hi Kevin,

Actually, the study above used a fairly large amount of RNA-Seq samples: ERP001058

I know that some of the methods change the workflow a bit depending upon the sample size, and it would probably be interesting to see the effects of sample size on parameter estimation in different situations.

However, I would probably say most of my more recent experience has been benchmarks with smaller datasets (less than ~30 samples), which is likely what I think most people would be dealing with.

Also, I can definitely see how there could be some confusion, particularly since there was a paired comparison with a smaller number of samples paired from different technologies (from GSE37704, in Figure 7), as well as supplemental figures that are easy to miss (link to PDF here)

-Charles

ADD REPLYlink modified 27 days ago • written 27 days ago by Charles Warden5.5k

Actually i want to identify cancer related genes using python as programming language so you mean to say i must give raw counts data to my feature reduction algorithm and then classify??

ADD REPLYlink written 4 weeks ago by maryam.akram00110
1

All the mentionned packages (DESeq2, edgeR) are R packages. You should definitivelly try R for such analysis. Maybe you could use DESeq2 vsd transformed read count matrix for subsequent analysis in python

ADD REPLYlink written 4 weeks ago by Nicolas Rosewick6.8k

I do not know what is your feature reduction algorithm, but I am highly doubtful that it performs any sort of normalisation. Trying to obtain statistical inferences from raw counts in any setting would be a bad move.

As per Nicholas. I would encourage you to make the most of the opportunity and to learn some R skills, too. You can use the transformed counts (variance stabilised or regularised log) from DESeq2 for most downstream applications.

Of course, if you need guidance with regard to R, et cetera, then let us know. We are all learning each day and it would be wrong of anyone to assume that they have already learned enough - even Full Professors near retirement are learning.

You may make an initial start from here: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#count-matrix-input

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Kevin Blighe30k

I am using particle swarm optimization for feature selection.... and the data that i am using is also available in HTseq-Counts form some of the information regarding data is below.. type of data: gene expression RNAseq unit: log2(count+1) There is also a note written on the site that "Data from the same sample but from different vials/portions/analytes/aliquotes is averaged; data from different samples is combined into genomicMatrix; all data is then log2(x+1) transformed"

ADD REPLYlink written 4 weeks ago by maryam.akram00110

I see, I think that I know which data that is, in that case. It must be FPKM-UQ. In realising that FPKM was not ideal, the consortium created the FPKM-UQ method, which is like putting a cheap bandage on a deep wound. That data is normalised, though, and you would be able to use it for your algorithm. Just be aware that FPKM counts are not the greatest. Put it this way: if you used that data and ever wanted to publish, your paper would not be rejected. Best of luck.

As a final thing, just check the distribution of the data via histogram. It should generally follow a binomial distribution, which, I assume, the particle swarm optimization algorithm expects.

ADD REPLYlink written 4 weeks ago by Kevin Blighe30k

Thanks Kevin you are right its FPKM-UQ data.Now can you please guide me about what does the values in this data indicate i.e does 0 zero means gene is not expressed and some higher value means gene is highly expressed???

ADD REPLYlink written 4 weeks ago by maryam.akram00110

Yes, you can interpret it that zero means no expression. It follows, then, that higher values equate to higher expression. However, in FPKM, values are not comparable across different samples. In some extreme cases, FPKM 10 may be equivalent to FPKM 100 in another sample. You should consider converting the data to the Z-scale with the zFPKM R package, first.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Kevin Blighe30k

If i use htseq-counts data that is log2(x+1) transformed then normalize it using edgeR or Deseq 2 will i able to identify cancer related genes?

ADD REPLYlink written 29 days ago by maryam.akram00110
1

If you just take HTseq counts, then these are 'raw' counts and are not normalised. You may mean normalised counts that derived from HTseq.

If you can just point me to the exact data that you're using, then it may help. It is obviously TCGA data, and I do feel sorry for anyone new who is trying to get a grasp of this [TCGA] data. It is 'sprayed' all over the World Wide Web and is in various states of processing on various third party websites.

ADD REPLYlink written 29 days ago by Kevin Blighe30k

https://xenabrowser.net/datapages/ i am talking about TCGA datasets being mentioned in this link.. if we open a particular cancer type then after that under "gene expression RNAseq" there are three form available HTseq-counts,Htseq-FPKM,HTseq-FPKM-UQ..Plz guide me which is the best option to use in order to identify cancer related genes..

ADD REPLYlink modified 29 days ago • written 29 days ago by maryam.akram00110

HTseq-counts is the raw read count using htseq-count : https://htseq.readthedocs.io/en/master/count.html. So if you want to use DESeq2 or edgeR, you should use these files.

ADD REPLYlink written 29 days ago by Nicolas Rosewick6.8k

Thanks Nicolas. Maryam, you must mean this: h

HTSeq - Counts: these are the raw counts; however, Xena has incremented them and then logged (base 2). To get these back to their original form, you need to do:

(2^[counts]) - 1

You would then normalise the values from this in EdgeR or DESeq2.

Conversely, if you take HTSeq - FPKM / FPKM-UQ, you can just convert these to Z-scores using zFPKM package, and then just use those.

ADD REPLYlink written 29 days ago by Kevin Blighe30k

yes you are right..After normalizing with DESeq2 or edgeR will i be able to apply particle swarm optimization on normalized data using python??can you please guide me how to use this normalized data with python? Secondly what effect will it make on normalization results if don,t reverse log and subtract 1

ADD REPLYlink modified 29 days ago • written 29 days ago by maryam.akram00110

If you take just the HTseq - Counts data in its current form and just use that, then your conclusions from the data will contain error. Why? - because this data is normalised for neither within- or cross-sample biases.

ADD REPLYlink modified 28 days ago • written 28 days ago by Kevin Blighe30k

Kevin i have tried to normalize HTseq count data using edgeR .I used calcNormFactors function to obtain TMM normalized value but my output file shows colums like samples.lib.size,samples.norm.factors but its not giving me TMM values can you please guide me how to do this??

ADD REPLYlink written 17 days ago by maryam.akram00110

Good work, dude! You should use the cpm() function in order to extract the normalised counts. Please see here: https://support.bioconductor.org/p/77193/

ADD REPLYlink written 17 days ago by Kevin Blighe30k

Thanks for your kind words... CPM() will return me counts per million ?After that will i be able to perform cross sample analysis(differential expression analysis) on these values returned by CPM. i am unable to understand the logic behind it .

ADD REPLYlink written 15 days ago by maryam.akram00110

If you look at the vignette for EdgeR, there will be information on how you can conduct differential expression analysis. Take a look Here (page 19)

The cpm() transformation is mainly for downstream analyses.

ADD REPLYlink written 15 days ago by Kevin Blighe30k
1
gravatar for c.chakraborty
4 weeks ago by
c.chakraborty90 wrote:

Hey I am adding these links so you can go through to understand the normalization of FPKM/RPKM and TPM- https://statquest.org/2015/07/09/rpkm-fpkm-and-tpm-clearly-explained/ https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ FPKM of a gene (say ABB) is paired end reads normalized to total reads (all genes) sequenced and identified for a sample, per kilobase of the gene ABB. Also since you're going to work with RNA-Seq data, you might find this link useful for further downstream analysis. https://support.bioconductor.org/p/53986/

ADD COMMENTlink written 4 weeks ago by c.chakraborty90
1
gravatar for sophiespo
4 weeks ago by
sophiespo60
Vancouver
sophiespo60 wrote:

The most popular differential expression algorithms require that you supply raw counts, not FPKM.

DeSeq2

edgeR

Here's a good differential expression workflow using edgeR in R

ADD COMMENTlink written 4 weeks ago by sophiespo60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1802 users visited in the last hour