Question: Integrating HTSeq count data of different samples
0
3.9 years ago by
kozhaki.seq50
Korea, Republic Of
kozhaki.seq50 wrote:

I am building a resource for estimating gene expression levels across many plant tissues using RNA-Seq data. I have collected datasets from different experiments from GEO and other sources. Using HTSeq, I estimate the counts for each sample (i.e., samples from different experiments). Finally, I merge all the datasets into a single source, so that the expression level of a gene can be viewed across all samples (using a heatmap of the count data). However, I am concerned about the validity of my method. Could anyone comment on my strategy?

I have two specific questions:

1. Is it valid to merge the data, given that the different experiments may have a 'batch effect'?

2. If it is OK to merge the samples, should I use the HTSeq count data or FPKM for the heatmap?

 

Thanks.

tool fpkm rnaseq htseq • 2.5k views
modified 3.9 years ago by kangyueapril80 • written 3.9 years ago by kozhaki.seq50

What do you mean by merging samples?

Generally it should be OK to take different GEO datasets and compare them, provided they have similar experimental designs and different conditions/cell lines.

How much does the number of reads per sample vary across the samples?

You need to normalise the data before you plot any heatmaps.
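
As a minimal sketch of what "normalise before plotting" can mean, the Python snippet below scales raw counts to counts per million (CPM) and log-transforms them, so that samples sequenced to different depths become comparable in a heatmap. The gene names and counts are made up for illustration, and this simple library-size scaling is only one option; it is not what DESeq or edgeR do internally.

```python
import math


def counts_per_million(counts):
    """Scale one sample's raw gene counts to counts per million (CPM)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counted reads")
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def log2_cpm(counts, pseudocount=1.0):
    """Return log2(CPM + pseudocount), a convenient scale for heatmaps."""
    cpm = counts_per_million(counts)
    return {gene: math.log2(value + pseudocount) for gene, value in cpm.items()}


# Hypothetical samples: same expression profile, 10x difference in depth.
sample_a = {"geneA": 100, "geneB": 300, "geneC": 600}
sample_b = {"geneA": 1000, "geneB": 3000, "geneC": 6000}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)
heat_a = log2_cpm(sample_a)
heat_b = log2_cpm(sample_b)
```

After scaling, the two samples have identical values, whereas a heatmap of the raw counts would mostly show the difference in sequencing depth.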

modified 3.9 years ago • written 3.9 years ago by geek_y9.1k
1
3.9 years ago by
mark.ziemann1.1k
Australia/Melbourne/Geelong/Deakin
mark.ziemann1.1k wrote:

1. Is it valid to merge the data, given that the different experiments may have a 'batch effect'?

Yes, there will be a batch effect for many technical reasons, but unless you're going to perform the experiment again, you don't have much choice. Still, I would recommend validating some of the major findings in your own plant tissues with a method like RT-qPCR to show that the RNA-seq trends are real.

2. If it is OK to merge the samples, should I use the HTSeq count data or FPKM for the heatmap?

As geek_y states, you do need to normalise the data because each dataset will have a different number of tags. FPKM is a widely accepted method for doing this.
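
For reference, FPKM divides each gene's fragment count by the gene length in kilobases and by the total number of counted fragments in millions. The Python sketch below shows that arithmetic only. The gene lengths are hypothetical and would have to come from your annotation, since htseq-count reports counts but not lengths.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of gene per million counted fragments.

    counts     : dict gene -> fragment count for one sample
    lengths_bp : dict gene -> gene (or summed exon) length in base pairs
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counted fragments")
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return {gene: c * 1e9 / (lengths_bp[gene] * total) for gene, c in counts.items()}


# Hypothetical counts and lengths for one sample.
counts = {"geneA": 500, "geneB": 500, "geneC": 1000}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 4000}

values = fpkm(counts, lengths_bp)
```

Here geneA and geneB have the same count, but geneB is twice as long, so its FPKM is half as large.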

 

written 3.9 years ago by mark.ziemann1.1k
2

FPKM is not normalisation. Always normalise. See, for example, the DESeq2 or edgeR packages in R.

written 3.9 years ago by Danielk560
1

Presumably you meant, "FPKMs are not widely accepted as normalized values", which would indeed be true. They are normalized values, it's just that the method is easily biased and the resulting values less useful for statistics.

modified 3.9 years ago by Sean Davis25k • written 3.9 years ago by Devon Ryan88k
1

True, Devon & Danielk. FPKM is not a robust method for determining differential expression, but it would be OK for visualising genes of interest in a heatmap, as the OP requires.

written 3.9 years ago by mark.ziemann1.1k
1
3.9 years ago by
Gary450
Taiwan/Taichung/China Medical University Hospital
Gary450 wrote:

Danielk, Devon, and Mark are right. TMM (edgeR) and DESeq are much better than FPKM. Below is a good paper, with its key points, for your reference.

Dillies, et al. (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform. 14, 671-683.

Key points

1. Normalization of RNA-seq data in the context of differential analysis is essential in order to account for the presence of systematic variation between samples as well as differences in library composition.

2. The Total Count and RPKM normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

3. Only the DESeq and TMM normalization methods are robust to the presence of different library sizes and widely different library compositions, both of which are typical of real RNA-seq data.
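
To give a feel for why the DESeq approach tolerates different library sizes, here is a simplified Python sketch of the median-of-ratios idea: each sample's size factor is the median, over genes, of its count divided by that gene's geometric mean across samples. This is an illustration under stated assumptions (made-up counts, genes with a zero in any sample skipped), not the DESeq package's implementation, and TMM is not shown.

```python
import math
from statistics import median


def size_factors(count_matrix):
    """Median-of-ratios size factors.

    count_matrix : dict sample -> list of raw counts, same gene order in every sample
    """
    samples = list(count_matrix)
    n_genes = len(count_matrix[samples[0]])
    # Only genes with a non-zero count in every sample have a usable geometric mean.
    usable = [i for i in range(n_genes)
              if all(count_matrix[s][i] > 0 for s in samples)]
    if not usable:
        raise ValueError("no gene is expressed in every sample")
    # Log of the per-gene geometric mean across samples.
    log_geo_mean = {
        i: sum(math.log(count_matrix[s][i]) for s in samples) / len(samples)
        for i in usable
    }
    return {
        s: math.exp(median(math.log(count_matrix[s][i]) - log_geo_mean[i]
                           for i in usable))
        for s in samples
    }


# Hypothetical data: s2 was sequenced twice as deeply as s1.
matrix = {
    "s1": [10, 20, 30, 0],
    "s2": [20, 40, 60, 5],
}
sf = size_factors(matrix)
normalized = {s: [c / sf[s] for c in matrix[s]] for s in matrix}
```

Dividing each sample by its size factor puts the three shared genes on the same scale in both samples. Because the factor is a median of per-gene ratios, a handful of very highly expressed genes has little influence on it.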

written 3.9 years ago by Gary450
1
3.9 years ago by
kangyueapril80
United States
kangyueapril80 wrote:

I don't know which software you plan to use downstream, but most people use DESeq, edgeR or limma-voom for normalization and DEG analysis. In all three packages, a size factor is calculated for every sample, and each sample is normalized by its own size factor. If your data come from a single batch, you can simply export the count matrix after normalization. If not, then when you do DEG analysis, or any other kind of analysis, the influence of condition and batch should be considered at the same time. When you make the design matrices, they will look like d1 = model.matrix(~ -1 + condition + batch, data) and d0 = model.matrix(~ -1 + condition, data). Then use d1 and d0 to build model1 and model0 (how to build them depends on the software you use). Comparing model1 with model0 is what separates out the batch influence.
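
To make the design matrix concrete, the Python sketch below builds the same kind of indicator columns that the R formula ~ -1 + condition + batch produces: one column per condition level, plus one column per batch level except the first, which acts as the reference. The sample labels are hypothetical, the helper is for illustration only, and levels are ordered with Python's default string sort, which may differ from R's locale-dependent ordering.

```python
def design_matrix(conditions, batches):
    """Indicator columns analogous to model.matrix(~ -1 + condition + batch).

    conditions, batches : per-sample labels, in the same sample order
    Returns (column_names, rows) with one row of 0/1 values per sample.
    """
    cond_levels = sorted(set(conditions))
    # Without an intercept, the first factor keeps all of its levels;
    # later factors drop their first level, which becomes the reference.
    batch_levels = sorted(set(batches))[1:]
    names = (["condition" + c for c in cond_levels]
             + ["batch" + b for b in batch_levels])
    rows = []
    for cond, batch in zip(conditions, batches):
        row = ([int(cond == c) for c in cond_levels]
               + [int(batch == b) for b in batch_levels])
        rows.append(row)
    return names, rows


# Hypothetical layout: two tissues, each present in both source experiments.
conditions = ["leaf", "leaf", "root", "root"]
batches = ["GSE_A", "GSE_B", "GSE_A", "GSE_B"]

names, rows = design_matrix(conditions, batches)
```

Note that batch can only be separated from condition when they are not completely confounded. If every tissue comes from exactly one experiment, the batch and condition columns carry the same information and no model can tell the two effects apart.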

 

written 3.9 years ago by kangyueapril80

Thanks, all, for your views and comments. Now I understand the pitfalls!

written 3.9 years ago by kozhaki.seq50