I am building a resource to estimate gene expression levels across many plant tissues using RNA-Seq data. I have collected datasets from different experiments from GEO and other sources. Using HTSeq, I estimate the read counts for each sample (i.e., samples from different experiments). Finally, I merge all the datasets into a single source, so that the expression level of a gene can be viewed across all samples (using a heatmap of the count data). However, I am concerned about whether this approach is sound. Could anyone comment on my strategy?
I have two specific questions:
- Is it valid to merge the data, given that the different experiments may introduce a batch effect?
- If it is OK to merge the samples, should I use the HTSeq count data or FPKM for the heatmap?
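
To make the merging step concrete, here is a minimal Python sketch of what I mean. It assumes each sample is a tab-separated htseq-count output (gene ID, then count) and that recent HTSeq versions prefix their summary rows with `__`. It uses log2 counts-per-million as one possible depth normalisation, which is my own stand-in here and not FPKM (FPKM would also need gene lengths). The function names and the toy values are illustrative only, not my real pipeline or data, and nothing in it corrects for batch effects.

```python
import math

def parse_htseq(lines):
    """Parse htseq-count style lines ("gene<TAB>count") into a dict."""
    counts = {}
    for line in lines:
        line = line.strip()
        # Summary rows such as "__no_feature" are not genes; skip them.
        if not line or line.startswith("__"):
            continue
        gene, value = line.split("\t")
        counts[gene] = int(value)
    return counts

def log2_cpm(counts, pseudocount=1.0):
    """Scale one sample to counts per million, then log2-transform."""
    total = sum(counts.values())
    return {g: math.log2(c / total * 1e6 + pseudocount) for g, c in counts.items()}

def merge_samples(samples):
    """Build a gene -> [value per sample] table; genes missing from a sample get None."""
    names = sorted(samples)
    genes = sorted(set().union(*(samples[n].keys() for n in names)))
    table = {g: [samples[n].get(g) for n in names] for g in genes}
    return names, table

# Toy input (made-up values), one list of lines per sample.
raw = {
    "leaf": ["geneA\t10", "geneB\t90", "__no_feature\t5"],
    "root": ["geneA\t500", "geneB\t500", "__no_feature\t7"],
}
normalised = {name: log2_cpm(parse_htseq(lines)) for name, lines in raw.items()}
sample_names, matrix = merge_samples(normalised)
```

Each sample is normalised on its own before merging, so differences in sequencing depth between experiments do not dominate the heatmap; the rows of `matrix` are what I would plot.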