Question

Comparing different datasets without having the FASTQ files


3.9 years ago

camillab. ▴ 160

Hi, is there any way to compare and plot the differences and similarities between different datasets? I have several datasets with counts already normalised by sequencing depth and gene length (so RPKM/FPKM). To compare how similar or dissimilar the samples are (e.g. by PCA), I should do a batch correction, since the data were not generated at the same time. Is there a way to do this on the normalised counts in R? How do I do a batch correction without having the raw counts?

Thanks for any suggestions.

Camilla

bulkRNAseq batch correction R RPKM • 1.8k views

Updated 3.9 years ago by ATpoint 82k • written 3.9 years ago by camillab. ▴ 160

score 2 · Answer 1 · 2020-06-13

You have three major problems:

1) Completely different organisms: mouse and zebrafish.

2) Probably different library preparation and RNA extraction methods, so a technical wet-lab bias.

3) Probably different bioinformatics pipelines that produced these RPKMs.

That might make it quite difficult to combine results, so attempting to build a standardized dataset, as Kevin suggested below, is probably reasonable here.

Still, what I would try is to compare the actual biological readouts, e.g. whether the same cellular pathways are up- and downregulated upon that treatment. This is the most relevant (and actually the only relevant) criterion, as it is the biological readout, whereas PCA and similar approaches are "only" statistical methods. For this you would need raw data to perform a differential expression analysis, or at least a list of differential genes. Another approach would be Gene Set Enrichment Analysis: use the top differential genes from each organism as gene sets, and the results from the other organism as the query dataset on which to run GSEA against those gene sets.

If you can get the raw data then here is what I'd try:

1) Process with identical bioinformatics pipelines.

2) Perform differential analysis.

3) Get enriched pathways per organism and compare them, either with a statistical test or simply by eye using your biological knowledge.

4) Define gene sets from both organisms (hopefully the respective genes have homologs between the species), say the top 500 most up- or downregulated genes, and perform GSEA based on these gene sets.
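
The core of step 4 can be sketched roughly as follows. This is a simplified, unweighted running-sum enrichment score (the Kolmogorov-Smirnov-like statistic that GSEA builds on), not the GSEA software itself. The gene names, the homolog table and the ranking are all made up for illustration, and plain Python is used only to keep the sketch dependency-free.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score: walk down the ranked list, step up on
    gene-set members and down on non-members, and return the largest
    deviation from zero (sign kept)."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical mouse -> zebrafish homolog table (assumption, not real mapping data).
homologs = {"Tp53": "tp53", "Myc": "myca", "Cdkn1a": "cdkn1a", "Gapdh": "gapdh"}

# Hypothetical top up-regulated mouse genes; genes without a homolog are dropped.
mouse_top_up = ["Tp53", "Cdkn1a", "Mdm2"]
gene_set = {homologs[g] for g in mouse_top_up if g in homologs}

# Hypothetical zebrafish genes, ranked from most up- to most down-regulated.
zebrafish_ranked = ["cdkn1a", "tp53", "gapdh", "actb1", "myca", "rpl13"]

es = enrichment_score(zebrafish_ranked, gene_set)
print(f"enrichment score: {es:.2f}")
```

A score near +1 means the mouse-derived gene set sits at the top of the zebrafish ranking, a score near -1 means it sits at the bottom. Judging significance would additionally need a permutation test, which this sketch leaves out.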

score 1 · Answer 2 · 2020-06-12


3.9 years ago

cpad0112 21k

RPKMs/FPKMs are normalized values.

Posted 3.9 years ago by cpad0112 21k

"I have different datasets with the counts already normalised by sequencing depth and gene length (so RPKM/FPKM)."

Yes, that's what I said. Do you have any suggestions?

Reply • 3.9 years ago by camillab. ▴ 160

Try running a PCA or a QQ plot on the RPKM/FPKM values. Based on the results, you will know whether your samples show a batch effect.
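
A rough sketch of such a check, assuming log-transformed RPKM/FPKM values for genes shared by all samples. It computes only the first principal component by power iteration, in plain Python to stay dependency-free; in R one would normally just call prcomp(). The numbers and batch labels are made up.

```python
import math

def first_pc_scores(samples, n_iter=200):
    """Project each sample onto the first principal component, found by
    power iteration on X^T X (X = gene-wise centred data matrix)."""
    n, p = len(samples), len(samples[0])
    col_means = [sum(s[k] for s in samples) / n for k in range(p)]
    x = [[s[k] - col_means[k] for k in range(p)] for s in samples]
    v = [1.0 / math.sqrt(p)] * p  # arbitrary starting direction
    for _ in range(n_iter):
        scores = [sum(row[k] * v[k] for k in range(p)) for row in x]
        w = [sum(x[i][k] * scores[i] for i in range(n)) for k in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return [sum(row[k] * v[k] for k in range(p)) for row in x]

# Hypothetical log2(FPKM + 1) values: one row per sample, one column per gene.
samples = [
    [3.5, 2.6, 6.7],  # dataset 1, untreated
    [3.7, 2.9, 6.8],  # dataset 1, treated
    [5.4, 4.4, 8.2],  # dataset 2, untreated
    [5.5, 4.6, 8.3],  # dataset 2, treated
]
batch = ["dataset1", "dataset1", "dataset2", "dataset2"]

scores = first_pc_scores(samples)
for b, s in zip(batch, scores):
    print(f"{b}: PC1 = {s:+.2f}")
```

If PC1 splits the samples by dataset rather than by treatment, as it does with these toy values, the dominant source of variation is the dataset of origin, which is what a batch effect looks like on a PCA plot.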

Reply • 3.9 years ago by cpad0112 21k

Thank you. And if my samples have a batch effect, how do I correct it if I don't have the raw counts but only the normalised ones?

Reply • 3.9 years ago by camillab. ▴ 160

Check how the samples group by experimental condition. @Kevin is one of the right people to take your query forward.

Reply • 3.9 years ago by cpad0112 21k

Do you have two different datasets representing 2 different conditions that you want to compare, with one condition represented entirely by one study and the other condition represented entirely by the other? Or are the conditions mixed across both studies?

Reply • 3.9 years ago by Kevin Blighe 87k

Let's say that I have dataset1: mouse samples, untreated/treated, and dataset2: zebrafish samples, untreated/treated (same treatment, just a different model organism), and I want to see how similar mouse and zebrafish are in their response to the treatment. I identified the genes shared across the two datasets and did the PCA. But from reading I realized that, since the datasets were sequenced in different labs at different times, I need to correct for batch effects before doing any kind of analysis, and I can't, because I don't have FASTQ files or raw counts, only RPKM/FPKM. So are the results of my PCA correct (e.g. if they cluster)? And how can I batch correct my datasets if I have only the normalized counts? Thank you for your help!

Reply • 3.9 years ago by camillab. ▴ 160

I see - interesting! What I would do is convert each filtered dataset to standardised [Z] scores via the zFPKM package, and then re-do the PCA. If you see a visible batch effect on PC1 versus PC2, then you could eliminate that via limma::removeBatchEffect(). Keep in mind that this is just a 'quick fix'.
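
A minimal sketch of the two-step idea, written in plain Python so it has no package dependencies. It uses stand-ins, not the packages' own methods: a plain per-sample z-score of log2(FPKM + 1) instead of the zFPKM transform, and per-gene batch mean centring instead of limma's linear-model fit. All values are made up.

```python
import math
from statistics import mean, pstdev

# Hypothetical FPKM values: one entry per gene, one value per sample.
fpkm = {
    "geneA": [10.0, 12.0, 40.0, 44.0],
    "geneB": [5.0, 30.0, 20.0, 90.0],
    "geneC": [100.0, 110.0, 300.0, 310.0],
}
# Batch label per sample (here batch coincides with organism).
batch = ["mouse", "mouse", "zebrafish", "zebrafish"]

genes = list(fpkm)
n_samples = len(batch)

# Step 1: log2-transform, then z-score each sample across its genes.
log_expr = {g: [math.log2(v + 1.0) for v in fpkm[g]] for g in genes}
z = {g: [0.0] * n_samples for g in genes}
for j in range(n_samples):
    column = [log_expr[g][j] for g in genes]
    mu, sd = mean(column), pstdev(column)
    for g in genes:
        z[g][j] = (log_expr[g][j] - mu) / sd

# Step 2: for each gene, subtract the mean of the sample's own batch.
# This removes an additive per-gene shift between batches.
corrected = {}
for g in genes:
    batch_means = {
        b: mean([v for v, lab in zip(z[g], batch) if lab == b])
        for b in set(batch)
    }
    corrected[g] = [v - batch_means[lab] for v, lab in zip(z[g], batch)]

for g in genes:
    print(g, [round(v, 2) for v in corrected[g]])
```

Because batch and organism coincide exactly in this design, centring by batch also removes any genuine organism-level difference. Only the within-dataset contrast (treated versus untreated) survives, which is one reason to treat this as a quick fix.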

Reply • 3.9 years ago by Kevin Blighe 87k

Thank you! I will try. Does it work with RPKMs as well?

Reply • 3.9 years ago by camillab. ▴ 160

Yes, it should be okay for RPKM, too.

Reply • 3.9 years ago by Kevin Blighe 87k