Comparing different datasets without having the FASTQ files
3.9 years ago
camillab. ▴ 160

Hi, is there any way to compare and plot the differences and similarities between different datasets? I have different datasets with counts already normalised by sequencing depth and gene length (so RPKM/FPKM). To compare how similar or dissimilar the samples are (e.g. by PCA), I should do a batch correction, since the data were not generated at the same time. Is there a way to do this on the normalised counts in R? How do I do a batch correction without having the raw counts?

Thanks for any suggestions.

Camilla

bulkRNAseq batch correction R RPKM • 1.8k views
3.9 years ago
ATpoint 82k

You have three major problems:

  1. Completely different organisms: Mouse and Zebrafish.
  2. Probably different library preparation and RNA extraction methods, so a technical wetlab bias.
  3. Probably different bioinformatics pipelines that produced these RPKMs.

That might make it quite difficult to combine results, so the attempt of making a standardized dataset as Kevin suggested below is probably reasonable here.

Still, what I would try is to compare the actual biological readouts, e.g. are the same cellular pathways up- and downregulated upon that treatment? This is the most relevant (and actually the only relevant) criterion, as it is the biological readout, whereas PCA and company are "only" statistical methods. For this you would need raw data to perform differential expression analysis, or at least a list of differential genes. Another approach would be to use Gene Set Enrichment Analysis (GSEA), using the top differential genes from each organism as gene sets and the results from the other organism as the query dataset.

If you can get the raw data then here is what I'd try:

1) Process with identical bioinformatics pipelines.

2) Perform differential analysis

3) Get enriched pathways per organism and compare either with a statistical test or simply by eye using your biological knowledge

4) Define gene sets from both organisms (hopefully the respective genes have homologs between the species), say the top 500 most up- or downregulated genes, and perform GSEA based on these gene sets.
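As a rough illustration of steps 3 and 4, here is a minimal sketch of comparing the top upregulated genes between the two species. It is written in plain Python so it runs without any packages, and it uses a simple overlap (hypergeometric) test rather than GSEA itself. The gene IDs, the fold changes and the one-to-one ortholog table `fish_to_mouse` are all made-up placeholders; in practice the ortholog mapping would come from a real homology resource.

```python
from math import comb

# Hypothetical log2 fold changes (treated vs untreated); IDs are placeholders.
mouse_lfc = {"m1": 3.0, "m2": 2.5, "m3": 2.0, "m4": 0.1, "m5": -0.2,
             "m6": -1.5, "m7": 0.3, "m8": -0.4, "m9": 0.0, "m10": -2.0}
fish_lfc = {"z1": 2.8, "z2": 1.9, "z3": 0.2, "z4": 2.2, "z5": -0.1,
            "z6": -1.0, "z7": 0.4, "z8": -0.3, "z9": 0.1, "z10": -1.8}

# Hypothetical one-to-one ortholog table (zebrafish ID -> mouse ID).
fish_to_mouse = {f"z{i}": f"m{i}" for i in range(1, 11)}

# Express the zebrafish results in mouse IDs; keep only genes seen in both.
fish_as_mouse = {fish_to_mouse[g]: lfc for g, lfc in fish_lfc.items()
                 if g in fish_to_mouse}
universe = sorted(set(mouse_lfc) & set(fish_as_mouse))


def top_up(lfc, genes, n):
    """Return the n genes with the largest log2 fold change."""
    return set(sorted(genes, key=lambda g: lfc[g], reverse=True)[:n])


def overlap_p_value(k, size_a, size_b, n_universe):
    """Hypergeometric P(overlap >= k) for two gene sets drawn from one universe."""
    upper = min(size_a, size_b)
    hits = sum(comb(size_a, i) * comb(n_universe - size_a, size_b - i)
               for i in range(k, upper + 1))
    return hits / comb(n_universe, size_b)


mouse_top = top_up(mouse_lfc, universe, 3)
fish_top = top_up(fish_as_mouse, universe, 3)
shared = mouse_top & fish_top
p_value = overlap_p_value(len(shared), len(mouse_top), len(fish_top), len(universe))
print(sorted(shared), round(p_value, 3))
```

With real data the set size would be more like the top 500 mentioned above, and genes without a clear ortholog would simply drop out of the universe.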


Thank you for the detailed comment! I am comparing the actual readouts, but I was wondering if I could show general differences/similarities using a PCA, which visually helps a lot to say "oh, they are completely different". My analysis, as you said, cannot be limited to PCA, and from reading I realised that without the actual raw data it's difficult to do (but I will try Kevin's suggestion). Thank you for all the suggestions!


I like that you try to perform multiple types of analysis. In my experience, though, these high-dimensional analyses are often not conclusive even if perfectly executed and even if you have quality data. Eventually all that matters is the biological readout, because the biology is what a reviewer looks at when you publish things. That said, if computational approaches support biological findings, that is awesome and increases confidence, but a non-conclusive computational result is, in your case here, imho not necessarily a problem if you see clear biological effects, which maybe you can even back up with additional experiments. Feel free to try out different approaches (this is a great exercise and you learn a lot on the way), but if time is a limiting factor I would always focus more on biological readouts than on trying to put together a purely computational analysis.

3.9 years ago

RPKMs/FPKMs are normalized values.


"I have different datasets with the counts already normalised by sequence depth and gene length (so RPKM/FPKM)."

Yes, that's what I said. Do you have any suggestions?


Try running a PCA or a QQ plot on the RPKM/FPKM values. Based on the results, you will know whether your samples have a batch effect.
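For what it's worth, here is a minimal sketch of that check, written in plain Python so it runs without any packages (in R, `prcomp()` on the log-transformed matrix would do the same job). It log2-transforms the values, centres each gene and extracts the first principal component by power iteration. The sample names and RPKM values are invented for illustration, and the log2(x + 1) transform is an assumption, not something prescribed above.

```python
import math

# Made-up RPKM values: 4 samples x 5 genes, two batches (A and B).
rpkm = {
    "A_untreated": [10, 50, 5, 100, 20],
    "A_treated":   [12, 55, 5, 90, 40],
    "B_untreated": [40, 10, 30, 100, 22],
    "B_treated":   [45, 12, 28, 95, 45],
}


def pc1_scores(matrix, n_iter=500):
    """Return (PC1 score per sample, fraction of variance explained by PC1)."""
    n, p = len(matrix), len(matrix[0])
    # Centre every gene (column) across samples.
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in matrix]
    # Sample-by-sample Gram matrix; its top eigenvector gives the PC1 scores.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centred] for r1 in centred]
    # Power iteration (assumes the data are not constant across samples).
    vec = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):
        new = [sum(gram[i][k] * vec[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    eigval = sum(vec[i] * sum(gram[i][k] * vec[k] for k in range(n)) for i in range(n))
    total = sum(gram[i][i] for i in range(n))
    return [math.sqrt(eigval) * x for x in vec], eigval / total


names = list(rpkm)
logged = [[math.log2(v + 1) for v in rpkm[s]] for s in names]
scores, explained = pc1_scores(logged)
for name, score in zip(names, scores):
    print(f"{name}: PC1 = {score:.2f}")
print(f"PC1 explains {explained:.0%} of the variance")
```

If the samples split on PC1 by dataset rather than by treatment, as they do in this toy example, that points to a batch (or here, study) effect dominating the signal.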


Thank you. And if my samples have a batch effect, how do I correct it if I don't have the raw counts but only the normalised ones?


Check how the samples group by experimental condition. @Kevin is one of the right people to take your query forward.


Do you have two different datasets representing two different conditions that you want to compare, with one condition entirely represented by one study and the other condition entirely represented by the other? Or are the conditions mixed across both studies?


Let's say that I have dataset 1: mouse samples, untreated/treated, and dataset 2: zebrafish samples, untreated/treated (same treatment, just a different model organism), and I want to see how similar mouse and zebrafish are in their response to the treatment. So I identified the genes shared across the two datasets and did the PCA. But from reading I realised that, since the datasets were sequenced in different labs at different times, I need to correct for batch effect before doing any kind of analysis, and I can't because I only have RPKM/FPKM, not the FASTQ files or raw counts. So are the results of my PCA correct (e.g. if they cluster)? And how can I batch correct my datasets if I only have the normalised counts? Thank you for your help!


I see - interesting! What I would do is convert each filtered dataset to standardised [Z] scores via the zFPKM package, and then re-do the PCA. If you see a batch effect visibly on PC1 versus PC2, then you could eliminate that via limma::removeBatchEffect(). Keep in mind that this is just a 'quick fix'.
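To make the arithmetic behind that quick fix concrete, here is a minimal sketch in plain Python (no packages needed). It is a simplified stand-in, not the actual algorithms: `zscore_sample` is an ordinary z-score of log2(FPKM + 1) within each sample, not the zFPKM transform, and `center_by_batch` just subtracts each gene's per-batch mean, which is the simplest form of what a batch-removal step does. The sample names and values are invented.

```python
import math
from statistics import mean, pstdev

# Made-up FPKM values: 4 samples x 4 shared genes; each dataset is one batch.
fpkm = {
    "mouse_untreated": [10.0, 50.0, 5.0, 100.0],
    "mouse_treated":   [12.0, 55.0, 5.0, 200.0],
    "fish_untreated":  [40.0, 10.0, 30.0, 100.0],
    "fish_treated":    [45.0, 12.0, 28.0, 210.0],
}
batch_of = {"mouse_untreated": "mouse", "mouse_treated": "mouse",
            "fish_untreated": "fish", "fish_treated": "fish"}


def zscore_sample(values):
    """Plain z-score of log2(x + 1) within one sample (assumed transform)."""
    logged = [math.log2(v + 1.0) for v in values]
    mu, sd = mean(logged), pstdev(logged)
    return [(x - mu) / sd for x in logged]


def center_by_batch(samples, batches):
    """Subtract, for every gene, the mean of that gene within each batch."""
    n_genes = len(next(iter(samples.values())))
    corrected = {}
    for batch in sorted(set(batches.values())):
        members = [s for s in samples if batches[s] == batch]
        batch_means = [mean(samples[s][j] for s in members) for j in range(n_genes)]
        for s in members:
            corrected[s] = [samples[s][j] - batch_means[j] for j in range(n_genes)]
    return corrected


z = {sample: zscore_sample(values) for sample, values in fpkm.items()}
corrected = center_by_batch(z, batch_of)
for sample, values in corrected.items():
    print(sample, [round(v, 2) for v in values])
```

Note that when each dataset is its own batch, as here, centring by batch also removes any genuine difference between the datasets; only the within-dataset contrasts (treated versus untreated) survive, which is one reason to treat this as a quick fix only.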


Thank you! I will try. Does it work with RPKMs as well?


Yes, it should be okay for RPKM, too.
