Question: Batch Correction and Batch Size
asked 5.0 years ago by tasjfasfankihj, who wrote:

I have pooled 123 samples from two GEO antibody microarray studies that used the same platform. I downloaded the raw .gpr files and opened each one in Excel to read the scan date of each sample (presumably represented by the DateTime variable), which I recorded in another Excel sheet.
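As an aside, the DateTime header can be pulled from each .gpr file programmatically rather than by opening the files in Excel. A minimal Python sketch, assuming GenePix's ATF-style header in which a quoted "DateTime=..." line appears near the top of the file (the scan_date helper is illustrative, not part of any package):

```python
import re

def scan_date(gpr_path, max_header_lines=50):
    """Return the DateTime value from a GenePix .gpr header, or None.

    Assumes an ATF-style header where one of the first lines looks
    like: "DateTime=2010/11/09 13:02:51" (with surrounding quotes).
    """
    with open(gpr_path, errors="replace") as fh:
        for i, line in enumerate(fh):
            if i >= max_header_lines:
                break  # give up once we are past the header block
            m = re.search(r'DateTime=([^"\r\n]+)', line)
            if m:
                return m.group(1).strip()
    return None
```

Looping this over all 123 files yields a scan date per sample with no manual copying.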

My understanding is that two samples scanned on different dates belong to different batches. If so, the 123 samples break down into the following batches:

Batch 1: 4 samples
Batch 2: 2 samples
Batch 3: 2 samples
Batch 4: 4 samples
Batch 5: 4 samples
Batch 6: 8 samples
Batch 7: 8 samples
Batch 8: 8 samples
Batch 9: 8 samples
Batch 10: 8 samples
Batch 11: 12 samples
Batch 12: 7 samples
Batch 13: 12 samples
Batch 14: 3 samples
Batch 15: 6 samples
Batch 16: 2 samples
Batch 17: 4 samples
Batch 18: 1 sample
Batch 19: 3 samples
Batch 20: 3 samples
Batch 21: 4 samples
Batch 22: 2 samples
Batch 23: 4 samples
Batch 24: 2 samples
Batch 25: 2 samples

Should I keep the above delineation of batches, or should I combine the small ones? Any advice?

Also, batches 1-14 were scanned between 11/9/2010 and 12/17/2010, while batches 15-25 were scanned between 3/23/2012 and 4/27/2012.

modified 5.0 years ago • written 5.0 years ago by tasjfasfankihj

Make a PCA plot and/or cluster the samples and see how they group; that's usually an effective way to gauge batch effects. Also, have a look at ComBat() in the sva Bioconductor package.

written 5.0 years ago by Devon Ryan
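The PCA check suggested above can be sketched as follows. This is illustrative Python using a plain SVD rather than the R/Bioconductor tooling discussed in the thread; the pca_scores helper and the toy data are assumptions, not from the thread. A constant per-batch intensity shift shows up as separation along PC1:

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project samples (rows of X) onto their first principal components."""
    Xc = X - X.mean(axis=0)            # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T    # per-sample scores

# Toy data: two "batches" of 6 samples x 50 features, the second
# shifted by a constant offset to mimic a scanner/batch effect.
rng = np.random.default_rng(0)
batch1 = rng.normal(0.0, 1.0, size=(6, 50))
batch2 = rng.normal(2.0, 1.0, size=(6, 50))
scores = pca_scores(np.vstack([batch1, batch2]))
# Plotting scores[:, 0] vs scores[:, 1], colored by batch, would show
# the two groups separating along PC1.
```

If samples cluster by scan date rather than by biology on such a plot, batch adjustment (e.g. ComBat) is warranted.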

Also, never ever use Excel for anything bioinformatics.

written 5.0 years ago by 5heikki

Thanks for the response, but that didn't really answer my question, although I probably wasn't being very clear. Basically:

1. Am I correct in organizing the 123 samples into 25 batches the way that I did? Since posting this question, I've realized that each sample's .gpr file also has, along with DateTime, a GalFile variable with values such as: GalFile = C:\Users\Genepix\Desktop\ProtoArray\  The item of interest here is HA20251, which I recall seeing somewhere in the provided .xls workbook of processed data as a "lot number". Should I consider a batch to be "samples with the same lot number" (i.e., one batch would be all the samples with "HA20251" in their GalFile path), or should I keep my batch definition as "all samples with the same day in their DateTime variable"?

Essentially, I'm hoping to extract from the provided data files an explicit batch identifier for each sample, to be used in a targets file, so I can load the data into the PAA R package and apply batch adjustment. If I can't get explicit batch identifiers (though I think I can), I'll need algorithms to "discover" batch effects.

2. Assuming I was correct in organizing the 123 samples into 25 batches the way that I did, is it problematic to have batches of size 1 or 2? Is there a motivation for combining small batches with a nearby neighbor? For example, suppose one sample was scanned on Monday and 7 samples were scanned on Tuesday, the following day. Would it make more sense to treat them as Batch 1 = 1 sample and Batch 2 = 7 samples, or to put all 8 samples in one batch?

written 5.0 years ago by tasjfasfankihj
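One way to make question 2 concrete is to fold any batch below a minimum size into the batch nearest to it in scan date. A hedged Python sketch: merge_small_batches and the size threshold are hypothetical, not part of PAA or sva, and any merged grouping should still be sanity-checked with PCA or clustering:

```python
from datetime import date

def merge_small_batches(batches, min_size=3):
    """Merge any batch smaller than min_size into its nearest batch by date.

    batches: dict mapping a scan date to a list of sample IDs
    (the thread defines a batch as "samples sharing a scan date").
    """
    merged = {d: list(s) for d, s in batches.items()}
    changed = True
    while changed:
        changed = False
        for d in sorted(merged):
            if len(merged[d]) < min_size and len(merged) > 1:
                nearest = min((x for x in merged if x != d),
                              key=lambda x: abs((x - d).days))
                merged[nearest].extend(merged.pop(d))
                changed = True
                break  # restart: batch sizes have changed
    return merged

# The Monday/Tuesday example from the question: the lone Monday
# sample is folded into Tuesday's batch of 7.
b = {date(2010, 11, 8): ["s1"],
     date(2010, 11, 9): ["s2", "s3", "s4", "s5", "s6", "s7", "s8"]}
merged = merge_small_batches(b, min_size=2)
```

Whether merging by calendar proximity is biologically sensible depends on the experiment; it is only a starting point.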

I answered the question you should have asked, rather than the one you did ask :)

  1. The way you're doing it currently seems correct. Using HA20251 etc. as a batch identifier might work better, but the only way to know would be to contact the people who produced the data (or to cluster things as I suggested earlier).
  2. Batches of size 1 end up being useless. A batch of size 2 may be useful, depending on whether both of its members are from the same treatment group (it's better if they're not).
written 5.0 years ago by Devon Ryan

Good to know; that helped a lot, thank you!

written 5.0 years ago by tasjfasfankihj