Question

How to ensure your data is truly labeled?

0


6.1 years ago

Hughie ▴ 30

Hi everyone,
I'm a beginner in bioinformatics, and with so little experience I can't always think things through comprehensively. Here is a question that has been puzzling me a lot:

1. When you analyze data labeled as WT/KO, you follow the standard workflow. But how do you know the data is correctly labeled? (Maybe someone mixed up the samples, or you made a mistake while renaming files.)
2. Beyond that, how do you verify that your data is really ChIP-seq/RNA-seq/... data?
3. Or, within ChIP-seq data, how do you know it's really H3K4me1/H3K27me3/H3K27ac, etc.?

At present, I have the following thoughts (numbered to match the questions above):

1. Use your replicates, or download similar data, and do PCA or clustering.
2. Check the read coverage along the genome (this idea may be a little naive).
3. There are published profiles for typical marks, so we can compare our data against them.
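As a rough sketch of the last idea above (comparing against published profiles): bin the read coverage over the same genomic windows for your sample and for each reference, then see which reference correlates best. Everything here is an assumption for illustration: the function names `pearson` and `best_matching_reference` are hypothetical, the numbers are made-up toy values rather than real coverage, and in practice the binned signal would come from whatever coverage tool you already use.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length lists of binned signal."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_a * var_b)

def best_matching_reference(sample, references):
    """Return the best-correlated reference name and all correlation scores."""
    scores = {name: pearson(sample, profile) for name, profile in references.items()}
    return max(scores, key=scores.get), scores

# Toy binned coverage over the same 8 windows (made-up values, not real data).
references = {
    "H3K4me1": [2, 9, 8, 1, 1, 7, 2, 1],
    "H3K27me3": [6, 1, 1, 7, 8, 1, 6, 7],
}
sample = [3, 10, 7, 2, 1, 8, 1, 2]  # a sample that arrived labeled "H3K4me1"

best, scores = best_matching_reference(sample, references)
print(best, {name: round(r, 2) for name, r in scores.items()})
```

If the best-matching reference disagrees with the label on the sample, that is a reason to go back and check, not proof of a mix-up; cell type and antibody differences can also lower the correlation.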

But as we know, things may be worse than that. So, can you think of more ideas? Or, if you have done similar checks, can you share your experience?
Many thanks for your attention and suggestions!

experience data analysis • 932 views

updated 5.8 years ago by swbarnes2 14k • written 6.1 years ago by Hughie ▴ 30

1


Here is another related discussion: Estimating cross contamination in a set of BAMS

5.8 years ago by igor 13k

1


5.8 years ago

swbarnes2 14k

You probably can't ensure that every person who handled the sample before you was 100% accurate in labeling. All you can do is try your best not to mix up sample names at your end, and see whether there is a sanity check you can run to confirm that things look approximately like they should. If you aren't an expert in the biology you are analyzing, the person you are returning data to should be, and should tell you if the data looks really weird.

You are responsible for your own work; you really can't be expected to catch 100% of other people's errors.


score 3 · Accepted Answer · 2018-07-03

1. At some point you should be doing a PCA or other clustering, and at that point it should become obvious if some samples were swapped or if something else odd is going on.
2. Always look at your data; it should then be obvious. Having said that, you should always know what you sequenced.
3. See above.

You should always have at least a vague expectation of what your data should look like. You then need to check whether your assumptions match what the data really looks like, and re-evaluate things if they don't.
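To make the PCA check concrete, here is a minimal sketch, assuming NumPy is available and that you have a samples-by-genes matrix of log-scale expression values. The helper names `pca_scores` and `flag_suspect_samples` are hypothetical, the data is synthetic, and the "closer to the other group's centroid" rule is just one simple heuristic, not a standard method; in practice you would mostly just plot PC1 against PC2 and look.

```python
import numpy as np

def pca_scores(matrix, n_components=2):
    """Project samples (rows) onto their top principal components via SVD."""
    x = np.asarray(matrix, dtype=float)
    x = x - x.mean(axis=0)  # center each gene across samples
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def flag_suspect_samples(scores, labels):
    """Return indices of samples that sit closer to another group's centroid."""
    labels = np.asarray(labels)
    groups = sorted(set(labels.tolist()))
    suspects = []
    for i, label in enumerate(labels):
        others = np.arange(len(labels)) != i  # leave sample i out of the centroids
        dist = {}
        for g in groups:
            members = others & (labels == g)
            if members.any():
                dist[g] = np.linalg.norm(scores[i] - scores[members].mean(axis=0))
        if min(dist, key=dist.get) != label:
            suspects.append(i)
    return suspects

# Synthetic data: 8 samples x 50 genes (for illustration only).
rng = np.random.default_rng(0)
wt = rng.normal(5.0, 0.3, size=(4, 50))
ko = rng.normal(5.0, 0.3, size=(4, 50))
ko[:, :10] += 3.0  # the true KO samples differ in the first 10 genes
expr = np.vstack([wt, ko])

# Labels as received: samples 3 and 4 have been swapped on purpose.
labels = ["WT", "WT", "WT", "KO", "WT", "KO", "KO", "KO"]

scores = pca_scores(expr)
suspects = flag_suspect_samples(scores, labels)
print(suspects)
```

On this toy data the two deliberately swapped samples (indices 3 and 4) are the ones flagged. With real data the groups are rarely this well separated, so treat a flagged sample as something to look at more closely rather than as a confirmed swap.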