Question

Forum: How to identify whether your analytic data is true?

0

6.7 years ago

Hughie ▴ 80

Hello!

Recently a problem has puzzled me: how do you know that the data you are analyzing is indeed what you wanted, and how do you prove it using bioinformatic methods? (In other words, how do you know your data is true?)

I have searched Google for a long time but can't find a proper answer, so I would appreciate any advice you can give me!

RNA-Seq ChIP-Seq next-gen-sequencing • 2.4k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 6.7 years ago by Hughie ▴ 80

2

I changed this to forum because it looks more like a discussion thread. I have the impression that you are referring to "correctness of annotation" like:

I was given data that was annotated as "mouse kidney" but the tubes got mixed up, maybe it is now "brain or liver". How do I find out?

or do you refer to "fabricated/fake data"?

or do you refer to "measurement error"?

ADD REPLY • link 6.7 years ago by Michael 54k

0

Thank you, Michael, for your kindness! I meant something closer to "measurement error". Do you know of any bioinformatic methods to assess it? Thank you again.

ADD REPLY • link 6.7 years ago by Hughie ▴ 80

0

As you didn't mention a specific type of experiment:

replication, replication, replication, ....
proper experimental design
rigorous quality control at all steps during the analysis
validation by independent experiments, e.g. validation of RNAseq DE genes by qPCR, variant calls by Sanger sequencing, etc.
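
As a minimal sketch of the qPCR validation point above: one simple check is whether the direction of change agrees between RNA-seq and qPCR for the genes measured by both. The gene names and numbers below are made up for illustration, and the conversion log2FC = -ΔΔCt assumes roughly 100% primer efficiency.

```python
def direction_agreement(rnaseq_lfc, qpcr_lfc):
    """Split genes measured by both methods into those whose log2 fold
    changes have the same sign and those whose signs differ."""
    shared = sorted(set(rnaseq_lfc) & set(qpcr_lfc))
    agree = [g for g in shared if rnaseq_lfc[g] * qpcr_lfc[g] > 0]
    disagree = [g for g in shared if g not in agree]
    return agree, disagree


# Hypothetical RNA-seq log2 fold changes (treated vs control).
rnaseq_lfc = {"geneA": 2.1, "geneB": -1.5, "geneC": 0.8, "geneD": 3.0}

# Hypothetical qPCR delta-delta-Ct values for the genes that were re-tested.
qpcr_ddct = {"geneA": -1.8, "geneB": 1.1, "geneC": 0.3}

# Assumption: ~100% amplification efficiency, so log2FC = -ddCt.
qpcr_lfc = {gene: -ddct for gene, ddct in qpcr_ddct.items()}

agree, disagree = direction_agreement(rnaseq_lfc, qpcr_lfc)
print("same direction:", agree)
print("opposite direction:", disagree)
```

Genes that land in the second list are the ones worth a closer look before trusting the RNA-seq result.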

ADD REPLY • link 6.7 years ago by Michael 54k

1

If your data exists in a computer and can be analysed by bioinformatics methods, it must be true, no?

Maybe if you really explain what you mean, you can get a useful answer. What kind of data are you talking about? Sequencing? Simulations? Real data? Hypothetical data?

What do you mean by "the data you are analyzing is indeed what you wanted"? No contamination? That the data reflects some experimental conditions?

What do you mean by "your data is true"? You think someone created some fake data?

ADD REPLY • link 6.7 years ago by h.mon 35k

0

Thank you, h.mon!

I'm sorry for my ambiguous explanation.

I mean: if I get experimental samples and sequence them, how can I be sure that the data are what I need, i.e. that they reflect the experimental conditions?

You can ignore "your data is true".

ADD REPLY • link 6.7 years ago by Hughie ▴ 80

score 2 · Answer 1 · 2017-08-11

2

6.7 years ago

venu 7.1k

I once received microarray gene expression data to analyse: 3 conditions, 9 samples, 3 replicates each. As soon as I received the data I normalized it and did simple hierarchical clustering with the 1k, 2k and 3k most variable probes. We had some prior idea of how the clusters should look because of the conditions (healthy, tumor, tumor treated). We were also given the order in which the samples were loaded onto the chip. But in the clusters, array position 6 and array position 9 were always swapped. We suspected that something was wrong, so we went back to the person who had loaded the samples onto the chip.

Guess what: the person who did the experiment had confused the numbers 6 and 9. Hence the suspicious result.

The point is, for microarray data it is a little easier to find this kind of flaw. But for sequencing data one should be careful from the beginning to guard against these mistakes. Again, how you find out whether the data you are analysing is the data you intended depends on what kind of sequencing data you are handling. If you have some prior knowledge of the samples (like sample-A has a P53 mutation, sample-B has a BRCA mutation), it would be easy to identify these mutations from WGS/WES and so validate/believe that the entire process went well. For ChIP-seq experiments, I would trust that everyone involved in the process is working carefully and go ahead with the analysis (which is what I am doing now).
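
The clustering check described above can be approximated with plain correlations. The sketch below is not the analysis from this post; it assumes a hypothetical chip layout (positions 1-3 healthy, 4-6 tumor, 7-9 treated) and toy expression values in which the contents of positions 6 and 9 were exchanged. It flags any array that correlates better, on average, with another condition than with the one on its sample sheet.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def flag_label_mismatches(profiles, labels):
    """Return {sample: best-matching condition} for every sample whose mean
    correlation with another condition's samples beats its own label."""
    flagged = {}
    conditions = sorted(set(labels.values()))
    for sample, profile in profiles.items():
        mean_corr = {}
        for cond in conditions:
            others = [s for s in profiles if s != sample and labels[s] == cond]
            if others:
                total = sum(pearson(profile, profiles[s]) for s in others)
                mean_corr[cond] = total / len(others)
        best = max(mean_corr, key=mean_corr.get)
        if best != labels[sample]:
            flagged[sample] = best
    return flagged


# Hypothetical expression signatures over six probes, one per condition.
signature = {
    "healthy": [9, 9, 1, 1, 1, 1],
    "tumor": [1, 1, 9, 9, 1, 1],
    "treated": [1, 1, 1, 1, 9, 9],
}
# What the sample sheet claims is on each array position (assumed layout).
sample_sheet = {1: "healthy", 2: "healthy", 3: "healthy",
                4: "tumor", 5: "tumor", 6: "tumor",
                7: "treated", 8: "treated", 9: "treated"}
# What was actually loaded: positions 6 and 9 exchanged.
actual = dict(sample_sheet)
actual[6], actual[9] = "treated", "tumor"

# Toy profiles: the true signature plus a small deterministic perturbation.
profiles = {
    pos: [v + 0.05 * ((pos * (i + 1)) % 5)
          for i, v in enumerate(signature[actual[pos]])]
    for pos in sample_sheet
}

flagged = flag_label_mismatches(profiles, sample_sheet)
print(flagged)  # -> {6: 'treated', 9: 'tumor'}
```

With real data you would restrict the profiles to the most variable probes first, as in the post, and treat a flag as a reason to go back to the bench records rather than as proof of a swap.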

ADD COMMENT • link 6.7 years ago by venu 7.1k

0

Exchange of sample labels has happened to me, not once, not twice but thrice, both with sequencing and microarray! It is easy to detect through clustering and correlations; however, if the mixed-up sample is completely unrelated, that might be a problem, depending on how close it is to the "true" samples.

ADD REPLY • link 6.7 years ago by Santosh Anand 5.7k

1

If the samples are sequenced in pairs, e.g. tumor and matched normal, one can use methods like NGSCheckMate to confirm that both samples are from the same patient.

if the mixed-up sample is completely unrelated, that might be a problem

o_O Hope this never happens to anyone given the cost of the experiment :)

ADD REPLY • link 6.7 years ago by venu 7.1k

0

Hi, what do you mean by mislabeling? What if I am not sure whether I mislabeled my sample, input and IgG? Could we figure it out easily? Thanks a lot!

ADD REPLY • link 6.4 years ago by hijack00 • 0

0

We also see sample labeling mistakes. We always check the sequencing data against genotype information to identify any mislabeled samples.
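
A minimal sketch of what such a genotype check could look like, assuming you already have genotype calls from the sequencing data and independent genotypes (e.g. from an array) for each donor. The SNP IDs, donor names and calls are all made up, and real tools use many more sites and handle missing or low-coverage calls.

```python
def concordance(calls_a, calls_b):
    """Fraction of shared SNPs with the same unordered genotype,
    or None if the two call sets share no SNP."""
    shared = [snp for snp in calls_a if snp in calls_b]
    if not shared:
        return None
    # sorted() makes "AG" and "GA" compare as the same genotype.
    matches = sum(1 for snp in shared
                  if sorted(calls_a[snp]) == sorted(calls_b[snp]))
    return matches / len(shared)


def best_match(seq_calls, reference_panel):
    """Return the donor whose reference genotypes agree best with the
    sequencing-derived calls, plus the score for every donor."""
    scores = {}
    for donor, genotypes in reference_panel.items():
        score = concordance(seq_calls, genotypes)
        if score is not None:
            scores[donor] = score
    return max(scores, key=scores.get), scores


# Hypothetical independent genotypes per donor.
reference_panel = {
    "donor1": {"snp1": "AA", "snp2": "AG", "snp3": "GG", "snp4": "CT"},
    "donor2": {"snp1": "AG", "snp2": "GG", "snp3": "AA", "snp4": "CC"},
}
# Hypothetical calls from a sequencing library that is labeled "donor1".
seq_calls = {"snp1": "GA", "snp2": "GG", "snp3": "AA", "snp4": "CT"}

best, scores = best_match(seq_calls, reference_panel)
print(best, scores)
```

Here the library labeled "donor1" matches donor2's genotypes far better than its own, which is the pattern you would expect from a label exchange.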

ADD REPLY • link 6.7 years ago by GouthamAtla 12k

0

Thank you, venu, for your kind example and suggestion!

ADD REPLY • link 6.7 years ago by Hughie ▴ 80

score 1 · Answer 2 · 2017-08-11

What is your definition of truth? If you are just worried about the samples being mislabeled/exchanged, then there are various means to diagnose it. Also, given that samples are run in replicates, if something odd happens to one or some of them, you might see it in various computational analyses because the odd one will stand out.

Then there is truth at the biological level: what if my sampling is not the right one to represent my biology? This is harder to detect, but not impossible. Experiments are usually run either to validate existing hypotheses or to generate further hypotheses to get deeper insight into the biology. In either case, you also have some idea of positive and negative controls. If you are working with neurons, you would normally expect genes related to neuronal function to show up in any kind of analysis. In contrast, you might not expect a lot of myocardial genes in neuronal experiments, but if you get them, you try to figure out their correlation. If you find new correlations, you've got new science (and a Nobel perhaps!); however, if it doesn't make any sense with the existing knowledge of your experiment, then, well, maybe your experiment was doomed.

Just do the things, and don't worry too much about theoretical questions like this one. If there is "true" biology in your samples, it will probably come out in your analysis. In contrast, if it is false, it will be detected if you know your hypotheses, biology and experiments well.
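
To make the positive/negative control idea concrete, here is a rough sketch that compares the expression of genes you expect in the tissue with genes you do not expect. The marker lists are only examples (SNAP25, SYP and RBFOX3 are commonly used neuronal markers; MYH6, TNNT2 and NPPA are cardiac genes), the expression values are invented, and the 2-fold cutoff is an arbitrary assumption.

```python
from statistics import median


def marker_check(expr, expected, unexpected, min_ratio=2.0):
    """Compare median expression of expected vs unexpected marker genes.
    Genes missing from expr are treated as not expressed (0.0)."""
    expected_median = median(expr.get(g, 0.0) for g in expected)
    unexpected_median = median(expr.get(g, 0.0) for g in unexpected)
    return {
        "expected_median": expected_median,
        "unexpected_median": unexpected_median,
        # Arbitrary rule of thumb: expected markers should clearly dominate.
        "looks_consistent": expected_median > min_ratio * unexpected_median,
    }


# Hypothetical expression values (e.g. TPM) for one "neuron" sample.
expr = {"SNAP25": 250.0, "SYP": 120.0, "RBFOX3": 80.0,
        "MYH6": 0.5, "TNNT2": 1.2, "NPPA": 0.0, "ACTB": 900.0}

neuronal_markers = ["SNAP25", "SYP", "RBFOX3"]
cardiac_markers = ["MYH6", "TNNT2", "NPPA"]

result = marker_check(expr, neuronal_markers, cardiac_markers)
print(result)
```

A sample that fails a check like this is not necessarily wrong, but it is a prompt to ask whether you are looking at contamination, a mix-up, or something genuinely new.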

score 1 · Answer 3 · 2017-08-11

I mean: if I get experimental samples and sequence them, how can I be sure that the data are what I need, i.e. that they reflect the experimental conditions?

That is very subjective. You can probably tackle the "assert" part if you know an independent truth (e.g. it is common for large projects to genotype the samples so that the sequencing results can be verified by calling SNPs if a question arises about mix-ups; I don't think this is done as a common practice otherwise).

The experimental part is going to be more tenuous. Sequence is invariant, as opposed to experimental conditions. Unless it was a specific type of sequencing (e.g. RNA-seq in response to some condition), it would be difficult to make a direct connection (one case I can think of is one or more mutations that could be confirmed by plain sequencing).