What Is The Best Way To Find Common Transcription Factor Binding Sites From A Set Of Genes?
11.2 years ago • ugly.betty77 ★ 1.1k

Dear bioinformaticians, I am posting this question on behalf of another researcher, who needs help.

"Hello,

I have a set of co-expressed genes from the human genome and would like to find common transcription factor binding sites in them (or in a subset). My biological story is already written up and, based on that, I would like a certain set of genes to show up in the analysis. Therefore, I am thinking about the following strategy: I will try various online databases or services with all of my co-expressed genes, then pick and cite the one that shows the highest number of my preferred genes. Is that acceptable? How do reviewers verify predicted transcription factor binding sites, or do they accept the program and its claims at face value? I come from a psychology background and do not know any statistics or bioinformatics. Any help is welcome."

"

Edit. I am trying to learn how bioinformaticians handle this kind of 'scam' when they read or review a paper. For example, take the first suggestion of oPOSSUM or MEME. An author tries both programs and sees the 'expected' result with MEME. In his paper, he reports that MEME gave him a motif covering a certain short list of 'expected' genes and ignores the oPOSSUM result. The paper will look more sophisticated in terms of bioinformatic analysis than one whose author did not look for promoter binding sites at all. Given that we have so many published software programs for every stage of analysis, an unethical user can bias each step to arrive at the 'right' biological result and publish in a top journal. How do you handle such issues? Based on what I have experienced so far, most (bioinformatics) reviewers are happy if the paper speaks the right statistical/bioinformatic lingo, and they leave the biological or medical part to the 'biologist expert'. With so many tools out there, isn't there room for huge subjective bias in the whole process? What are the rules for evaluating the judgement of the expert biologists? How do we know that an entire subfield is not being biased by the opinions of a few experts?

On the other hand, we do not (and probably cannot) require each author to use every software tool and report all results, and then ask the expert biologist to evaluate all the options. That would require the biologist to learn and understand the algorithmic differences between programs, which is nearly impossible. Neither can we require the biologist to check each selected gene in the lab before saying anything about the experiment. Moreover, with the biologist typically in control of the grant, and thus the entire process, the bioinformatician has little room to do things differently or voice an opinion.

Under those considerations, how do we make sure that an entire subfield is not being created to 'defraud' the larger scientific community?

Among the various types of popular programs, (a) TF binding site prediction software, (b) miRNA target prediction software, and (c) analyses of genes under positive selection often appear, in my opinion, to be biased.

transcription-factor binding subjective • 9.9k views

The quotes indicate that you are citing someone verbatim. Yet I have a hard time imagining any scientist stating the above.


I'm sure that will all come out in the blog.


Istvan, what I wrote is based on a paper that I just went through. It appears very sound and sophisticated in terms of its bioinformatic analysis steps and statistical jargon, but I cannot be sure that the authors did not shop around for the promoter binding package that gave the 'right' answer. How do you judge the validity of that particular step? I often have similar questions about (a) miRNA target prediction programs and (b) positive selection analyses, but most reviewers would rather see the calculation done than not done, and papers look sophisticated with those bioinformatics blocks in place. However, the biological description always seems subjective ("Our set of 97 genes includes gene X, previously known to be related to aging. Therefore our analysis tells the truth."), and yet the bioinformatics behind it comes from a well-cited paper and hence looks technically sound. Still, there is huge room for subjectivity. What are the criteria for evaluating such papers?

On a similar note, a few weeks back I attended a talk at the University of Washington by a professor who is setting up their cancer detection pipeline. He described a large number of alignment programs, GATK, etc., and mentioned that they use very strict statistical cutoffs to narrow the calls down to a handful (100-200) of variants. He then mentioned that this small set of variants is reviewed by a panel of three senior cancer biologists, including Mary Claire King, to make sure everything checks out. To me, that last step looks like the place for huge human bias, yet most bioinformaticians I know operate at the beck and call of some biologist or medical doctor. How do you judge the validity of papers then?


One problem is that the computational steps are also rife with bias, from the choice of parameters to the order in which various operations took place, the order in which samples were merged and normalized, etc. So having human oversight is not necessarily bad; what that human actually does matters more.


Hi, the tags you choose should reflect the content of your question, not your background. Or are you trying to pull a scam? The intention of the question this person is asking is not very clear to me (edit: it is indeed very clear; I call it fraud). Are you still playing mind games on us, or are you still trying to test us? Are you implicitly trying to post something "provocative" as a criticism of current practices, or are you really trying to get support for forging facts? Or is it a joke? Please help me out. Regards.


Hello Michael,

I apologize if my post came across differently, but I am trying to learn how bioinformaticians handle this kind of 'scam' when they read or review a paper. For example, take the previous suggestion of oPOSSUM or MEME. Let us say an author tries both programs and sees the 'expected' result with MEME. In his paper, he reports that MEME gave him a motif covering a certain short list of 'expected' genes and ignores the oPOSSUM result. The paper will look more sophisticated in terms of bioinformatic analysis than one whose author did not look for promoter binding sites at all. Given that we have so many published software programs for every stage of analysis, an unethical user can bias each step to arrive at the 'right' biological result and publish in a top journal. How do you handle such issues? Based on what I have experienced so far, most (bioinformatics) reviewers are happy if a paper speaks the right statistical/bioinformatic lingo and leaves the biological or medical part to the 'expert'. With so many tools out there, isn't there room for huge subjective bias in the whole process? What are the rules for evaluating the judgement of the expert biologists?


I understand your motivation quite well now, and I see the point. But why didn't you ask this question directly, instead of making up a story around it that just looked like an attempt to be provocative (and failed)? I am very much a supporter of asking directly and on topic (without irony, catchy stories, and the like), and I am convinced this suits the format of BioStar best, even if it turns out to be more boring.


If you allow me, I can replace my original question with the paragraph above. There was no specific reason for asking one way rather than another; I thought the way I chose to pose the question was interesting, and that with appropriate tags and quotes the intention would be fairly clear.

11.2 years ago • KCC ★ 4.1k

Deciding what answer you want before you do the analysis and trying different analyses until you get that answer is not a very solid way of going about science, but I am sure the original poster knows and agrees with this. In particular, it undermines the reliability of the statistics, and dramatically increases the chances that the results will not be replicable.
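
As a rough illustration of why the statistics break down, here is a minimal simulation sketch (all numbers are synthetic and no particular tool is assumed): if an analyst tries several analyses on data that contains no real signal and reports only the one that 'worked', nominally significant results show up far more often than the advertised 5%.

```python
# Minimal sketch: selective reporting on pure noise.
# Each "study" tries k_analyses analyses on null data and reports only
# the smallest p-value; the false-positive rate inflates from 5% to
# roughly 1 - 0.95**k_analyses.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies = 1000    # simulated studies, each on data with no real effect
k_analyses = 10     # analyses tried (but only the "best" one reported)
hits = 0

for _ in range(n_studies):
    pvals = []
    for _ in range(k_analyses):
        # each "analysis": a t-test on two groups drawn from the same distribution
        a = rng.normal(size=20)
        b = rng.normal(size=20)
        pvals.append(stats.ttest_ind(a, b).pvalue)
    if min(pvals) < 0.05:   # only the analysis that "worked" gets published
        hits += 1

print(f"Fraction of null studies reporting a significant result: {hits / n_studies:.2f}")
# prints roughly 0.40 rather than the nominal 0.05
```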

I would think it is close to impossible for a reviewer to know how many analyses a bioinformatician performed, but did not publish, in order to get the result that he or she finally ended up publishing.

However, as I said, the cost is the high risk that the results won't be replicable, which hurts the scientist in the long run.

I should say that it is quite common to follow a softer version of the above strategy. The researcher has an idea in their head that the biological conclusions should match up with previous studies, and they keep trying analyses until the results look like those previous studies. Only then do they publish. This gets around the problem of the research turning out not to be replicable, because other studies have already seen the same things, but it can also lead to a higher probability of incorrect scientific conclusions.

EDIT:

In light of the change in your question, I wanted to add something to my answer. I have no idea how to determine how many analyses a person did before publishing their paper. I usually assume they have tried quite a few things, that much of it did not work out, and that they eventually found something that seemed publication quality. I have had discussions with statistician friends about whether one should adjust one's p-values (for instance) based on the number of tests one ran but did not publish. There are definitely people who have thought about these problems in a technical sense, and I think you will find papers on this in the statistics literature. In practice, however, I think almost nobody goes beyond accounting for the specific analyses they publish.
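
To make that concrete, here is a minimal sketch with hypothetical counts and p-values: once the tests that were run but never reported are counted as well, even as conservative non-significant placeholders, the published results may no longer survive a standard multiple-testing correction.

```python
# Hypothetical example: re-adjust published p-values once the tests that
# were run but never reported are counted as well.
import numpy as np
from statsmodels.stats.multitest import multipletests

published_pvals = np.array([0.003, 0.012, 0.04])  # the three results in the paper (made up)
n_unpublished = 17                                 # assumed number of unreported tests

# Treat each unreported test as a non-significant placeholder (p = 1.0).
all_pvals = np.concatenate([published_pvals, np.ones(n_unpublished)])

reject, adjusted, _, _ = multipletests(all_pvals, alpha=0.05, method="fdr_bh")
print(adjusted[:3])  # Benjamini-Hochberg adjusted p-values for the published tests
print(reject[:3])    # none survive at FDR 0.05 once all 20 tests are counted
```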

If you have a group of papers, you can apply techniques from meta-analysis. For example, you can make what is called a funnel plot, which is designed to detect publication bias, i.e. the tendency to leave out analyses that did not work out. I am by no means an expert, but you might find some of the answers you are looking for by examining the meta-analysis literature.
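
As a sketch of what a funnel plot involves (the effect sizes, standard errors, and 'publication' rule below are all invented for illustration): each point is one study, plotted as observed effect size against precision, and a hollowed-out corner of small, non-significant studies suggests publication bias.

```python
# Funnel-plot sketch on synthetic studies (all numbers are made up).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
true_effect = 0.3
se = rng.uniform(0.05, 0.5, size=200)      # per-study standard errors
effects = rng.normal(true_effect, se)      # observed effect sizes

# Crude publication-bias rule: small studies only get "published" if significant.
published = (effects / se > 1.96) | (se < 0.15)

plt.scatter(effects[published], 1 / se[published], s=12)
plt.axvline(true_effect, linestyle="--")
plt.xlabel("Observed effect size")
plt.ylabel("Precision (1 / standard error)")
plt.title("Funnel plot, synthetic example")
plt.show()
```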

Finally, there is the question of which school of statistics you follow: Fisher, Neyman-Pearson, Bayesian, or some hybrid. The core question is: when is it okay to exclude analyses based on prior knowledge? Subject-matter knowledge ought to help us exclude false positives, and hopefully most bioinformaticians do have some form of biological 'knowledge'. So it ends up being a judgement call. Unfortunately, discussing this topic further is above my pay grade, so I will end here.


Removed the previous answer based on your modification. Thank you for the funnel plot idea. I have been going through a set of papers, and that approach may be handy.

11.2 years ago • Ming Tommy Tang ★ 4.3k

Have a look at these two tools: Cscan (http://159.149.160.51/cscan/) and the ENCODE ChIP-Seq Significance Tool (http://encodeqt.stanford.edu/hyper/).

You only need to feed them your gene IDs; they are very easy to use.
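
Under the hood, tools of this kind essentially test whether your gene list overlaps a transcription factor's ChIP-seq target set more often than expected by chance. A minimal sketch of that idea in Python (the counts below are invented, and the exact statistics Cscan or the ENCODE tool use may differ):

```python
# Hypothetical enrichment test: is a co-expressed gene list enriched for
# the targets of one transcription factor (defined by ChIP-seq peaks)?
from scipy.stats import hypergeom

background = 20000   # genes in the genome-wide background (assumed)
tf_targets = 1500    # genes bound by the TF in ChIP-seq data (assumed)
gene_list = 200      # size of the co-expressed gene list
overlap = 45         # co-expressed genes that are also TF targets

# P(at least `overlap` targets in a random sample of `gene_list` genes)
p_enrich = hypergeom.sf(overlap - 1, background, tf_targets, gene_list)
print(f"Enrichment p-value: {p_enrich:.3g}")

# When scanning hundreds of TFs, remember to correct for multiple testing
# (e.g. Bonferroni or Benjamini-Hochberg) before calling anything enriched.
```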

ADD COMMENT
0
Entering edit mode

Thanks. It looks like they handle the human genome and let you adjust the upstream region. Very nice!

11.2 years ago

Ah, to be new to biology again... :)

"In his paper, he reports that MEME gave him a motif with certain short list of 'expected' genes and ignores the oPOSSUM result....With so many tools out there, isn't there room for huge subjective bias in the whole process? "

It's impossible to figure out how many analyses someone actually did. This is essentially a variant of the "file drawer effect", wherein only positive results are ever published. If you asked the author, he/she might simply respond that MEME's assumptions more accurately modeled his/her dataset, which could even be true (though most biologists would have no clue how to even begin to check this). You should never fully believe any single paper, or really even a set of papers from a single lab or using only a single method. This is true of biology in general and not unique to bioinformatics. Always start by assuming that everything published is wrong.

BTW, conclusions based purely on prediction should always be read as, "...something is predicted by some black box software package that we probably don't understand...so yeah, maybe that means something." If I read a paper that predicts transcription factor binding and then doesn't do any wet-bench (sequencing is not wet-bench in this example) experimental follow-up, likely involving conditional knock-out mice, then I read it more like an op-ed in the newspaper.

As the bioinformatician, part of your job is to translate what the biologists ask you to do into what they want (or should want) you to do. Since you don't have a background in biology, that'll take a lot of back-and-forth with the wet-lab guys (if you've ever done any programming for a business client, this back-and-forth will sound familiar).

"Under those considerations, how do we make sure that an entire subfield is not being created to 'defraud' the larger scientific community?"

Always remember that destroying someone else's career can advance yours. Basically, the culture will generally protect against these sorts of things. Beyond that, a paper's only strength is in the totality of its experiments, all with different biases, pointing toward the same conclusion. When multiple competing groups use competing methods and still come to the same conclusion, all the better.

Also, "defraud" is the wrong word, as it assumes that the people involved know that they're wrong. Most of the published literature is wrong in some way and for completely legitimate reasons.
