Question: What are the biggest challenges bioinformaticians have with data analysis?
11
6.4 years ago by
klemen170
Slovenia
klemen170 wrote:

Dear all, 

 

I am doing research among bioinformaticians, and I am interested in understanding your work, its challenges, and its opportunities.

 

So my question is, what are the challenges bioinformaticians have with data analysis?

 

Thank you in advance. 

Klemen

ADD COMMENTlink modified 6.3 years ago • written 6.4 years ago by klemen170
5

see What Do You Consider The Most Trivial And The Most Challenging Tasks In Your Particular Field Of Work?

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by Pierre Lindenbaum134k
3

Yes: this question basically asks "has anything altered since that previous question" and the answer is "no."

ADD REPLYlink written 6.4 years ago by Neilfws49k

Chris Miller's mention of data munging should probably be emphasized. Getting data to a useable state is often half the battle.

ADD REPLYlink written 6.4 years ago by Devon Ryan98k

see What Do You Waste Your Time On

ADD REPLYlink written 6.4 years ago by zx875410.0k
9
6.4 years ago by
geek_y11k
Barcelona
geek_y11k wrote:

Bioinformatics data analysis involves many things. A few from my experience:

1. Installing the software:

Sometimes a tool has many dependencies, which makes it hard to install. It takes a fair amount of time to figure out all the dependencies and get the tool working. Sometimes tools depend on a particular Linux distribution. Recently, I had to troubleshoot many issues installing the SURPI pipeline on CentOS, because it was originally tested only on Ubuntu.

2. Handling huge amounts of data:

If a tool does not provide multithreading options, it takes time to optimise the run yourself by splitting the work into separate jobs.

3. File format compatibility:

Some software expects files to be in a particular format. For example, the variant file (VCF) provided by Ensembl contains IUPAC letters (R, K, M, etc.) and some special characters like '~', which GATK does not accept. The VCF file needs multiple modifications before it works, and in the process we have to sacrifice some information.

4. Understanding statistical concepts:

It is very difficult to understand the core statistical concepts behind some algorithms. Most of the time, the statistical concepts are not documented properly. Without statistical knowledge it is difficult to choose a particular piece of software, as there are multiple tools available for the same purpose and new papers are always coming out.
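To make point 3 a bit more concrete, here is a minimal Python sketch of the kind of VCF clean-up described there. It is only an illustration under stated assumptions: the function name `split_vcf_lines` and the example records are made up, and the rule "keep a record only if every allele consists of A, C, G, T or N" is a deliberate simplification, not a statement of exactly what GATK accepts (it also rejects symbolic alleles, for instance). Rejected records are returned rather than dropped, so the information that is sacrificed can at least be inspected.

```python
import re

# Assumption: an allele counts as "plain" if it consists only of A, C, G, T or N.
# IUPAC ambiguity codes (R, K, M, ...) and characters like '~' fail this check.
PLAIN_ALLELE = re.compile(r"^[ACGTN]+$", re.IGNORECASE)


def split_vcf_lines(lines):
    """Split VCF lines into (kept, rejected).

    Header lines (starting with '#') are always kept. A data line is kept
    only if REF and every ALT allele match PLAIN_ALLELE; a missing ALT
    ('.') is ignored. Everything else goes into `rejected`.
    """
    kept, rejected = [], []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            # Too few columns to contain REF and ALT: treat as malformed.
            rejected.append(line)
            continue
        ref = fields[3]
        alts = [a for a in fields[4].split(",") if a != "."]
        if all(PLAIN_ALLELE.match(a) for a in [ref] + alts):
            kept.append(line)
        else:
            rejected.append(line)
    return kept, rejected


# Hypothetical records, for illustration only.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\trs1\tA\tG\t.\t.\t.\n",
    "1\t200\trs2\tA\tR\t.\t.\t.\n",    # IUPAC code in ALT
    "1\t300\trs3\tC\tT,~\t.\t.\t.\n",  # special character in ALT
]
kept, rejected = split_vcf_lines(example)
print(len(kept), "lines kept;", len(rejected), "records rejected")
```

Writing the rejected records to a separate file is one way to keep track of exactly what was lost in the conversion.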

 

ADD COMMENTlink modified 4.0 years ago • written 6.4 years ago by geek_y11k
8
6.4 years ago by
smilefreak420
New Zealand
smilefreak420 wrote:

Although this is not a technical concern, the biggest challenge I have faced when doing bioinformatics analyses at scale is communicating the results to biologists who are not well versed in bioinformatics.

Some people may be skeptical and some may not believe the results. Learning to communicate these big data analyses in an understandable way is, I believe, a key factor in this field. Of course you should always look to improve your technical chops in the aspects that matter for your kind of bioinformatics; this could include data management, visualisation, analysis and biology. However, always make sure to allocate adequate time to improving your communication skills.

ADD COMMENTlink modified 6.4 years ago • written 6.4 years ago by smilefreak420
5

I agree with this one.

Installing software, while it can be a pain, is certainly not my biggest challenge. It is harder than it should be in a lot of cases, but correctly defining the question, the analysis, and the steps to do it (by talking with biologists/clinicians) seems in (my) practice to be much more difficult, error-prone, and hard to get right, though I think that's part of the scientific process.

If the general sentiment of other bioinformaticians is that installing/using software is the hardest part, then I must be the outlier. Sure, I have a tough time installing / using some software, but (unless I'm forced to use those methods for a comparison paper) it's not the biggest challenge.

ADD REPLYlink written 6.4 years ago by brentp23k
2

Installing/using software is certainly not my biggest problem, but if you sum up across developers and the mass of ENDUSERS, the time wasted on installing and figuring out the right parameters is significant. IMHO, a general problem in bioinformatics is that developers haven't paid enough attention to endusers. Quite a few think too much about how to write better code but too little about how to make endusers enjoy their tools. Most endusers don't have decent hardware. They lack the skill set developers possess. Their opinions and preferences are actually not often reflected in this forum, where developers dominate. Outside this forum, there are many more endusers than developers.

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by lh332k
1

sounds like you agree with Pat Schloss: Pat Schloss of Mothur uses this

There are some new Python tools, e.g. Nuitka, that are supposed to be able to make a portable binary from a Python package that even includes the stdlib. Maybe that is a reasonable solution rather than starting from scratch every time.

I also think Docker images should mitigate these problems, but Docker itself is another thing to learn.

ADD REPLYlink written 6.3 years ago by brentp23k

I would add maintenance to user experience. I agree these are huge problems, but I think they are more symptomatic of publishing in biological journals and of how the biological sciences measure researchers. The current system puts pressure on researchers' time, and many are not willing to take on the extra work that these UX dreams require. Publication is often thought of as an end point, so getting a working application or result to publication is the goal.

That said, I think we are beginning to see a shift away from this, as journals are starting to notice that hard-to-use bespoke software tools, tied together in custom bash scripts by non-experts, do not bode well for reproducibility, and this will likely be reflected in citation counts.

Where I am from, a useful change would be that if you develop a user-friendly tool, at extra effort, and then maintain it, this counts as a measurable output in a biological science department (these departments generate many of the tools). I know many would argue that a good tool will garner many citations regardless, but in some cases it won't. Also, if you are the only tool on the block (especially in a specific area) then UX doesn't really matter, so it needs to be counted on CVs etc.

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by smilefreak420

I'd like to second that, because I think that's probably the biggest challenge bioinformaticians actually face - even though it's rarely admitted or even acknowledged by working bioinformaticians and biologists alike!
I think one should remember that the answers to the OP's question will not be 'what are the challenges bioinformaticians have with data analysis?' but rather 'what do you THINK are the challenges bioinformaticians have with data analysis?'.
"Big data" always comes up as a challenge, but in my opinion that is just a safe place to go when you are forced to think about what scares you. Big data isn't really a problem - if it was it wouldn't be such a vague term. Who cares if your awk or grep takes an extra 5 seconds, 5 minutes, or even 5 hours - if that's what it takes, that's what it takes. Go get a coffee.
So big data isn't really a problem - but confusing data is, and there's plenty of that!

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by John12k
7
6.4 years ago by
Skeletor90
Calgary, Alberta Canada
Skeletor90 wrote:

Installing software written by academics and getting it to work properly (-:

ADD COMMENTlink written 6.4 years ago by Skeletor90
5
6.3 years ago by
Wellcome Sanger Institute
Chrispin Chaguza260 wrote:

I think the biggest challenge comes from a lack of understanding of the theory or principles behind the tools being used. As a result, the most basic assumptions or necessary conditions that the data have to meet are ignored or violated, which can sometimes lead to false conclusions or inferences. It goes both ways, I think: computer scientists or statisticians developing the tools might not fully understand the biology behind the problem being addressed, and biologists might not fully understand the algorithms and assumptions behind the tools.

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Chrispin Chaguza260
3

I think this is a really good point; IMHO this greatly hinders progress.

We need to worry not when the tool does not work, but rather when it seems to work but in fact does something radically different that we don't realize - that leads to a great waste of human potential.

ADD REPLYlink written 6.3 years ago by Istvan Albert ♦♦ 86k
0
6.3 years ago by
klemen170
Slovenia
klemen170 wrote:

Thank you all for your responses and opinions. 

I have had several discussions with bioinformaticians over the last month. Based on those discussions and on what you have mentioned in the comments above, I have prepared a short survey with some questions regarding data analysis.

I would kindly ask you to answer a few simple questions. It will take you only 5 minutes, as the questions are really short and the possible answers very simple.

Please find the form here: http://bit.ly/1zvB2Mj

If you are interested in the final results of the survey, feel free to leave your e-mail in the form as well.

Thank you in advance for your time,

ADD COMMENTlink written 6.3 years ago by klemen170