Question

Help a beginner NGS data analyst to face the programming interview

3

9.9 years ago

venu 7.1k

Dear all,

After spending a lot of time searching threads related to interviews, I ended up asking this question. I am going to face an interview for an NGS data analyst position in which the panel will ask me to write programs. From an initial discussion with the panel, I understood that their work is mostly related to RNA-seq, and they specifically mentioned that the duties would be QA/QC and analysis of raw sequence data. I am wondering what kind of programming-related questions they might ask.

My background is completely different (pharmacoinformatics), but I am fascinated by the genomics field. I am a beginner in NGS and data analysis. I have a good amount of experience with Perl and UNIX. I have acquired some basic knowledge of NGS from Biostars and the literature. I am also solving problems on ROSALIND.

It would be very helpful if you could share your views on 'What kind of programming-related questions would you ask a beginner NGS data analyst?' if you were on the panel.

programming Interview • 8.6k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 9.9 years ago by venu 7.1k

Ram · Answer 1 · 2015-08-24

Bioinformatics is a really broad field. I don't expect candidates to know the deep details of various algorithms, or to know anything highly specific. Rather, my goal is to see how the applicant approaches computational problem-solving and to verify that they have basic competence in the skills they advertise.

I always ask the first two questions:

1) Take a simple paragraph of standard English and represent it as a data structure in your chosen markup or programming language. I always include things which can be represented as arrays of multi-property objects. I then ask follow-up questions about how the candidate would query that data structure and see if they change their mind (it's fine--usually a plus--if they do change their mind). I then show my method and ask them to assess theirs against it. It's not a difficult problem for an experienced computational problem-solver and there's no correct or incorrect answer--I just want to see if they panic and how they reason through the problem at hand.

2) I give a text-munging question, since that's a big part of day-to-day drudgery. Use your favorite language to open a CSV file and restructure each row. I'm looking to see if the user can take a list of multiple simple requirements and write decent code from it.

3) JavaScript is the biggest charlatan magnet in the programming world today. It's a sexy language and it makes me mad that so many people who don't even know what an anonymous function is defile JS by claiming proficiency in it. If someone puts JavaScript on their resume I ask them a question about JS objects. Basically I show them some code and say "what's the output?" The cleverest of the charlatans will know to execute the given code in a browser console and will provide the correct answer. But then the fun comes when I ask "why do you get that answer?"
   - Adept JavaScripters will quickly know the answer to this, even if they jumped the gun and answered the output question incorrectly at first.
   - People who have written code at just a basic level in JavaScript will have a good time reasoning through it and reaching that "Ah ha!" moment. Good enough for me.
   - Charlatans will give some hilarious answers, or just get downright combative.

4) For people who advertise MySQL or other databases, I ask them to explain in everyday words how they would structure a relational database to solve a simple need. I'm looking to see if the user truly understands the basics of relational databases or if they've merely run a few SELECT queries at some point in their lives.
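
As a rough illustration of the text-munging question, here is a minimal Python sketch. The answer does not say which requirements are given to candidates, so the three requirements, the column names (`sample`, `lane_run`, `reads`) and the function name `restructure_rows` are all invented for this example.

```python
import csv
import io


def restructure_rows(src, dst):
    """Read CSV rows from src and write restructured rows to dst.

    Hypothetical requirements, invented for this sketch:
      1. keep only rows whose 'reads' value is at least 1000
      2. upper-case the 'sample' column
      3. split 'lane_run' (e.g. 'L1_R7') into separate 'lane' and 'run' columns
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(
        dst,
        fieldnames=["sample", "lane", "run", "reads"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        reads = int(row["reads"])
        if reads < 1000:
            continue  # requirement 1: drop low-count rows
        lane, run = row["lane_run"].split("_", 1)  # requirement 3
        writer.writerow({
            "sample": row["sample"].strip().upper(),  # requirement 2
            "lane": lane,
            "run": run,
            "reads": reads,
        })


if __name__ == "__main__":
    # In-memory stand-in for a real file opened with open(path, newline="")
    raw = "sample,lane_run,reads\nctrl_a,L1_R7,15000\nctrl_b,L2_R7,400\n"
    out = io.StringIO()
    restructure_rows(io.StringIO(raw), out)
    print(out.getvalue(), end="")
```

Using the `csv` module instead of splitting on commas by hand means quoted fields that contain commas are still parsed correctly, which is the kind of detail such a question tends to probe.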

You might (or might not) be shocked at how many candidates who are strong on paper completely crumble when faced with the above. I've had other candidates sail right through it, which tells me I'm not being too harsh with the questions.
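
For the first question above, one possible shape of an answer is an array of multi-property objects, sketched below in Python as a list of dictionaries. This is not the method the answer's author shows to candidates (that is not given in the thread). The paragraph, the field names and the helper functions `names_where` and `oldest` are invented for illustration.

```python
# Hypothetical paragraph, invented for this sketch:
# "Alice is a 34-year-old chemist in Boston. Bob is a 29-year-old
#  statistician in Leeds. Carol is a 41-year-old chemist in Leeds."
people = [
    {"name": "Alice", "age": 34, "job": "chemist", "city": "Boston"},
    {"name": "Bob", "age": 29, "job": "statistician", "city": "Leeds"},
    {"name": "Carol", "age": 41, "job": "chemist", "city": "Leeds"},
]


def names_where(records, **conditions):
    """Return names of records whose fields equal every given condition."""
    return [
        r["name"]
        for r in records
        if all(r.get(key) == value for key, value in conditions.items())
    ]


def oldest(records):
    """Return the record with the largest 'age' value."""
    return max(records, key=lambda r: r["age"])


if __name__ == "__main__":
    print(names_where(people, job="chemist"))
    print(names_where(people, city="Leeds", job="chemist"))
    print(oldest(people)["name"])
```

A follow-up such as "who works in Leeds as a chemist?" then becomes a filter over the list. That kind of query is a natural point for a candidate to reconsider whether a flat list or a nested structure suits the paragraph better.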

Ram · Answer 2 · 2015-08-24

2

9.9 years ago

Gjain 5.8k

Hi Venu,

I personally think that, more than these questions, you need more experience with NGS projects. A good way to get hands-on experience is to do one of the specializations on Coursera.

Specialization: Genomic Data Science
First class starting: September 7th, 2015
Link: https://www.coursera.org/specialization/genomics/41
Courses:
- Introduction to Genomic Technologies
- Genomic Data Science with Galaxy
- Python for Genomic Data Science
- Command Line Tools for Genomic Data Science
- Algorithms for DNA Sequencing
- Bioconductor for Genomic Data Science
- Statistics for Genomic Data Science

This is a slightly different answer to your question but I think this will help you develop your skills in this field if that is what you are looking for.

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 9.9 years ago by Gjain 5.8k

0

Thank you Gjain. I will definitely start a course. I thought that getting to know the basic problems encountered during the data analysis workflow, the ones that can be solved through simple scripting or UNIX pipes, would help me in my interview assessment.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.9 years ago by venu 7.1k

1

That is the main idea behind these courses, as they have projects and practical examples based on the questions you encounter in day-to-day life in a lab. Good luck with the interview.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.9 years ago by Gjain 5.8k

0

Thank you :)

ADD REPLY • link 9.9 years ago by venu 7.1k

Ram · Answer 3 · 2015-08-24

2

9.9 years ago

John 13k

There are two distinct classes of 'programming in NGS data analysis':

1) To make the scripts/programs that glue together the inputs and outputs of other programs. Occasionally with a bit of mild novel analysis thrown in, but barely anything beyond basic aggregate functions (min, max, sum, mean, stddev, etc).
2) To write novel statistical analyses that outperform, based on some arbitrary metric, other programs in the field. Alignment, prediction, modelling, clustering, etc.

When I started doing NGS analysis, I thought 2) was a lot more impressive than 1). These days I'm not so sure. Whilst 2) requires the skills you might expect in a typical programming analysis job (understanding of data structures, algorithms, optimisation, statistics/mathematics, etc), you have the luxury to define your own formats, standards, etc - and everything is somewhat under your control.

Number 1) however is a different kind of programming entirely. It's about understanding the programs you glue together like they were your own children - knowing their strengths and weaknesses, their quirks, their bugs, their parameters, their wacky file formats, and working under what seems like constant uncertainty. To do it as a profession where you have deadlines to meet and multiple projects to juggle is an art form. Some people are very good at it, whilst others - who are fantastic programmers in their own right with hundreds of GitHub projects on the go - are not.

So to answer your question about questions with a question - what kind of job is it?

For a Type 1) job, I wouldn't really care if you knew the difference between a bubble sort, a radix sort and a merge sort. I would be a lot happier if you knew how to use unix sort(1) to sort the first column of a CSV file numerically, then the second alphabetically, using 10 CPU cores and the SSD for temporary space even though the file you are sorting is on a remote partition.
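
The sort task described above can be sketched as follows, assuming GNU coreutils `sort`. The sketch only builds the command line in Python, so the options can be read one by one. The file path, the temporary directory `/mnt/ssd/tmp` and both function names are hypothetical. The in-memory function is just a small-file illustration of the intended ordering, not a replacement for `sort` on large data.

```python
def build_sort_command(path, cores=10, tmp_dir="/mnt/ssd/tmp"):
    """Build a GNU sort argv: column 1 numeric, then column 2 alphabetical.

    The default tmp_dir is a hypothetical SSD mount point.
    """
    return [
        "sort",
        "-t", ",",              # fields are comma-separated
        "-k1,1n",               # primary key: first column, numeric
        "-k2,2",                # secondary key: second column, as text
        f"--parallel={cores}",  # number of sorts run concurrently
        "-T", tmp_dir,          # write temporary files here, not next to the input
        path,
    ]


def sort_rows_in_memory(lines):
    """Small-file illustration of the same ordering, done in Python."""
    rows = [line.rstrip("\n").split(",") for line in lines]
    rows.sort(key=lambda r: (float(r[0]), r[1]))
    return [",".join(r) for r in rows]


if __name__ == "__main__":
    print(" ".join(build_sort_command("/remote/data/table.csv")))
    print(sort_rows_in_memory(["10,b\n", "2,z\n", "10,a\n"]))
```

The resulting list can be passed to `subprocess.run`. Two caveats apply: `-t,` does not understand quoted fields that contain commas, and the text ordering of `-k2,2` depends on the locale (setting `LC_ALL=C` gives plain byte order).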

Having said that, this very practical view of interview questions is not always held by interviewers. I've heard stories of people not getting a Bioinformatics job because they couldn't distinguish an Exponential distribution from a Pareto distribution...!!

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 9.9 years ago by John 13k

1

Thank you John. Based on the initial discussion, I would say it is more like type 1 in your answer, but I also noticed 'design and operation of massively parallel DNA sequencing data analysis pipelines' in the information sheet they provided.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.9 years ago by venu 7.1k

2

In that case, between now and your interview I would spend half my time learning how bash/awk/grep/sort/sed/head/tail/diff work - somewhat in depth - and the other half using tools like samtools, bwa, tophat, picard, and all the other famous bioinformatics programs. Don't get me wrong, learning "Algorithms for DNA Sequencing" is good fundamental knowledge which might come up in a lab meeting or something, but I think your time would be better served going in with the documentation for those programs in your head, and some experience using them under your belt. It doesn't have to be much - just enough to show that you've experienced it and you know what you are getting into :)

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.9 years ago by John 13k

1

Thank you very much. I will follow this approach, but I also want to know the basic problems we encounter during the data analysis workflow. Solving these problems on my own will increase my confidence and show me the real problems in the analysis. I would be grateful if you could direct me to any blog that discusses the data analysis workflow and its problems and that a beginner should read.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.9 years ago by venu 7.1k

1

Hm - this is a tough question because the problems are rarely presented in a tutorial workflow :P

Solving 'real problems' in analysis is what Biostars is all about - perhaps a good method would be to try out a tool and get it to work in a workflow as part of your Coursera studies, and then once you think you understand the tool, type the name of the tool into Biostars to see just how much pain it can cause you in a real analysis ;) hehe

Alternatively - and this is a bit of shameless self-publication here - I'm just about to publish on Biostars (in the next day or two) a tool which can be used to publicly log command-line-based workflows. It's got an open-source/transparency/community vibe to it, but the basic idea is to get people sharing their workflows and the whole bioinformatic process publicly, without running into ethical/legal problems of sharing data. You can view the public workflow database at http://log.bio, but since there's only 1 command there right now, and that's likely to be spontaneously deleted, I wouldn't worry about it now. Maybe check it out a week from now to see what people are doing :)

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.9 years ago by John 13k

0

+1 for your suggestion on 'real problems'. This helps me a lot. And log.bio seems to help people (like me & other beginners) a lot. Thank you very much.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.9 years ago by venu 7.1k