Reproduce the article "The complete sequence of a human genome."
4
5
5 months ago
sqshigg ▴ 60

My mentor instructed me to reproduce the article "The complete sequence of a human genome." As a novice in the field of bioinformatics with limited knowledge, how should I proceed?

genome assembly • 3.6k views
ADD COMMENT
10

No. This is not how this goes. Tell your mentor to guide you. This is a challenging analysis that deals with big data and requires a lot of theoretical knowledge of the underlying principles (assembly, genomics, coding, workflow management, IT infrastructure) that you have to know upfront. The T2T assembly was put together by a large consortium of experts. Just saying "here, take it and reproduce" is really not sufficient. This is like saying to an untrained couch potato "hey, run a marathon under 2h, this dude from Kenya just did it the other year, so you can, ask the others how to train". If the mentor is not willing or able to provide instructions and hands-on guidance, then you had better consider finding a mentor who can actually help develop your skills.

ADD REPLY
0

He said he would give me six months to study the relevant materials. He also mentioned that the supplementary methods section in the articles contains the related data, software, and methodologies. How should I get started?

ADD REPLY
6

Change supervisor, seriously. This is ridiculous and unacceptable. If what you say is true, then they're not supervising, just commanding. There is no value in that for your personal development.

ADD REPLY
6

Lol, what's next? Is your mentor going to ask you to reproduce the Manhattan Project?

ADD REPLY
1

I am very frustrated.

ADD REPLY
2

That's very understandable and not at all your fault. A lot of faculty teaching at the university level have received very little formal training in didactics.

Learning by frustration or degrading your personality is not a good approach. Your future mentor (not necessarily the same person) and you must sit down together and set up a curriculum (asking: What do I know already and what do I need to know/do?). Then, break down the tasks into manageable bits and pieces. This includes making a realistic progress plan with precise milestones to measure progress units, going from simple basic tasks to more complicated and complex tasks. Think of the whole process as similar to learning a musical instrument.

Always take the size of datasets into account, and find suitable small practice datasets to work on. The data size you use in the beginning should be appropriate in relation to your computing resources available. This is particularly important when learning and trying things out so that when you make a mistake, it doesn't take forever to repeat a step. You can scale up your process later once you understand what you are doing, why and how it works in principle.
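One quick way to get such a practice dataset is to keep only the first few thousand reads of a larger file. The sketch below is a minimal example, not a recommended pipeline: it assumes an uncompressed FASTQ file with exactly four lines per record, and the function name and file names are hypothetical.

```shell
# Keep the first N records of an uncompressed FASTQ file.
# Assumes the common 4-lines-per-record layout (no wrapped sequence lines).
# Arguments: $1 = input FASTQ path, $2 = number of records to keep.
take_fastq_records() {
    head -n "$(( $2 * 4 ))" "$1"
}

# Hypothetical usage: keep 10,000 reads for a quick practice run.
# take_fastq_records reads.fastq 10000 > practice.fastq
```

Note that the first N reads are not a random sample of the run, so this is suitable for learning how the tools behave, not for drawing biological conclusions.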

Following a structured approach, you will be able to do basic procedures like using the command line to execute simple tasks, finding reference sequences, running QC analysis, basic genome assembly, variant calling, and some omics analysis and annotation within a few weeks.

ADD REPLY
2

What is that article? What do you mean by "reproduce"?

How To Ask Good Questions On Technical And Scientific Forums

ADD REPLY
2

I am slightly confused. Does your supervisor want you to reproduce only certain parts of the paper or the entire paper? If your supervisor is asking you to reproduce the entire analysis of a Science paper (which is authored by ~100 authors, each potentially bringing a different analysis to the table), given that you are a novice in the field of bioinformatics, then the only way you should proceed is by actually changing supervisor. Maybe it is worth presenting the entire paper to your supervisor in a short presentation so that he/she can truly realize the gravity of the task they are demanding from you?

ADD REPLY
2

This is the most outlandish thing I have ever heard. Tell your supervisor/mentor Biostars thinks he is unfit to serve, should resign, and not be let within 100 Gb of any student.

  • asks a beginner to reproduce a recent Science article in 6 months
  • the article has 100+ co-authors, some of the best experts, some working for sequencing companies
  • the original study likely took many years to complete
  • 6 months is shorter than the review process
  • even if you knew how: the raw data consists of hundreds of files, each may be several TB in size
  • I doubt the mentor would be able to do this themselves, nor would they have the storage or compute resources to run these

Edit: Personally, I like the marathon comparison; however, in this context it is more like asking me to run it in 1 minute (100x). It is simply not possible.

ADD REPLY
1

"Reproduce" means to get the original code running on the original data, not "replicate". The 100+ co-authors thing would be more relevant if the OP had been asked to replicate the work. Reproducing is more reasonable, and in many instances might be as simple as cloning a repository and typing "make". Of course, in this case it sounds like it's much more demanding because of the data size and computing power needed, but (I can't judge) it might be a reasonable stretch task to achieve over six months, assuming the resources are available to set up servers that can handle data of this size and the computation involved. I still agree that not enough support is being given by the supervisor, of course.

ADD REPLY
2

It is irrelevant in this case how you define "replicate" or "reproduce". Have you even checked the storage requirements? The BioProject contains over 300 files; I bothered checking one of them, and it was >3 TB. You may easily need 100 TB, likely more, just to download everything (SRA gives NA as the project size; it is likely too big), and petabytes to run the computation in a proper HPC environment. Neither of us could run the complete analysis from scratch.
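To put the download alone in perspective, here is a back-of-the-envelope sketch. It takes the rough 100 TB figure above and an assumed sustained link speed of 1 Gbit/s (an illustrative number, not something measured), uses decimal units, and ignores protocol overhead and retries. The function name is made up for this example.

```shell
# Days needed to download a dataset at a sustained link speed.
# Arguments: $1 = size in decimal terabytes, $2 = speed in gigabits per second.
download_days() {
    awk -v tb="$1" -v gbps="$2" 'BEGIN {
        gigabits = tb * 8 * 1000        # 1 TB = 8,000 gigabits
        seconds  = gigabits / gbps      # transfer time at full speed
        printf "%.1f\n", seconds / 86400
    }'
}

# The 100 TB estimate over an assumed 1 Gbit/s link:
download_days 100 1
```

Under these assumptions the transfer takes about nine days of uninterrupted full-speed downloading, before any computation has started.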

ADD REPLY
1

As a novice in the field of bioinformatics with limited knowledge, how should I proceed?

I would start by reading that Science paper with over 100 authors and writing a summary of it, until I find the next PI next door or at the next school ^.^.

ADD REPLY
0

I am new to bioinformatics and I am also trying to make sense of things. If your mentor has given you a task, try to approach it and talk to chat gpt about complex terms that you can not understand. You will still have gaps, but you will have reproduced something. Then you can formulate questions from it and ask your mentor, or maybe ask the forum for answers. It is going to take a while to get everything right, and I don't think your mentor expects you to get everything right on the first attempt.

ADD REPLY
3

talk to chat gpt about complex terms that you can not understand

This is not good advice for a beginner. ChatGPT is often confidently and authoritatively incorrect, and beginners lack the knowledge to question things that could be wrong. One is better off reading blog posts from well-known researchers than relying on ChatGPT for anything important.

ADD REPLY
8
5 months ago
Mensur Dlakic ★ 26k

What your mentor has asked you to do is unreasonable, and not a little bit. This has nothing to do with you, because it would be unreasonable even if they asked someone who had 5 years of training in the field. That said, none of us know your exact situation, so some of the suggestions you received are also less than reasonable. It is not easy to challenge the authority of someone you want to work with, even when they are wrong.

I don't think training anyone in science by a sink-or-swim approach is justified, and it does not indicate a serious commitment from your mentor. My suggestion is to talk with your mentor and tell them that after researching the subject you have concluded that this is not something that can be done without guidance, nor is this a job for one person. There is a reason why teams of scientists work on these projects. I suggest you ask for a smaller project and more guidance. If that doesn't work and you could move to a different lab, that would be my next suggestion. I am not sure whether your livelihood depends on staying in that lab, so you may not have any other viable options, but wasting time on something you don't want to do or don't understand is a recipe for frustration, resentment and unhappiness.

ADD COMMENT
0

If reproduce means (as I think it does) "get hold of the original data and code and get it working", it might not be an unattainable stretch task, although it is going to be more about IT administration than the subject matter. I agree that sink or swim on this isn't good enough.

ADD REPLY
5
5 months ago

Perhaps give your mentor this one-liner, which is the complete sequence of the human genome (the hg38 assembly of it, anyway):

wget -qO- https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz | gunzip -c > hg38.fa
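If you do run it, a quick sanity check on the result is to count the sequences and bases in the file. This is only a sketch with a made-up function name; it assumes a plain, uncompressed FASTA file and counts every character on the sequence lines, including N and soft-masked lowercase bases.

```shell
# Print the number of sequences and the total number of bases in a FASTA file.
# Argument: $1 = path to an uncompressed FASTA file.
fasta_stats() {
    awk '/^>/ { n_seqs++; next }          # header line: count one sequence
         { n_bases += length($0) }        # sequence line: add its length
         END { printf "%d sequences, %.0f bases\n", n_seqs, n_bases }' "$1"
}

# Usage, with the file name from the one-liner above:
# fasta_stats hg38.fa
```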
ADD COMMENT
1

Perhaps he wants the hg19 version of the human genome; it is just a one-liner again:

wget -qO- https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz | gunzip -c > hg19.fa
ADD REPLY
0

I should note that this is a ridiculous answer. But you've been given a silly request. Fight fire with fire!

ADD REPLY
2
5 months ago

Perhaps ask your mentor who else is available to help you? Do you have committee members that can also help and advise you?

As with any field, bioinformatics is more about people than software and hardware.

One general bioinformatics resource which I highly recommend is biostarhandbook.com. It's $35 for a two-year subscription. Perhaps ask your mentor if they'd pay for this training resource for you. This resource is constantly updated, improved, and expanded.

ADD COMMENT
1
5 months ago

For a moment, I thought this referred to the original paper on the first human genome, published 20 years ago. That would have been quite a feat, as most of the tools and datasets are likely not to exist anymore! But it seems you are referring to this publication from last year: The complete sequence of a human genome.

This is still a very big challenge, though. I don't think it is really worth re-running all the commands and calculations, which would also be potentially expensive. You can, however, trace the workflow they used and prepare a slide deck about it, as a sort of journal club. The supplementary materials document contains a good description of each step, and could be a starting point. There is also a GitHub page, although it doesn't contain the actual code, only documentation and references; you will have to go through the other papers to find it.

It is important to note that this article is not alone. When groups complete this type of big consortium work, they tend to publish several papers to describe different aspects of it. You can find the full list on the GitHub page, and you will have to go through all these publications to find the one describing the assembly steps.

Your supervisor has given you a very tough task. It is potentially educational, and you can learn a lot from it. I wish I had the time to go through that paper in such detail! However, it is unlikely to give you opportunities for publication, unless you publish a commentary or figure out something smart from it. Speak with your supervisor about this, and ask them to guide you through the process. Don't let them just dump the work on you and leave you alone for six months.

ADD COMMENT
