Computer logistics of comparing genomes within the same individual
6.2 years ago

Hello,

I am attempting to compare two genomes from the same individual and find small differences between them. My idea is to write a script to find small windows (~15-20 bp) that are present in one genome (let's call it abnormal) but not the other (normal). The program will break each genome into ~10-20 bp windows, using a sliding window. The windows go into two databases, one for windows from a normal cell genome and one for windows from an abnormal cell genome. Each database would be keyed by the window sequence itself, and each key would link to every position in the genome where that window is found.

The idea is to do the following:

  1. For each window in the abnormal database, see if it is also present in the normal database. If it is, delete it.
  2. Return all windows that remain in the abnormal database, prioritizing those with the highest number of occurrences.
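
For small test data, the two steps above can be sketched in a few lines of Python. This is only an illustration of the idea as described, not a genome-scale solution: the function names and the toy sequences are made up, and everything is held in memory.

```python
from collections import defaultdict

def index_windows(sequence, k):
    """Map each k-length window to the list of 0-based start positions where it occurs."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index

def abnormal_only(normal_seq, abnormal_seq, k):
    """Return (window, positions) pairs found only in the abnormal sequence,
    ordered by number of occurrences (highest first), then alphabetically."""
    normal = index_windows(normal_seq, k)
    abnormal = index_windows(abnormal_seq, k)
    unique = {w: pos for w, pos in abnormal.items() if w not in normal}
    return sorted(unique.items(), key=lambda item: (-len(item[1]), item[0]))

# Toy data (invented for illustration): a single base differs between the two strings.
normal_seq = "ACGTACGTTGCA"
abnormal_seq = "ACGTACCTTGCA"
for window, positions in abnormal_only(normal_seq, abnormal_seq, k=4):
    print(window, positions)
```

Note that a single changed base produces k different abnormal-only windows, so the number of surviving windows overstates the number of underlying differences.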

I have done this for small test data. For real human genomes, there will be about 6 billion or so windows per database. Right now I have two ways I can think of handling this:

  1. Install OS and MySQL on a 1TB SSD, use script to directly populate databases. Use normal RAID for mass storage needed for original genomes. Use database queries to compare genomes.

  2. Install OS on 1TB SSD, with around 700 GB for swap. Use RAID for mass storage of genome and MySQL DB for results of comparing genomes. Instead of comparing databases within MySQL, implement simple hash tables in Perl script, let OS go to swap as structure builds. Do comparisons within Perl script and dump results to MySQL database on the RAID.
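
A rough back-of-the-envelope estimate can help when choosing between these two options. The sketch below assumes the worst case, where all 6 billion windows are distinct, and compares a packed encoding (2 bits per base plus a 4-byte counter) with a string-keyed table. The 64-byte per-entry overhead is a placeholder assumption, not a measurement of Perl hashes or MySQL rows, and neither figure includes the position lists.

```python
def packed_table_bytes(n_windows, k, count_bytes=4):
    """Lower bound: each k-mer packed at 2 bits per base, plus a fixed-size counter."""
    kmer_bytes = (2 * k + 7) // 8  # round up to whole bytes
    return n_windows * (kmer_bytes + count_bytes)

def string_keyed_table_bytes(n_windows, k, per_entry_overhead=64):
    """Rough guess for a hash keyed by the text of each window.

    per_entry_overhead is a hypothetical figure for pointers, hashes and
    bookkeeping; the real value depends on the language and data structure.
    """
    return n_windows * (k + per_entry_overhead)

N_WINDOWS = 6_000_000_000  # figure quoted in the question
K = 20
GB = 10 ** 9

print(f"packed lower bound: {packed_table_bytes(N_WINDOWS, K) / GB:.0f} GB")
print(f"string-keyed guess: {string_keyed_table_bytes(N_WINDOWS, K) / GB:.0f} GB")
```

Under these assumptions the packed table needs about 54 GB per genome and the string-keyed one about 504 GB, which is why a naive hash that spills into swap is likely to be very slow.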

What would be the best way to handle this? Is there an existing solution out there that does exactly what I'm trying to do?

Thank you in advance!

genome perl mysql WGS • 1.9k views
ADD COMMENT

Do you have a reference genome for this organism? If so, why not perform variant calling and compare the results?

ADD REPLY

That’s a good idea and I considered that too.

Unfortunately, the application I had in mind only works on a per-individual basis. The typical approach I've seen others do is derive drug targets based on genes found across an entire population. This idea is definitely in the personalized care category. Even the small variations between different individuals would throw things off. That's part of the reason I want to run this little script. I already know the variation between individuals in a population is too much, but the genome of cells within a person may be too similar, which would make the idea unworkable. It would be nice to vet the idea computationally first before trying any wet work.

ADD REPLY
6.2 years ago

First, stop and think. The last thing you want to do is re-invent the wheel. Plenty of people already do look at the differences between, say, normal and cancer cells. It's not like this is some brilliant idea that no one has ever heard of before. You need to understand what people already do, and why. Many people with PhD level understanding of biology and computer science work on this.

Forget about the computer specs. Do you understand that the raw data of "someone's genome" is not a single 3 billion letter long string? The technology doesn't work that way. And that a lot of 20-mer sequences will not be unique in the genome? Do you understand that since 99% of our DNA is shared by every human, comparing to a reference sequence works pretty much fine?

If simple perl hash table was all it took to destroy cancer cells, people would already be doing that. You are not the first person who knows enough computer science to throw around "RAID" and "MySQL" who is looking at this issue.

ADD COMMENT

Thanks for being so patronizing. I understand and considered everything you just said and even know a good deal of ‘what people do and why.’ I also amassed this knowledge by learning as I go, which is what I am attempting now. Being this harsh does not help anything. I understand the biology of what I am doing, I understand basic workflow and even recently learned how quite a few nextgen sequencing methods work. Yes, I understand the sequences are not on a single string, also aware many 20-mers will not be unique. No, I am not an expert with bioinformatics tools- yet. I am a quick study- I will learn bioinformatics and every other tool I need, whether people want to help or not, but help would be nice. People could certainly expect me to return the favor.

There is a very specific reason why the 99% similarity will not be good enough; you just decided to tear into me instead of inquiring further about my main idea. If you are any indication of this community, I will take my questions elsewhere. The thing is, everyone has to start somewhere. I am going back for my PhD soon, and I have an idea I want to vet to see if I can make it my project. I did run my idea by one of the chairs at MD Anderson, and she and her postdoctoral fellow liked it and said I should go back to school and see if I can leverage it for my PhD work.

To be certain, I am not re-inventing the wheel, but I have been kicked in the gut by people like you enough. Instead of offering genuine help by pointing me in the right direction, you want to preempt the idea entirely. Sometimes people learn through trying, even if you think they are going in the wrong direction. Sometimes people might even surprise you. I realize how arrogant it must be to think I could possibly have an idea no one has thought of, but how do you think innovation happens? How far would we get if everyone had the same mentality of ‘if it’s such a great idea then why hasn’t anyone done it?’ or ‘Good contributions cannot come from outside the field’? I have considered this and I still want to try to vet my idea as far as possible. Maybe in your mind my idea bears a passing resemblance to something others have tried, but I am taking an incrementally different approach which could make all the difference.

So thanks for your feedback, here’s a little of my own. If you think an idea is flawed (it also helps if you ask questions to learn what the whole idea is), point out a specific reason. Vague and condescending feedback helps no one. The chair I ran my idea by could have had the exact same reaction. She was well aware of how my education and experience paled in comparison to hers and could have torn my idea to shreds. Instead she listened curiously, thought about it, had a few questions, listened to my answers, and said yeah that might work, run with it!

ADD REPLY

Don't get so upset, swbarnes2's answer was pretty good given the level of information you provided in your question. If you want more helpful answers, describe your problem in more detail, and state why you think the current methodologies will fail.

Read "Tutorial: How To Ask Good Questions On Technical And Scientific Forums" for a very detailed guide on how to improve your questions.

ADD REPLY

I am not running my general idea past this forum yet (I definitely will if I get past vetting this part, because it would be nice if others could replicate it). Besides, I have already done so with experts in the field. The main question everyone I approached had (myself included) is whether there would be enough targets for my idea to use. My impression is that the people I approached could easily do it themselves; the chair I spoke with is an expert in targeted therapies and bioinformatics, for instance. What I need now is to acquire the skills necessary to answer that question and get back to her. I just wanted help with the tools involved.

I am no stranger to asking for help on scientific forums either. What further information is needed? I stated I wanted to establish a large database of k-mers from two different genomes (normal, abnormal) and compare the two, eliminating ones that are present in normal cells. I provided enough information to that effect, so no, I didn’t find swbarnes2 helpful at all.

ADD REPLY

"I provided enough information to that effect"

Do you have two chromosome-level assemblies, one for each cell line? Or do you have variants mapped? Phased or unphased variants? Or have you sequenced each line? On which sequencing platform? How much data?

Your original post implies (but does not state clearly) that you want to analyse two chromosome-level assemblies. Do you know how difficult it is to produce chromosome-level assemblies?

ADD REPLY

Ok, swbarnes2’s reaction makes a little more sense now. To clarify, I am not trivializing anyone’s education or experience, nor am I naive about the difficulties ahead. I absolutely feel intimidated by the sheer scope of the problem, especially producing chromosome-level assemblies. Intimidation alone doesn’t stop me. I take on big opponents all the time; not trying to be arrogant, but I win a surprising number of times by sheer perseverance. When I fail, I learn a great deal just through trying.

One of the better ways to get someone to feel the way you feel is to mirror them, which is why I responded the way I did to swbarnes2. I did look over his/her comment history and reputation first and I acknowledge he/she responds great under most circumstances. People get off on the wrong foot sometimes, won’t hold a grudge if he/she won’t.

Trying to respond at various stops along my trip, will respond to your more in-depth questions in a bit.

ADD REPLY

Better that we all get along and work together. Emotions/feelings are very easily misinterpreted on the WWW. Sometimes I feel that the way we interpret someone's comment/reply reflects our own frustrations more than anything, and I'm guilty of this too, many times adopting defensive tones.

Apologies for butting in - I volunteered for 7 years as an online counselor.

ADD REPLY

Hey that's a really good insight and I appreciate the intervention. You are right on about me being frustrated. Since we have a counselor present, let’s talk.

I believe swbarnes2 thinks he/she was helping, but have you ever come across some part of human nature that's aggravating? Imagine if your child had plagiocephaly and about 1/12 people who pass you by say "Aww cute helmet! Does your kid fall down a lot?" It’s a Markovian process, so guaranteed to happen again. Understandable, but aggravating.

People are conditioned to expect others to be fully qualified before they even attempt to tackle a problem, especially a hard one like cancer. I take an unconventional approach to hard problems. I use my intuition for general direction. Then, instead of acquiring needed expertise up front, I acquire it as I go. It usually works for me and even when it doesn't, I learn greatly.

For example, I designed and built an SLS 3D printer from the ground up. Most people I approached with this endeavor balked at my background. I was constantly informed that I would need expertise in multiple domains. I was constantly chided with something to the effect of "only big companies with a large team and a lot of resources can do that." I actually did it, pretty quickly too.

Check out these (extremely campy) videos we made for applying to a startup accelerator:

Due to 3D Systems stock tanking, we couldn’t secure the investment needed to go to production. As a dying effort, we tried crowdfunding. About halfway into it, we got contacted by all the major 3D printing companies. Most of them commented to the effect of "We've seen a few people try to do what you're doing, but this is the first time we've seen it done 100% correctly." It wasn't a total loss because I gained expertise in a number of technical fields.

I haven't found an effective way to educate people in advance about me. Thus, I get reactions analogous to the ones that kid’s parents might get. For example, here are some of the nicer (though still crummy) responses I get:

"Generally speaking, I’d guess that someone without an extensive professional training and research experience in the area is unlikely to come up with an idea that many, many people who are devoting their careers to this problem somehow missed. Sorry to be so brutally honest, but I would not be doing you a favor otherwise." - Name redacted

"You're basically asking if <x> can be used for therapeutics. There are literally thousands and thousands of people focusing their lives on researching these topics for years. Based on your questions, I'm going to assume you're very far removed from this stuff, which makes it tricky to answer except in a ELI5 way." - Name redacted

Now look at swbarnes2's comment:

"It's not like this is some brilliant idea that no one has ever heard of before. You need to understand what people already do, and why. Many people with PhD level understanding of biology and computer science work on this." ... "If simple perl hash table was all it took to destroy cancer cells, people would already be doing that. You are not the first person who knows enough computer science to throw around "RAID" and "MySQL" who is looking at this issue."

It's an understandable reaction, but aggravating. Approaching problems as I do is a unique affliction, just like plagiocephaly. I chose that condition as an example because it's not super rare, but uncommon enough that people might have a seemingly innocent, but hurtful response. Likewise, there is little prospect for preparing the people I approach for help in advance. I have to endure this reaction from each community I approach.

Look, I am all for making nice but let’s be sincere. My interpretation of swbarnes2's message was dead-on. If anyone thinks I am a victim of the Dunning-Kruger effect, I can understand. I hope I can get understanding in return. I tackle problems differently than most and I am met with constant incredulity. For the first time I am confronting the fact that I cannot efficiently solve a problem on my own. I'm being so cringeworthy, but it's just too big and I need help. The reaction I pointed out happens too often and those words are deeply felt. They are especially troubling, since they could potentially discourage others from helping.

Staying true to the analogy, I’m sure that if that child’s parent snapped back angrily the passerby would be confused and hurt. I should have just explained more. So if you’re reading this, swbarnes2, I am sincerely sorry.

Look, I am not a mere dilettante- I will gain a substantial depth of knowledge as one would expect. I am confident I can do this, but it will take help from everyone. I realize it's asking for an investment, but I promise I will contribute back to this community in return, especially once I have results.

ADD REPLY

Not a counselor, mate.. there's value in just listening to people, be it in person or online. Most just want to feel that at least one person cares about them. During Christmas (in the northern hemisphere, due to the cold and low light), the number of those in distress does increase noticeably. Best of luck in your endeavours!

ADD REPLY

Ah, well it's just a title right? You help people and have emotional intelligence. It's really admirable. Yes, I guess I am in a bit of distress. It's cool you noticed and offered your help. Thanks for wishing me luck, I will need every bit of it.

ADD REPLY

Look at the message on my profile!

ADD REPLY

OK, sorry for the delay. Yes, I am absolutely intimidated by the task ahead.

As you correctly surmised, next-gen high-throughput sequencing enables this idea. In my wildest dreams, I would like individual chromatids to be pulled from several live cells (both normal and metastatic) around anaphase and sequenced using any high-throughput method that can still detect things like SNPs. I realize that's not trivial either, but I don't need to worry about this yet. Couldn't I get a rough estimate from much less?

Another complication is coping with small genetic variations between normal cells and the larger variations between tumor colonies. I think I have a way around this, but it’s not elegant. Ultimately the bioinformatics part will pare down the possible k-mers to target. It could be that very few survive this process, which might answer why no one has tried/done this yet.

The reason to pare down the number of candidates is that the next step will get expensive quickly if I don’t. At this point, a drug vector will be synthesized for each and then an assay (with more sequencing) will be used to confirm which vectors to keep. Vectors that show activity on normal cells or no activity on cancer cells in the assay are excluded. Eventually, the vectors that are kept will be produced at high enough concentration to infuse a PDTX.

As you probably know, combating resistance is especially important for stage-IV patients whose cancer must be managed as a chronic condition. I am hoping multiple vectors administered as a kind of cocktail will cope well against an evolving system like cancer. Doctors are administering multiple drugs at once in practice, but it's usually two or three at most because it becomes intolerable for the patient. Since all the vectors are pretty much the same, I reason that if the patient tolerates one of them it should scale nicely for the rest. If their cancer evolves resistance, it might even be possible to repeat the protocol. The different vectors are made relatively the same way, so making a few dozen that target different k-mers scales well cost-wise. I did confirm with a few biotech companies that they could synthesize the vectors before investing much time in this idea. The only negative feedback I got was from a smaller one, which said ‘You propose an interesting idea. Sorry if this is inconvenient for you, but we don’t have the resources to help you with your project.’ The larger ones found it pretty interesting and want to hear back if the rest pans out. This is all way down the line though, but figured I’d talk more about it.

To answer your questions (as best I can, still learning): it looks like they used the standard HGSC cancer analysis workflow. So far I have data from two cancer patients rolling in; I am estimating around 4-5 TB of data. From looking through some of the other directories, it looks like they already did the following: variant calling with PInDel, Atlas-SNP, and Atlas-Indel; raw PInDel output; and viral analysis. The sequences are in FASTQ, with WGS for both normal and tumor cells, around 16 GB apiece. There are also huge (to me at least) BAM files, around 0.2-0.5 TB apiece.

I am not sure if they already did the chromosomal assemblies, to be honest. Forgive me, but how would I be able to tell?

Here's a look at one of the directories:

PATIENT-N-WEX.read2.fastq.bz2, 4 GB
PATIENT-N-WGS.bam, 210 GB
PATIENT-N-WGS.lane1.read1.fastq.bz2, 15 GB
PATIENT-N-WGS.lane1.read2.fastq.bz2, 14 GB
PATIENT-N-WGS.lane2.read1.fastq.bz2, 15 GB
PATIENT-N-WGS.lane2.read2.fastq.bz2, 15 GB
PATIENT-T-WEX.bam, 20 GB
PATIENT-T-WEX.read1.fastq.bz2, 3 GB
PATIENT-T-WEX.read2.fastq.bz2, 3 GB
PATIENT-T-WGS.bam, 402 GB
PATIENT-T-WGS.lane1.read1.fastq.bz2, 15 GB
PATIENT-T-WGS.lane1.read2.fastq.bz2, 14 GB
PATIENT-T-WGS.lane2.read1.fastq.bz2, 15 GB
PATIENT-T-WGS.lane2.read2.fastq.bz2, 15 GB
PATIENT-T-WGS.lane3.read1.fastq.bz2, 17 GB
PATIENT-T-WGS.lane3.read2.fastq.bz2, 16 GB
PATIENT-T-WGS.lane4.read1.fastq.bz2, 14 GB
PATIENT-T-WGS.lane4.read2.fastq.bz2, 14 GB

ADD REPLY
6.2 years ago
h.mon 35k

You may try to construct k-mer profiles and find the differing k-mers, as was done in "Efficient identification of Y chromosome sequences in the human and Drosophila genomes".
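
To make "k-mer profiles" concrete, here is a minimal Python sketch that counts k-mers directly from reads and keeps those seen in one sample but never in the other. It is not the method of the paper mentioned above, only an illustration. The function names, the toy reads and the `min_count` threshold (a crude guard against k-mers created by sequencing errors) are all assumptions, and k-mers are stored in canonical form because reads come from both strands.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_profile(reads, k):
    """Count canonical k-mers over all reads, skipping windows with non-ACGT characters."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= _BASES:
                counts[canonical(kmer)] += 1
    return counts

def sample_specific(tumor_counts, normal_counts, min_count=2):
    """Keep k-mers seen at least min_count times in tumor and never in normal."""
    return {kmer: n for kmer, n in tumor_counts.items()
            if n >= min_count and kmer not in normal_counts}

# Toy reads, invented for illustration only.
normal_reads = ["ACGTTG", "CGTTGA"]
tumor_reads = ["ACGATG", "CGATGA", "ACGATG"]
specific = sample_specific(kmer_profile(tumor_reads, 4), kmer_profile(normal_reads, 4))
print(sorted(specific.items()))
```

For whole-genome read sets an in-memory `Counter` will not scale; dedicated k-mer counters such as Jellyfish or KMC exist for that purpose.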

edit: in view of your comments, maybe vg could also be of interest to you.

ADD COMMENT

Thank you, that looks like a really good resource. I really appreciate the direction. I am in the middle of driving up to Houston, will deep read it tonight when I get back.

ADD REPLY
6.2 years ago

You want to be looking at how people detect either a) somatic mosaicism or b) tumor-specific (somatic) variants. There is a wealth of tools for doing this by setting one sample as the normal and comparing to find the changes in the other. From your description, it seems as though your method would be grossly inefficient, and you would probably be wasting a lot of time implementing it. (I say this not to criticize you, but to try to help you find the most efficient way to get the answers that you're after!)

If you've already researched these tools and found that they will not meet your needs, then you should describe why, and we may be able to help you determine the best way to extract the information that you need to answer your question.

ADD COMMENT

No worries, I can sense you are genuinely trying to help me. I can handle criticism as long as it’s not vague and it challenges the idea on its merits rather than on how audacious or naive I am.

I totally get the efficiency part, but the k-mers I use have to be found nowhere else in normal cells (or perhaps nowhere outside ORFs), or at least a substantial number of them have to have this property. When I do the wet lab stuff (if my idea even progresses that far), I will eliminate the targets which harm normal cells, but I want to make sure that they are in the minority. I wrote a script using BioPerl and generated k-mers from two Homo sapiens WGS normal/metastatic pairs. I had to sign an NDA first, under the condition of not trying to re-identify the participants or discuss other details. I took a look at the data and did a low-order, back-of-the-envelope calculation of how huge this k-mer database would be, so that’s when I came here asking for help.

Largely, I am ignorant of the available tools, which is why I am here in the first place. As an undergrad I did basic translation of ORFs in WGS of haploid (mostly parasite) protozoa, ran BLAST searches against protein databases, and curated the results for my PI. That summer program didn't make me a bioinformatics expert, but it left me with a taste of how powerful bioinformatics can be. I have to admit the PI coddled me a bit. I have my degree, and will be seeking a terminal degree very soon. I don’t need to be coddled this time around, just a running start like more and more people here are starting to provide. So thanks again for that; it’s exactly what I am looking for.

ADD REPLY

This seems quite analogous to the problem of identifying tumor-specific antigens. You might look at the ways in which that's implemented in immunotherapy pipelines (e.g. http://pvactools.org) for some rough ideas.

I still don't really understand your end goal, but I'd start by:

  1. Identifying all changes between the two samples at the base-pair level (using one or more somatic mutation callers).
  2. Using faidx or similar to extract the flanking sequence for each variant (so that you can get the whole k-mer). Don't forget to account for multiple hits in your windows, and phase them appropriately.
  3. BLASTing those altered sequences back against the reference genome to identify those that are really unique. If you need to be more specific, get germline calls from your normal, spike them into the reference genome, then BLAST against that one.
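
As a rough illustration of the flanking-sequence and uniqueness steps, here is a minimal Python sketch for a single-base substitution. It uses plain string slicing in place of faidx and a substring test in place of BLAST, so it is only a toy: it ignores indels, nearby variants that would need phasing, the reverse strand, and germline variants. The function names and example sequence are invented.

```python
def kmers_spanning_snv(reference, pos, alt, k):
    """Return every k-mer of the altered sequence that covers a substituted base.

    reference: reference sequence as a string
    pos:       0-based position of the substituted base
    alt:       the alternate base observed in the tumor
    """
    if not 0 <= pos < len(reference):
        raise ValueError("pos is outside the reference")
    altered = reference[:pos] + alt + reference[pos + 1:]
    first = max(0, pos - k + 1)          # leftmost window that still covers pos
    last = min(pos, len(altered) - k)    # rightmost window that still covers pos
    return [altered[start:start + k] for start in range(first, last + 1)]

def absent_from(kmers, background):
    """Keep k-mers that do not occur anywhere in the background sequence (forward strand only)."""
    return [kmer for kmer in kmers if kmer not in background]

# Toy example (invented): a G->C change at position 10.
reference = "CCTTACGTACGTTGCA"
candidates = kmers_spanning_snv(reference, pos=10, alt="C", k=4)
print(candidates)
print(absent_from(candidates, reference))
```

In this toy case one of the four altered k-mers also occurs elsewhere in the reference and is dropped, which is the same filtering the BLAST step performs at genome scale.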

ADD REPLY

So I may have found a way to kill a cell based on directly recognizing specific, short sequences in its DNA. And before anyone asks, no it's not using CRISPR to insert some kind of death gene, repairing damaged genes or by creating a knockout. I will share the specifics later after I get past this critical stage, mainly to avoid reactions from earlier.

Rest assured, I have run this idea by experts who understand the biology involved and it's pretty solid. The hard part everyone points out is finding sequences that are only present in the cancer cells. Also, for this idea to work, I will need multiple targets. I figured breaking each genome into short k-mers and eliminating the common ones would give a good estimate on how many targets might emerge.

I just want to run a script to confirm whether or not there are enough targets for my idea to work. If I run it on the genomes I have and only a handful of targets are found, it's obviously not going to work and I will move on.

ADD REPLY
