5.3 years ago
prashant10991
I have a BAM file and the corresponding VCF file from an external source. I am trying to slice the DNA around each mutation location to feed the slices into an algorithm.
What is the right way to do this? I see two options:
- I can create a genome assembly/contigs from the BAM file and then slice it using the mutation information from the VCF file.
- I can take a reference genome and then slice it using the positions in the VCF file.
What are the pros and cons of either method?
Please tell us more about the goal of your "algorithms". Depending on that, one way or the other might be better.
In general: if your variants are not phased and you create a consensus sequence from which you slice a region, you cannot be sure that variants next to each other are on the same haplotype (the same chromosome copy). You need to know whether this is important for your "algorithms".
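To make the phasing point concrete, here is a minimal Python sketch of how one might check it from the genotype (GT) strings in the VCF. In VCF, `|` separates phased alleles and `/` separates unphased ones. The function names `is_phased` and `same_haplotype` are hypothetical, and the sketch assumes both genotypes belong to the same sample and the same phase set, which it does not verify.

```python
from typing import Optional


def is_phased(gt: str) -> bool:
    """Return True if a VCF GT string such as '0|1' is fully phased."""
    # VCF uses '|' between phased alleles and '/' between unphased alleles.
    return "|" in gt and "/" not in gt


def same_haplotype(gt_a: str, gt_b: str) -> Optional[bool]:
    """Tell whether two variants carry a non-reference allele on the same haplotype.

    Returns None when either genotype is unphased, because the answer is
    then unknown. Assumes (not checked here) that both genotypes come from
    the same sample and the same phase set.
    """
    if not (is_phased(gt_a) and is_phased(gt_b)):
        return None
    alleles_a = gt_a.split("|")
    alleles_b = gt_b.split("|")
    # A haplotype is shared if the same column is non-reference in both calls.
    return any(
        a not in ("0", ".") and b not in ("0", ".")
        for a, b in zip(alleles_a, alleles_b)
    )
```

If this returns `None` for neighbouring variants, a consensus built from them may combine alleles that never occur together on one chromosome copy.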
Specifically, I am trying to learn a distributed representation of variants using a strategy similar to word2vec. So I want to slice DNA windows of length 2*K+1 centered on a mutation. But I am unsure what the correct way to slice the DNA is so that most of the information is preserved.
To state my doubt more clearly: is it wise to slice the reference genome using a patient-specific VCF file, or should one first create a genome assembly/contigs from the patient-specific BAM (since the read length is short and K > 200)?
How much information is lost in either case?
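For illustration, the reference-based option could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: `slice_window` is a hypothetical helper, the chromosome sequence is assumed to be already loaded as a plain string, `pos` is the 1-based VCF POS, and any other variants that fall inside the window are ignored (they stay as reference bases).

```python
def slice_window(ref_seq: str, pos: int, ref: str, alt: str, k: int,
                 use_alt: bool = False) -> str:
    """Return K flanking bases on each side of a variant plus the allele itself.

    ref_seq : chromosome sequence as a string (assumed already loaded)
    pos     : 1-based position of the variant, as in the VCF POS column
    ref/alt : REF and ALT alleles from the VCF record
    k       : number of flanking bases on each side
    use_alt : put the ALT allele in the centre instead of REF
    """
    start = pos - 1  # convert to a 0-based index
    # Sanity check: the VCF REF allele must match the reference sequence.
    if ref_seq[start:start + len(ref)].upper() != ref.upper():
        raise ValueError(f"REF allele {ref!r} does not match reference at {pos}")
    left = ref_seq[max(0, start - k):start]
    right = ref_seq[start + len(ref):start + len(ref) + k]
    centre = alt if use_alt else ref
    # For an SNV away from the chromosome ends this has length 2*k + 1.
    return left + centre + right


# Toy example: an A>G SNV at position 5 of a 10-base "chromosome".
toy_ref = "ACGTACGTAC"
ref_window = slice_window(toy_ref, 5, "A", "G", k=2)
alt_window = slice_window(toy_ref, 5, "A", "G", k=2, use_alt=True)
```

Windows near the ends of a chromosome come out shorter than 2*K+1, and indels change the window length, so both cases would need an explicit policy (padding, skipping, or trimming).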