Question: Modifying fasta file based on vcf information
Jautis (United States) wrote 2.4 years ago:

Hello, I'm trying to build a FASTA file that represents the genome of a sub-population. I currently have the reference genome (FASTA) and a VCF file with two individuals from the sub-population. How can I randomly substitute one of the genotype calls from the VCF into the reference genome at the corresponding position?

Thank you very much!

reference genome fasta vcf • 1.9k views
modified 5 months ago by adeena_hassan • written 2.4 years ago by Jautis

I know of GATK's FastaAlternateReferenceMaker, but I have been unable to find out how it decides which genotype to substitute for the reference when several are listed. What I'm looking to do is randomly substitute one in (I'd also like to be able to just pick the most common, but that is a separate query).
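
For the "pick the most common" variant mentioned above, a minimal Python sketch could look like the following. It is not part of GATK or any other tool named in this thread; the function name `most_common_allele` and the input layout (one tuple of alleles per individual) are assumptions made for illustration.

```python
from collections import Counter


def most_common_allele(genotypes):
    """Return the allele seen most often across the genotype calls at one site.

    `genotypes` is a list of allele tuples, one per individual,
    e.g. [("A", "A"), ("A", "C")] for two diploid samples.
    Ties go to the allele that was encountered first.
    """
    counts = Counter(allele for gt in genotypes for allele in gt)
    return counts.most_common(1)[0][0]


# Two individuals called A/A and A/C: A is seen three times, C once.
print(most_common_allele([("A", "A"), ("A", "C")]))  # prints A
```

Pooling alleles rather than whole genotypes means a heterozygous call contributes one vote to each of its alleles.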

modified 2.4 years ago • written 2.4 years ago by Jautis

Matt Shirley (Cambridge, MA) wrote 2.4 years ago:

You can use the FastaVariant class in the pyfaidx module if you're comfortable with a bit of Python. Sounds like you want something very specific so a script might be your best option. 
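
As a rough idea of what such a script has to do, here is a minimal pure-Python sketch that writes chosen alleles into a sequence. It does not use pyfaidx or its FastaVariant class; the function name `apply_snps` and the `{position: (ref, allele)}` input layout are assumptions for illustration, and only single-base substitutions are handled.

```python
def apply_snps(sequence, snps):
    """Return a copy of `sequence` with single-base substitutions applied.

    `snps` maps 1-based positions (as in the VCF POS column) to
    (ref, new_allele) pairs. Indels are rejected because they would
    shift the coordinates of every later position.
    """
    bases = list(sequence)
    for pos, (ref, allele) in snps.items():
        if len(ref) != 1 or len(allele) != 1:
            raise ValueError(f"only SNPs are supported (position {pos})")
        # Guard against a VCF that was called on a different reference.
        if bases[pos - 1].upper() != ref.upper():
            raise ValueError(f"reference mismatch at position {pos}")
        bases[pos - 1] = allele
    return "".join(bases)


print(apply_snps("ACGTACGT", {2: ("C", "T"), 8: ("T", "G")}))  # prints ATGTACGG
```

The reference check is cheap and catches the common mistake of pairing a VCF with the wrong genome build.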

written 2.4 years ago by Matt Shirley

The FastaVariant tool does seem like it should suit my purposes, but I'm getting a stack smashing error when I try to run it. Is there a forum for the program where I can post this (I haven't been able to find one)?

written 2.4 years ago by Jautis

Yes, please report this in the issue tracker of the GitHub repository.

written 2.4 years ago by Matt Shirley

Len Trigg (New Zealand) wrote 2.4 years ago:

The simulation tools that are part of RTG Core should let you do this, depending on exactly what you want. For example:

# RTG commands use a formatted structure (SDF) for random access to parts
# of the data, so format the reference
rtg format -o reference.sdf reference.fasta

# Annotate your population VCF with allele frequency information from the
# samples (if it doesn't already have it)
rtg vcfannotate -i input.vcf.gz -o popvariants.vcf.gz --fill-an-ac

# Generate a new VCF sample column, randomly generated based on allele
# frequency information in the population vcf, plus an SDF containing the
# genome of that individual
rtg samplesim -t reference.sdf -i popvariants.vcf.gz -o pop_plus_synthetic.vcf.gz --output-sdf synthetic_ind.sdf --sample synthetic_ind

# Extract FASTA from the SDF
rtg sdf2fasta -i synthetic_ind.sdf -o synthetic_ind.fasta.gz

The defaults assume a diploid organism, so the output will contain two copies of each chromosome. If you want to get fancy, you can configure your reference with information about autosomes/sex chromosomes and you will end up with the appropriate sequences in the output. We use these tools for simulating small populations/pedigrees of human genomes, including random population variants (popsim), founder individuals (samplesim), offspring (childsim), and de novo variants (denovosim).

 

 

 

written 2.4 years ago by Len Trigg

Thank you. Looking at the documentation, that seems like it should work very well. However, I'm having trouble getting rtg to run. I have an older version of Java in /usr/bin/java and don't have permission to change it. Any idea how I can use a version of Java saved in ~/java instead?

written 2.4 years ago by Jautis

In the RTG installation directory there is a configuration file, rtg.cfg, where you can set RTG_JAVA to point at whichever version of Java you want. (There is more information in the user manual, and if you have further questions, it may be more appropriate to post them to the rtg-users discussion group.)

written 2.4 years ago by Len Trigg

Ashutosh Pandey (Philadelphia) wrote 2.4 years ago:

If there are multiple variants at a position, GATK's FastaAlternateReferenceMaker will randomly select one of the alleles or genotype calls. See the caveat section: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_fasta_FastaAlternateReferenceMaker.php

 

modified 2.4 years ago • written 2.4 years ago by Ashutosh Pandey

I see that, but I would like the reference allele to remain an option as well. For example, if the reference is A and the individuals are A/A and A/C, I would like a 75% chance that the allele written to the alternate reference is A and a 25% chance that it is C.
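
The 75%/25% behaviour described above amounts to drawing one allele from the pool of all called alleles, reference included. A minimal Python sketch, assuming genotypes arrive as VCF GT strings such as "0/1" (the function names `allele_weights` and `draw_allele` are made up for illustration and are not part of GATK):

```python
import random
from collections import Counter


def allele_weights(ref, alts, gt_fields):
    """Map each observed allele to its frequency among the called genotypes.

    `gt_fields` are VCF GT strings such as "0/0" or "0|1"; index 0 is REF
    and 1, 2, ... are the ALT alleles. Missing calls (".") are skipped.
    If nothing was called, the reference allele gets weight 1.
    """
    alleles = [ref] + list(alts)
    counts = Counter()
    for gt in gt_fields:
        for idx in gt.replace("|", "/").split("/"):
            if idx != ".":
                counts[alleles[int(idx)]] += 1
    total = sum(counts.values())
    if total == 0:
        return {ref: 1.0}
    return {allele: n / total for allele, n in counts.items()}


def draw_allele(weights, rng=random):
    """Draw one allele at random, in proportion to its weight."""
    alleles = list(weights)
    return rng.choices(alleles, weights=[weights[a] for a in alleles], k=1)[0]


# Reference A, one ALT allele C, individuals called A/A and A/C.
weights = allele_weights("A", ["C"], ["0/0", "0/1"])
print(weights)  # prints {'A': 0.75, 'C': 0.25}
print(draw_allele(weights))
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, which is useful when the substituted genome has to be regenerated later.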

written 2.4 years ago by Jautis

adeena_hassan wrote 5 months ago:

Hello Jautis,

I have the same task. Could you explain how you did it? Specifically, how do I apply all the variants in a VCF file to the reference genome to create a sample genome? I'm new to bioinformatics. Thanks.

modified 5 months ago • written 5 months ago by adeena_hassan

I used the GATK method suggested by Ashutosh Pandey. It worked pretty well.

written 5 months ago by Jautis