Require suggestion for python tools that can implement VCF file into FASTA files
2
0
Entering edit mode
10 weeks ago
Rachayita ▴ 10

I am working on 1001 genomes from the 1001 genomes project. I wish to implement the VCF file into FASTA files incorporating the indels, etc to better work on the genomes. I know about GATK but is there any other tool that can help me achieve this, preferably in python environment?

fasta vcf • 881 views
ADD COMMENT
0
Entering edit mode

First do you really mean ‘1001’ genomes or ‘1000’ genomes? Then, what do you mean by ‘implement’, do you really mean convert? Finally, it’s not clear how introducing indels into a fasta makes any sense. Why would having the information in fasta format help?

ADD REPLY
1
Entering edit mode

Thank you for your response. 1001 genome project is a catalog of Arabidopsis thaliana genetic variation. I want to incorporate the SNPs of the VCF into the reference genome so that I have separate fasta files for all the individual genomes corresponding to all the individual VCFs.

ADD REPLY
2
Entering edit mode
9 weeks ago

Have a look at bcftools consensus:

Create consensus sequence by applying VCF variants to a reference fasta file.

(Not a Python tool, though.)

ADD COMMENT
0
Entering edit mode

Could you please give me an idea on how to do that? Thank you!

ADD REPLY
0
Entering edit mode

There's an examples section in the link.

ADD REPLY
1
Entering edit mode
9 weeks ago
Alban Nabla ▴ 20

I think vcf-consensus-builder may be what you are after.

ADD COMMENT
0
Entering edit mode

Thank you for your response. I looked into vcf consensus builder, but looks like it needs a depth file to work. I have just the vcf files and the reference genome fasta. Any other alternatives are highly appreciated.

ADD REPLY
0
Entering edit mode

I am afraid if you don't have access to the .bam files or a depth file, it will be hard to build a consensus.

I drafted a (clumsy) script to integrate the vcfs for one sequence/chromosome:

from Bio.Seq import Seq, MutableSeq
from cyvcf2 import VCF
from Bio import SeqIO

# parse the files and extract one chromosome
genome = SeqIO.parse("genome.fasta", "fasta")
chrom = next(genome)
chrom1A = chrom
chrom1B = chrom
vcf = VCF('variations.vcf')
chromvars = vcf('1')

def integrate(vcf,chrom):
    chromA = MutableSeq(chrom.seq)
    chromB = MutableSeq(chrom.seq)
    shiftA = 0
    shiftB = 0
    for var in vcf:
        if var.var_type == 'snp':
            if len(var.ALT) == 1:
                chromA[var.POS+shiftA-1] = var.ALT[0]
                chromB[var.POS+shiftB-1] = var.ALT[0]
            if len(var.ALT) == 2:
                chromA[var.POS+shiftA-1] = var.ALT[0]
                chromB[var.POS+shiftB-1] = var.ALT[1]
        if var.var_type == 'indel':
            reflen = len(var.REF)
            if len(var.ALT) == 1:
                chromA = chromA[:var.POS+shiftA-1] + var.ALT[0] + chromA[var.POS+shiftA+reflen:]
                chromB = chromB[:var.POS+shiftB-1] + var.ALT[0] + chromB[var.POS+shiftB+reflen:]
                shiftA = shiftA + len(var.ALT[0]) - reflen
                shiftB = shiftB + len(var.ALT[0]) - reflen
            if len(var.ALT) == 2:
                chromA = chromA[:var.POS+shiftA-1] + var.ALT[0] + chromA[var.POS+shiftA+reflen:]
                chromB = chromB[:var.POS+shiftB-1] + var.ALT[1] + chromB[var.POS+shiftB+reflen:]
                shiftA = shiftA + len(var.ALT[0]) - reflen
                shiftB = shiftB + len(var.ALT[1]) - reflen
    seq1 = Seq(chromA)
    seq2 = Seq(chromB)
    return seq1, seq2

# integrate using the above function
chromA, chromB = integrate(chromvars,chrom)
chrom1A.seq = chromA
chrom1B.seq = chromB

# output to fasta
SeqIO.write(chrom1A, 'modchrom1A.fsa', 'fasta')
SeqIO.write(chrom1B, 'modchro1B.fsa', 'fasta')

This is for one chromosome, but you can iterate through a full genome and automate the whole process. I am sure there are better ways but I hope this helps.

ADD REPLY

Login before adding your answer.

Traffic: 1246 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6