Question: vcf to bam
1
gravatar for Madhuri Haque
5.1 years ago by
Uniklinik RWTH Aachen
Madhuri Haque20 wrote:

If I have a .vcf file, how can I convert it to .bam file?

bam vcf • 6.8k views
ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by Madhuri Haque20
1

In general, VCF files are derived from bam files. It would be more helpful if you can put the context of your question here. For eg. why you need to convert vcf to bam?

ADD REPLYlink written 5.1 years ago by cpad011214k

to use it in ANGSD for Tajima's D analysis. I've collected my data from 1000 genome for a specific population

ADD REPLYlink written 5.1 years ago by Madhuri Haque20

Do you have bedtools?  Also, what format is your vcf?

ADD REPLYlink written 5.1 years ago by bioguy24190

VCF=variant call format

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.1 years ago by cpad011214k

yes, I do. my .vcf is in ##fileformat=VCFv4.1

ADD REPLYlink written 5.1 years ago by Madhuri Haque20

I also tried it with bedtools. I first converted my .vcf file to .bed file using plink. Then I used bedToBam to convert it to .bam file. But ANGSD couldn't read the bam file. Even I couldn't view it with samtools. Its really depressing!

ADD REPLYlink written 5.1 years ago by Madhuri Haque20

It is not right. BAM = binary alignment, VCF = variant calling file, you can convert VCF to csv, use VCF miner, VCF tools, NextGeneSoftware, perhaps..

ADD REPLYlink written 5.1 years ago by vassialk190

Thank you to all. I want to use the derived .bam file in ANGSD for Tajima's D analysis.

ADD REPLYlink written 5.1 years ago by Madhuri Haque20
1

vcftools has a --TajimaD <integer> option, maybe it is what you want?

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.1 years ago by h.mon31k

what if I want to perform PBS (population branch statistics). Is there any software that can take vcf file?

ADD REPLYlink modified 11 months ago by RamRS30k • written 5.1 years ago by Madhuri Haque20
2

To keep the forum tidy, open a new question instead of using the space reserved for answers to your original question.

ADD REPLYlink written 5.0 years ago by h.mon31k
4
gravatar for Pierre Lindenbaum
5.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

EDIT: found that one in GATK https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_simulatereads_SimulateReadsForVariants.php

Given a set of variants, this tool will generate simulated reads that support the input variants.

I wrote an experimental tool to create a BAM from a VCF: https://github.com/lindenb/jvarkit/wiki/VcfToBam

$ java -jar dist-1.139/vcf2bam.jar -R ref.fa  samtools.vcf.gz 2> /dev/null | grep -v "100="

@HD VN:1.5  SO:unsorted
@SQ SN:seg1 LN:5101
@SQ SN:seg2 LN:4000
@RG ID:NA12878  SM:NA12878  LB:illumina DS:NA12878
@RG ID:NA12891  SM:NA12891  LB:illumina DS:NA12891
@RG ID:NA12892  SM:NA12892  LB:illumina DS:NA12892
@CO Generated with -R /home/lindenb/src/ngsxml/test/ref/ref.fa /home/lindenb/src/ngsxml/OUT/Projects/Proj1/VCF/samtools/Proj1.samtools.vcf.gz
0000003609  83  seg1    1449    60  99=1X   =   949 -500    TGCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12891    NM:i:1
0000003610  147 seg1    1449    60  99=1X   =   949 -500    TGCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12891    NM:i:1
0000003611  147 seg1    1449    60  99=1X   =   949 -500    TGCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12892    NM:i:1
0000003612  83  seg1    1449    60  99=1X   =   949 -500    TGCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12892    NM:i:1
0000003615  83  seg1    1450    60  98=1X1= =   950 -500    GCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12891    NM:i:1
0000003616  147 seg1    1450    60  98=1X1= =   950 -500    GCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12891    NM:i:1
0000003617  147 seg1    1450    60  98=1X1= =   950 -500    GCTTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12892    NM:i:1
0000003627  147 seg1    1452    60  96=1X3= =   952 -500    TTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12891    NM:i:1
0000003629  147 seg1    1452    60  96=1X3= =   952 -500    TTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12892    NM:i:1
0000003630  147 seg1    1452    60  96=1X3= =   952 -500    TTTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12892    NM:i:1
0000003633  147 seg1    1453    60  95=1X4= =   953 -500    TTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCTC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12891    NM:i:1
0000003635  83  seg1    1453    60  95=1X4= =   953 -500    TTAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCTC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12892    NM:i:1
0000003640  147 seg1    1454    60  94=1X5= =   954 -500    TAGAAAGCATTCCAAAATCTCTTACCAGTTTTATCTCCTATGAAAGTCCTTCACACTTTCTCTCATTTAAACTTTATTGCATTTTCCTCACTTTCTCTCA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    RG:Z:NA12891    NM:i:1
ADD COMMENTlink modified 5 weeks ago by RamRS30k • written 5.1 years ago by Pierre Lindenbaum131k
3
gravatar for John
5.1 years ago by
John12k
Germany
John12k wrote:

I have also written a tool to go from VCF to BAM, called VCF2BIN.py

#!/usr/bin/env python
import re
import sys
import pysam
import string
import random
exit()

inputVCF = sys.argv[-1]
chromosomes = []
with open(inputVCF, 'rb') as f:
    for row in f:
        if row[0] != '#': break
        if row[2:8] == 'contig':
            for datum in row[10:-2].split(','):
                if datum.split('=')[0] == 'ID': ID = datum.split('=')[1]
                if datum.split('=')[0] == 'length': length = int(datum.split('=')[1])
            chromosomes.append( {'LN':length, 'SN':ID} )
header = { 'HD': {'VN': '1.0'}, 'SQ': chromosomes }
outfile = pysam.AlignmentFile('./useless.bam', "wh", header=header)
for x in range(0,10000):
    chromosome = random.choice(chromosomes)
    readLen = random.randrange(40,80)
    a = pysam.AlignedSegment()
    a.query_name = "read_" + ''.join(random.choice(string.digits) for blah in range(10))
    a.query_sequence=''.join(random.choice(['A','C','G','T']) for blah in range(readLen))
    a.flag = random.randrange(1,4096)
    a.reference_id = chromosomes.index(chromosome)
    a.reference_start = random.randrange(1,chromosome['LN']-readLen)
    a.mapping_quality = 20
    a.cigar = ((0,readLen),)
    a.next_reference_id = a.reference_id
    a.next_reference_start=777
    a.template_length=readLen
    a.query_qualities = pysam.fromQualityString("I"*readLen)
    a.tags = (("NM", 1),("RG", "L1"))
    outfile.write(a)
outfile.close()

Usage:

> VCF2BIN ./path/to/vcf.vcf

Example Output:

@HD     VN:1.0
@SQ     SN:chr1 LN:199141577
@SQ     SN:chr1_GL456210_random LN:172527
@SQ     SN:chr1_GL456211_random LN:235147
@SQ     SN:chr1_GL456212_random LN:152998
@SQ     SN:chr1_GL456221_random LN:209299
@SQ     SN:chr2 LN:184859577
@SQ     SN:chr3 LN:162759764
@SQ     SN:chr4 LN:159434297
...
@SQ     SN:chrUn_GL456396       LN:40612
@SQ     SN:chrUn_JH584304       LN:71575
@SQ     SN:chrX LN:176147421
@SQ     SN:chrX_GL456233_random LN:366350
@SQ     SN:chrY LN:91586428
@SQ     SN:chrY_JH584300_random LN:65102
@SQ     SN:chrY_JH584301_random LN:80372
@SQ     SN:chrY_JH584302_random LN:150326
read_9201514959 1567    chr4_GL456350_random    66052   20      62M     =       778     62      TCGCTCNCNACGAGTATCTCGAGNCGGNGNGATAAAGAATTNNCGGTCCNCTTGCCANCGTN  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NM:i:1  RG:Z:L1
read_9162951900 1957    chrY_JH584302_random    5022    20      73M     =       778     73      GGNCANAACCTTCTNCACTGNTCTAATCGCNCGTANNTTACCTTTTCNCTAAGTANATNCTNTGGCTTCACTA       IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII       NM:i:1  RG:Z:L1
read_0771262576 1928    chr9    41692370        20      47M     =       778     47      TNTCNANTTNTGGTCCNCCGTACAANTGNACAANNNCANTAATNAAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1  RG:Z:L1
read_2008154340 2796    chrUn_GL456383  42319   20      41M     =       778     41      CNTGTNCCGACTCATGCAGGNGATNNTGTATGGGAGCTNCG       IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII       NM:i:1  RG:Z:L1
read_5327352072 2985    chrUn_GL456367  11976   20      48M     =       778     48      CNCACNGCTTNANACTGATCGNNCNCNNNNAGNACAATTNGANAGTNC        IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        NM:i:1  RG:Z:L1
read_2515573039 3633    chr4_JH584295_random    5998    20      74M     =       778     74      TNGTTTCTAGCGCNNTGAGNGAGNNGACATCANANNGCACGCTTGNTAGACANCNAGTGGTGGCCTGGACTCTC      IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII      NM:i:1  RG:Z:L1
read_9171559801 783     chrUn_GL456389  17972   20      61M     =       778     61      CNGGGTNCCNTTAACCCTTNATTTNAANANCATNTTAAGNGGANCGTTCGTNTTTCNAACC   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII   NM:i:1  RG:Z:L1
read_5146518951 909     chr4_JH584295_random    9303    20      41M     =       778     41      GTACANGAGNNTCANNCTACNACNGAGATCCNGGCTTATNA       IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII       NM:i:1  RG:Z:L1
read_5977278670 83      chrUn_GL456385  43659   20      74M     =       778     74      CGNGGNCCCAAATCGTCANCAGCTGGAANGNACAAGNGGTGGANGGTTTGTTTNTGNTATNAAGNTATNGTGNC      IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII      NM:i:1  RG:Z:L1
...

This tool is significantly faster than all other answers provided, because it uses aggressive subsampling of the VCF. In fact it only samples the header. This will not be addressed in the next version, but v2.0 will support parallelism on Casio graphical calculators, and the option to automatically write the output to /dev/null where it belongs.

ADD COMMENTlink modified 11 months ago by RamRS30k • written 5.1 years ago by John12k
1

Finally all my Casio calculators will be put back into use! Thanks.

ADD REPLYlink written 5.1 years ago by h.mon31k
2
gravatar for Shicheng Guo
5.1 years ago by
Shicheng Guo8.4k
Shicheng Guo8.4k wrote:

It depend what you want to use the Bam for. VCF is derived from Bam while VCF lost some information of Bam which include 1) read size 2) read depth. There, if you want to do it, the bam will not contain these two important information. and such Bam can be called Vcf drove simulated Bam file. 

ADD COMMENTlink written 5.1 years ago by Shicheng Guo8.4k

I want to use the derived .bam file in ANGSD for Tajima's D analysis.

ADD REPLYlink written 5.1 years ago by Madhuri Haque20
1
gravatar for bioguy24
5.1 years ago by
bioguy24190
Chicago
bioguy24190 wrote:

To get a vcf derived bam try:

bedToBam -ubam -g file.genome -i snp.vcf  > snp.bam
ADD COMMENTlink modified 5 weeks ago by RamRS30k • written 5.1 years ago by bioguy24190
0
gravatar for Jorge Amigo
5.1 years ago by
Jorge Amigo12k
Santiago de Compostela, Spain
Jorge Amigo12k wrote:

conceptually you can't. a .vcf file contains only variants, and a .bam file contains reads coming from a sequencer, usually aligned against a reference. if you want to get the corresponding .bam files you should be able to get them from the same source you got the .vcf files. if you still want to generate a .bam file containing with the reads that would contain the variants on that .vcf file you can use a few software solutions available, although it will depend a lot on what you are really wanting that .bam file to.

ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by Jorge Amigo12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 844 users visited in the last hour