Question: Convert specific regions in bam file to vcf and annotating the vcf
gravatar for komal.rathi
3.9 years ago by
Children's Hospital of Philadelphia, Philadelphia, PA
komal.rathi3.6k wrote:


I have a couple of bam files from exome sequencing and a list of regions that I want to perform variant calling and variant annotation on. My goal is to look for evidence of mutations in those regions that go beyond canonical LOF protein coding mutations.

I am thinking of extracting the regions of interest from a bam, converting to a vcf and then annotating it using snpEff or something like that. The following is my commandline:

# add read groups 
picard AddOrReplaceReadGroups I=sample.bam O=sample.fixed.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=20

# run gatk 
java GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta
-I sample.fixed.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o sample.gvcf

# run snpeff 
snpeff -c snpEff.GRCh37.config -ud 10 -classic GRCh37.75 sample.gvcf > sample.snpeff.gvcf

Q1. Is my approach correct? Is this approach applicable to WGS and RNASeq as well?

My vcf file generated in this manner has missing ALT info - it says <non_ref>. How can I fix this?

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  20
chr4    54243811        .       C       <NON_REF>       .       .       END=54243904    GT:DP:GQ:MIN_DP:PL      0/0:208:99:129:0,120,1800
chr4    54243905        .       T       C,<NON_REF>     10162.77        .       DP=280;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.94       GT:AD:DP:GQ:P
L:SB       1/1:0,268,0:268:99:10191,805,0,10191,805,10191:0,0,160,108
chr4    54243906        .       C       <NON_REF>       .       .       END=54244236    GT:DP:GQ:MIN_DP:PL      0/0:170:99:37:0,99,1485
chr4    54244237        .       T       <NON_REF>       .       .       END=54244238    GT:DP:GQ:MIN_DP:PL      0/0:36:90:35:0,90,1350
chr4    54244239        .       C       <NON_REF>       .       .       END=54244241    GT:DP:GQ:MIN_DP:PL      0/0:34:72:34:0,72,1080
chr4    54244242        .       G       <NON_REF>       .       .       END=54244244    GT:DP:GQ:MIN_DP:PL      0/0:33:63:33:0,63,945
chr4    54244245        .       C       <NON_REF>       .       .       END=54244247    GT:DP:GQ:MIN_DP:PL      0/0:41:78:39:0,78,1170

Q2. Also, can someone suggest a better approach for creating a vcf for just specific regions of bam file?

subset sequencing bam vcf exome • 1.2k views
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by komal.rathi3.6k

Why don't you call all SNPs and extract SNPs overlapping regions of interest using intersectBed ?

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by geek_y11k

I tried the method below (adding -L parameter to GATK) and it is really fast compared to calling all the SNPs and then intersecting.

ADD REPLYlink written 3.9 years ago by komal.rathi3.6k
gravatar for WouterDeCoster
3.9 years ago by
WouterDeCoster44k wrote:

You can use the -L flag of HaplotypeCaller to limit the intervals in which variants are detected as specified by a bed file.

--intervals / -L One or more genomic intervals over which to operate Use this option to perform the analysis over only part of the genome. This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the command line (e.g. -L chr1 or -L chr1:100-200) or by loading in a file containing a list of intervals (e.g. -L myFile.intervals). Additionally, you can also specify a ROD file (such as a VCF file) in order to perform the analysis at specific positions based on the records present in the file (e.g. -L file.vcf). Finally, you can also use this to perform the analysis on the reads that are completely unmapped in the BAM file (i.e. those without a reference contig) by specifying -L unmapped.


ADD COMMENTlink written 3.9 years ago by WouterDeCoster44k

Thank you! That worked and is quite fast because I was only looking at a small number of genes.

ADD REPLYlink written 3.9 years ago by komal.rathi3.6k

This link doesn't work

ADD REPLYlink written 19 months ago by Mehulsharma.25310
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 811 users visited in the last hour