Question: mm10 SNP/Indel database for realignment around indels
gravatar for clfougner
2.3 years ago by
Institute for Cancer Research, Oslo University Hospital, Norway
clfougner70 wrote:


I'm following the Broad Institute's best practices for variant calling using the Genome Analysis Toolkit, and am having issues with the realignment around indels. Specifically, my issue pertains to the use of the RealignerTargetCreator.

Briefly, my preprocessing so far consists of mapping the reads to the mm10.fasta file (downloaded from UCSC) using BWA-mem, realigning the aligned BAM file by coordinate using Picard's SortSam function, and dedupping with Picard's MarkDuplicates function. I'm using data from FVB/N mice, sequenced using Agilent's Mouse All Exon Kit on an Illumina HiSeq.

In order to realign around indels, the RealignerTargetCreator tool from GATK must first be used (see documentation here). In order to do this I use the following code:

java -jar GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R mm10.fa \
-I input.bam \
-known mgp.v3.indels.rsIDdbSNPv137.vcf \
-o output.list

The "known" file is downloaded from the Sanger FTP, and is aligned to GRCm38, which as far as I understand should be identical to mm10. This however leads to the following error:

##### ERROR MESSAGE: Input files /Users/Christian/Documents/Forskerlinja/DMBA-indusert/Sequencing/ReferenceFiles/mgp.v3.indels.rsIDdbSNPv137.vcf and reference have incompatible contigs. Please see more information. Error details: No overlapping contigs found.
##### ERROR   /Users/Christian/Documents/Forskerlinja/DMBA-indusert/Sequencing/ReferenceFiles/mgp.v3.indels.rsIDdbSNPv137.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, X]
##### ERROR   reference contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr1_GL456210_random, chr1_GL456211_random, chr1_GL456212_random, chr1_GL456213_random, chr1_GL456221_random, chr2, chr3, chr4, chr4_GL456216_random, chr4_JH584292_random, chr4_GL456350_random, chr4_JH584293_random, chr4_JH584294_random, chr4_JH584295_random, chr5, chr5_JH584296_random, chr5_JH584297_random, chr5_JH584298_random, chr5_GL456354_random, chr5_JH584299_random, chr6, chr7, chr7_GL456219_random, chr8, chr9, chrM, chrX, chrX_GL456233_random, chrY, chrY_JH584300_random, chrY_JH584301_random, chrY_JH584302_random, chrY_JH584303_random, chrUn_GL456239, chrUn_GL456367, chrUn_GL456378, chrUn_GL456381, chrUn_GL456382, chrUn_GL456383, chrUn_GL456385, chrUn_GL456390, chrUn_GL456392, chrUn_GL456393, chrUn_GL456394, chrUn_GL456359, chrUn_GL456360, chrUn_GL456396, chrUn_GL456372, chrUn_GL456387, chrUn_GL456389, chrUn_GL456370, chrUn_GL456379, chrUn_GL456366, chrUn_GL456368, chrUn_JH584304]

So in order to fix this, I've considered two possible solutions. The GATK website recommends doing a liftover using Picard's LiftoverVCF; I don't see this as being possible, seeing as it's already meant to be using the correct reference genome. Second, I've tried using the method suggested by Ashutosh Pandey, here; as far as I can tell this however simply appends "chr" to the chromosome number, but it doesn't remedy the issue with the chrUn_xxxxx(_random) chromosomes.

So, does anyone have any recommendations as to how I can get the indel file to work with sequence data aligned to mm10?


Note: I've also considered using the FVB/N specific list of Indels (and SNPs) found here, which would likely be sufficient for my purposes. It is however aligned to mm9, and I have tried, unsuccessfully, to use Picard's LiftoverVcf and the "mm9ToMm10.over.chain" file, and the mm10.fa reference. While this doesn't give me an error per se, these are the results of the liftover:

INFO    2016-03-22 22:05:46 LiftoverVcf Processed 5137790 variants.
INFO    2016-03-22 22:05:46 LiftoverVcf 5137790 variants failed to liftover.
INFO    2016-03-22 22:05:46 LiftoverVcf 0 variants lifted over but had mismatching reference alleles after lift over.
INFO    2016-03-22 22:05:46 LiftoverVcf 100.0000% of variants were not successfully lifted over and written to the output.

If anyone believes pursuing this approach would be better, I'd be happy to hear any suggestions as to why the Liftover may not be working.

gatk dbsnp sequencing mouse exome • 1.8k views
ADD COMMENTlink modified 2.3 years ago by John12k • written 2.3 years ago by clfougner70
gravatar for John
2.3 years ago by
John12k wrote:

Try the following:

wget -O ./mm10.INDELS.dbSNP142.vcf.gz

gunzip ./mm10.INDELS.dbSNP142.vcf.gz

awk '{ if($0 !~ /^#/) print "chr"$0; else if(match($0,/(##contig=<ID=)(.*) ,m))="" print="" m[1]"chr"m[2];="" else="" print="" $0="" }'="" mm10.INDELS.dbSNP142.vcf=""> mm10.INDELS.vcf

java -jar /path/to/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /path/to/mm10.fa -I /path/to/data.bam -known ./mm10.INDELS.vcf -o /path/to/data.realignment.targets

If that seems like suspiciously specific advice, its because I did exactly this just the other day and logged it all on The first node has event ID fWRhFQ if you want to follow along and see all the downstream steps after that :) enter image description here

Here is the commands i used to make mm10.fa:

wget -O ./mm10.2bit 
wget -O ./twoBitToFa 
./twoBitToFa /ref/mm10.2bit /ref/mm10.fa
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by John12k

Thanks so much for the answer! Unfortunately, when I run this code:

awk '{ if($0 !~ /^#/) print "chr"$0; else if(match($0,/(##contig=<ID=)(.*) ,m))="" print="" m[1]"chr"m[2];="" else="" print="" $0="" }'="" mm10.INDELS.dbSNP142.vcf=""> mm10.INDELS.vcf

I get the following error:

awk: non-terminated regular expression (##contig=... at source line 1
 context is
     >>>  <<< 
awk: syntax error in regular expression (##contig=<ID=)(.*) ,m))="" print="" m[1]"chr"m[2];="" else="" print="" $0="" }= at )="" print="" m[1]"chr"m[2];="" else="" print="" $0="" }=
 source line number 1

I'm looking into the issue, but I'm unfamiliar with the 'awk' command, so I haven't figured it out yet. Any idea what the issue might be?

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by clfougner70

Hm, somehow the site's formatting makes that awk command go crazy. Here it is again, outside of the code block:

awk '{ if($0 !~ /^#/) print "chr"$0; else if(match($0,/(##contig=<id=)(.*) ,m))="" print="" m[1]"chr"m[2];="" else="" print="" $0="" }'="" mm10.indels.dbsnp142.vcf="" &gt;="" mm10.indels.vcf<="" p="">

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by John12k

OK well that didn't fix it, and more importantly the content changed even though it wasnt in a code block this time. Here's the "raw" command: if i needed another reason to avoid awk -_-

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by John12k

Hmm, I'm still getting an error when I try running it:

awk: syntax error at source line 1
 context is
    { if($0 !~ /^#/) print "chr"$0; else >>>  if(match($0,/(##contig=<ID=)(.*)/, <<< 
awk: illegal statement at source line 1
awk: illegal statement at source line 1

Any other ideas? Maybe take a screenshot, and I can type it myself? Thanks again for the help!!

ADD REPLYlink written 2.3 years ago by clfougner70

I don't know why it isn't working man, sorry :(

enter image description here


ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by John12k

Success! Don't know what the issue was, but I ran the code from paste.ofcode on my lab's server (instead of my own computer) and it worked, so it must have been something with my runtime environment. Trying the RealignerTargetCreator now and it seems to be working. Thanks a lot!

ADD REPLYlink written 2.3 years ago by clfougner70
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2071 users visited in the last hour