mm10 SNP/Indel database for realignment around indels
1
2
Entering edit mode
5.1 years ago
clfougner ▴ 70

Hi,

I'm following the Broad Institute's best practices for variant calling using the Genome Analysis Toolkit, and am having issues with the realignment around indels. Specifically, my issue pertains to the use of the RealignerTargetCreator.

Briefly, my preprocessing so far consists of mapping the reads to the mm10.fasta file (downloaded from UCSC) using BWA-mem, realigning the aligned BAM file by coordinate using Picard's SortSam function, and dedupping with Picard's MarkDuplicates function. I'm using data from FVB/N mice, sequenced using Agilent's Mouse All Exon Kit on an Illumina HiSeq.

In order to realign around indels, the RealignerTargetCreator tool from GATK must first be used (see documentation here). In order to do this I use the following code:

java -jar GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R mm10.fa \
-I input.bam \
-known mgp.v3.indels.rsIDdbSNPv137.vcf \
-o output.list


The "known" file is downloaded from the Sanger FTP, and is aligned to GRCm38, which as far as I understand should be identical to mm10. This however leads to the following error:

##### ERROR MESSAGE: Input files /Users/Christian/Documents/Forskerlinja/DMBA-indusert/Sequencing/ReferenceFiles/mgp.v3.indels.rsIDdbSNPv137.vcf and reference have incompatible contigs. Please see http://gatkforums.broadinstitute.org/discussion/63/input-files-have-incompatible-contigsfor more information. Error details: No overlapping contigs found.
##### ERROR   /Users/Christian/Documents/Forskerlinja/DMBA-indusert/Sequencing/ReferenceFiles/mgp.v3.indels.rsIDdbSNPv137.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, X]
##### ERROR   reference contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr1_GL456210_random, chr1_GL456211_random, chr1_GL456212_random, chr1_GL456213_random, chr1_GL456221_random, chr2, chr3, chr4, chr4_GL456216_random, chr4_JH584292_random, chr4_GL456350_random, chr4_JH584293_random, chr4_JH584294_random, chr4_JH584295_random, chr5, chr5_JH584296_random, chr5_JH584297_random, chr5_JH584298_random, chr5_GL456354_random, chr5_JH584299_random, chr6, chr7, chr7_GL456219_random, chr8, chr9, chrM, chrX, chrX_GL456233_random, chrY, chrY_JH584300_random, chrY_JH584301_random, chrY_JH584302_random, chrY_JH584303_random, chrUn_GL456239, chrUn_GL456367, chrUn_GL456378, chrUn_GL456381, chrUn_GL456382, chrUn_GL456383, chrUn_GL456385, chrUn_GL456390, chrUn_GL456392, chrUn_GL456393, chrUn_GL456394, chrUn_GL456359, chrUn_GL456360, chrUn_GL456396, chrUn_GL456372, chrUn_GL456387, chrUn_GL456389, chrUn_GL456370, chrUn_GL456379, chrUn_GL456366, chrUn_GL456368, chrUn_JH584304]


So in order to fix this, I've considered two possible solutions. The GATK website recommends doing a liftover using Picard's LiftoverVCF; I don't see this as being possible, seeing as it's already meant to be using the correct reference genome. Second, I've tried using the method suggested by Ashutosh Pandey, here; as far as I can tell this however simply appends "chr" to the chromosome number, but it doesn't remedy the issue with the chrUn_xxxxx(_random) chromosomes.

So, does anyone have any recommendations as to how I can get the indel file to work with sequence data aligned to mm10?

Thanks!

Note: I've also considered using the FVB/N specific list of Indels (and SNPs) found here, which would likely be sufficient for my purposes. It is however aligned to mm9, and I have tried, unsuccessfully, to use Picard's LiftoverVcf and the "mm9ToMm10.over.chain" file, and the mm10.fa reference. While this doesn't give me an error per se, these are the results of the liftover:

INFO    2016-03-22 22:05:46 LiftoverVcf Processed 5137790 variants.
INFO    2016-03-22 22:05:46 LiftoverVcf 5137790 variants failed to liftover.
INFO    2016-03-22 22:05:46 LiftoverVcf 0 variants lifted over but had mismatching reference alleles after lift over.
INFO    2016-03-22 22:05:46 LiftoverVcf 100.0000% of variants were not successfully lifted over and written to the output.


If anyone believes pursuing this approach would be better, I'd be happy to hear any suggestions as to why the Liftover may not be working.

mouse exome sequencing dbSNP GATK • 4.3k views
1
Entering edit mode
5.1 years ago
John 13k

Try the following:

wget -O ./mm10.INDELS.dbSNP142.vcf.gz ftp://ftp-mouse.sanger.ac.uk/current_indels/strain_specific_vcfs/C57BL_6NJ.mgp.v5.indels.dbSNP142.normed.vcf.gz

gunzip ./mm10.INDELS.dbSNP142.vcf.gz

awk '{ if($0 !~ /^#/) print "chr"$0; else if(match($0,/(##contig=<ID=)(.*) ,m))="" print="" m[1]"chr"m[2];="" else="" print=""$0="" }'="" mm10.INDELS.dbSNP142.vcf=""> mm10.INDELS.vcf

java -jar /path/to/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /path/to/mm10.fa -I /path/to/data.bam -known ./mm10.INDELS.vcf -o /path/to/data.realignment.targets


If that seems like suspiciously specific advice, its because I did exactly this just the other day and logged it all on log.bio: The first node has event ID fWRhFQ if you want to follow along and see all the downstream steps after that :)

Here is the commands i used to make mm10.fa:

wget -O ./mm10.2bit hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit
./twoBitToFa /ref/mm10.2bit /ref/mm10.fa

1
Entering edit mode

Thanks so much for the answer! Unfortunately, when I run this code:

awk '{ if($0 !~ /^#/) print "chr"$0; else if(match($0,/(##contig=<ID=)(.*) ,m))="" print="" m[1]"chr"m[2];="" else="" print=""$0="" }'="" mm10.INDELS.dbSNP142.vcf=""> mm10.INDELS.vcf


I get the following error:

awk: non-terminated regular expression (##contig=... at source line 1
context is
>>>  <<<
awk: syntax error in regular expression (##contig=<ID=)(.*) ,m))="" print="" m[1]"chr"m[2];="" else="" print="" $0="" }= at )="" print="" m[1]"chr"m[2];="" else="" print=""$0="" }=
source line number 1


I'm looking into the issue, but I'm unfamiliar with the 'awk' command, so I haven't figured it out yet. Any idea what the issue might be?

0
Entering edit mode

Hm, somehow the site's formatting makes that awk command go crazy. Here it is again, outside of the code block:

awk '{ if($0 !~ /^#/) print "chr"$0; else if(match($0,/(##contig=<id=)(.*) ,m))="" print="" m[1]"chr"m[2];="" else="" print=""$0="" }'="" mm10.indels.dbsnp142.vcf="" &gt;="" mm10.indels.vcf<="" p="">

1
Entering edit mode

OK well that didn't fix it, and more importantly the content changed even though it wasnt in a code block this time. Here's the "raw" command: http://paste.ofcode.org/m9HQPEDfsUkfQzYgk5uGYQ

...as if i needed another reason to avoid awk -_-

0
Entering edit mode

Hmm, I'm still getting an error when I try running it:

awk: syntax error at source line 1
context is
{ if($0 !~ /^#/) print "chr"$0; else >>>  if(match(\$0,/(##contig=<ID=)(.*)/, <<<
awk: illegal statement at source line 1
awk: illegal statement at source line 1


Any other ideas? Maybe take a screenshot, and I can type it myself? Thanks again for the help!!

1
Entering edit mode

I don't know why it isn't working man, sorry :(

Fullscreen: http://i.imgur.com/HvxB7ji.png

1
Entering edit mode

Success! Don't know what the issue was, but I ran the code from paste.ofcode on my lab's server (instead of my own computer) and it worked, so it must have been something with my runtime environment. Trying the RealignerTargetCreator now and it seems to be working. Thanks a lot!

0
Entering edit mode

Another issue is the version of awk/mawk/gawk.

For me, gawk works, mawk does not.