Question: mm10 SNP/Indel database for realignment around indels
1
gravatar for clfougner
20 months ago by
clfougner60
Institute for Cancer Research, Oslo University Hospital, Norway
clfougner60 wrote:

Hi,

I'm following the Broad Institute's best practices for variant calling using the Genome Analysis Toolkit, and am having issues with the realignment around indels. Specifically, my issue pertains to the use of the RealignerTargetCreator.

Briefly, my preprocessing so far consists of mapping the reads to the mm10.fasta file (downloaded from UCSC) using BWA-mem, realigning the aligned BAM file by coordinate using Picard's SortSam function, and dedupping with Picard's MarkDuplicates function. I'm using data from FVB/N mice, sequenced using Agilent's Mouse All Exon Kit on an Illumina HiSeq.

In order to realign around indels, the RealignerTargetCreator tool from GATK must first be used (see documentation here). In order to do this I use the following code:

java -jar GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R mm10.fa \
-I input.bam \
-known mgp.v3.indels.rsIDdbSNPv137.vcf \
-o output.list

The "known" file is downloaded from the Sanger FTP, and is aligned to GRCm38, which as far as I understand should be identical to mm10. This however leads to the following error:

##### ERROR MESSAGE: Input files /Users/Christian/Documents/Forskerlinja/DMBA-indusert/Sequencing/ReferenceFiles/mgp.v3.indels.rsIDdbSNPv137.vcf and reference have incompatible contigs. Please see http://gatkforums.broadinstitute.org/discussion/63/input-files-have-incompatible-contigsfor more information. Error details: No overlapping contigs found.
##### ERROR   /Users/Christian/Documents/Forskerlinja/DMBA-indusert/Sequencing/ReferenceFiles/mgp.v3.indels.rsIDdbSNPv137.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, X]
##### ERROR   reference contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr1_GL456210_random, chr1_GL456211_random, chr1_GL456212_random, chr1_GL456213_random, chr1_GL456221_random, chr2, chr3, chr4, chr4_GL456216_random, chr4_JH584292_random, chr4_GL456350_random, chr4_JH584293_random, chr4_JH584294_random, chr4_JH584295_random, chr5, chr5_JH584296_random, chr5_JH584297_random, chr5_JH584298_random, chr5_GL456354_random, chr5_JH584299_random, chr6, chr7, chr7_GL456219_random, chr8, chr9, chrM, chrX, chrX_GL456233_random, chrY, chrY_JH584300_random, chrY_JH584301_random, chrY_JH584302_random, chrY_JH584303_random, chrUn_GL456239, chrUn_GL456367, chrUn_GL456378, chrUn_GL456381, chrUn_GL456382, chrUn_GL456383, chrUn_GL456385, chrUn_GL456390, chrUn_GL456392, chrUn_GL456393, chrUn_GL456394, chrUn_GL456359, chrUn_GL456360, chrUn_GL456396, chrUn_GL456372, chrUn_GL456387, chrUn_GL456389, chrUn_GL456370, chrUn_GL456379, chrUn_GL456366, chrUn_GL456368, chrUn_JH584304]

So in order to fix this, I've considered two possible solutions. The GATK website recommends doing a liftover using Picard's LiftoverVCF; I don't see this as being possible, seeing as it's already meant to be using the correct reference genome. Second, I've tried using the method suggested by Ashutosh Pandey, here; as far as I can tell this however simply appends "chr" to the chromosome number, but it doesn't remedy the issue with the chrUn_xxxxx(_random) chromosomes.

So, does anyone have any recommendations as to how I can get the indel file to work with sequence data aligned to mm10?

Thanks!

Note: I've also considered using the FVB/N specific list of Indels (and SNPs) found here, which would likely be sufficient for my purposes. It is however aligned to mm9, and I have tried, unsuccessfully, to use Picard's LiftoverVcf and the "mm9ToMm10.over.chain" file, and the mm10.fa reference. While this doesn't give me an error per se, these are the results of the liftover:

INFO    2016-03-22 22:05:46 LiftoverVcf Processed 5137790 variants.
INFO    2016-03-22 22:05:46 LiftoverVcf 5137790 variants failed to liftover.
INFO    2016-03-22 22:05:46 LiftoverVcf 0 variants lifted over but had mismatching reference alleles after lift over.
INFO    2016-03-22 22:05:46 LiftoverVcf 100.0000% of variants were not successfully lifted over and written to the output.

If anyone believes pursuing this approach would be better, I'd be happy to hear any suggestions as to why the Liftover may not be working.

gatk dbsnp sequencing mouse exome • 1.2k views
ADD COMMENTlink modified 20 months ago by John12k • written 20 months ago by clfougner60
1
gravatar for John
20 months ago by
John12k
Germany
John12k wrote:

Try the following:

wget -O ./mm10.INDELS.dbSNP142.vcf.gz ftp://ftp-mouse.sanger.ac.uk/current_indels/strain_specific_vcfs/C57BL_6NJ.mgp.v5.indels.dbSNP142.normed.vcf.gz

gunzip ./mm10.INDELS.dbSNP142.vcf.gz

awk '{ if($0 !~ /^#/) print "chr"$0; else if(match($0,/(##contig=<ID=)(.*) ,m))="" print="" m[1]"chr"m[2];="" else="" print="" $0="" }'="" mm10.INDELS.dbSNP142.vcf=""> mm10.INDELS.vcf

java -jar /path/to/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /path/to/mm10.fa -I /path/to/data.bam -known ./mm10.INDELS.vcf -o /path/to/data.realignment.targets

If that seems like suspiciously specific advice, its because I did exactly this just the other day and logged it all on log.bio: The first node has event ID fWRhFQ if you want to follow along and see all the downstream steps after that :) enter image description here

Here is the commands i used to make mm10.fa:

wget -O ./mm10.2bit hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit 
wget -O ./twoBitToFa hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa 
./twoBitToFa /ref/mm10.2bit /ref/mm10.fa
ADD COMMENTlink modified 20 months ago • written 20 months ago by John12k
1

Thanks so much for the answer! Unfortunately, when I run this code:

awk '{ if($0 !~ /^#/) print "chr"$0; else if(match($0,/(##contig=<ID=)(.*) ,m))="" print="" m[1]"chr"m[2];="" else="" print="" $0="" }'="" mm10.INDELS.dbSNP142.vcf=""> mm10.INDELS.vcf

I get the following error:

awk: non-terminated regular expression (##contig=... at source line 1
 context is
     >>>  <<< 
awk: syntax error in regular expression (##contig=<ID=)(.*) ,m))="" print="" m[1]"chr"m[2];="" else="" print="" $0="" }= at )="" print="" m[1]"chr"m[2];="" else="" print="" $0="" }=
 source line number 1

I'm looking into the issue, but I'm unfamiliar with the 'awk' command, so I haven't figured it out yet. Any idea what the issue might be?

ADD REPLYlink modified 20 months ago • written 20 months ago by clfougner60

Hm, somehow the site's formatting makes that awk command go crazy. Here it is again, outside of the code block:

awk '{ if($0 !~ /^#/) print "chr"$0; else if(match($0,/(##contig=<id=)(.*) ,m))="" print="" m[1]"chr"m[2];="" else="" print="" $0="" }'="" mm10.indels.dbsnp142.vcf="" &gt;="" mm10.indels.vcf<="" p="">

ADD REPLYlink modified 20 months ago • written 20 months ago by John12k
1

OK well that didn't fix it, and more importantly the content changed even though it wasnt in a code block this time. Here's the "raw" command: http://paste.ofcode.org/m9HQPEDfsUkfQzYgk5uGYQ

...as if i needed another reason to avoid awk -_-

ADD REPLYlink modified 20 months ago • written 20 months ago by John12k

Hmm, I'm still getting an error when I try running it:

awk: syntax error at source line 1
 context is
    { if($0 !~ /^#/) print "chr"$0; else >>>  if(match($0,/(##contig=<ID=)(.*)/, <<< 
awk: illegal statement at source line 1
awk: illegal statement at source line 1

Any other ideas? Maybe take a screenshot, and I can type it myself? Thanks again for the help!!

ADD REPLYlink written 20 months ago by clfougner60
1

I don't know why it isn't working man, sorry :(

enter image description here

Fullscreen: http://i.imgur.com/HvxB7ji.png

ADD REPLYlink modified 20 months ago • written 20 months ago by John12k
1

Success! Don't know what the issue was, but I ran the code from paste.ofcode on my lab's server (instead of my own computer) and it worked, so it must have been something with my runtime environment. Trying the RealignerTargetCreator now and it seems to be working. Thanks a lot!

ADD REPLYlink written 20 months ago by clfougner60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1380 users visited in the last hour