finding compatible version of reference genome for use of GATK
4
2
Entering edit mode
2.4 years ago

I am mapping a series of fastq files, containing ChiP-Seq data, with bowtie2 (my mapping software of choice) to the GRCh37 genome: https://www.gencodegenes.org/human/release_39lift37.html . I have mapped as follows:

bwa mem GRCh37.primary_assembly.genome.fa ${srr}.fastq.gz | samtools view -S -b -q 10 - > ${srr}.bam

I then create the fasta file for the bt2 reference and index it:

samtools faidx GRCh37.primary_assembly.genome.fa
java -jar ~/picard.jar CreateSequenceDictionary -R GRCh37.primary_assembly.genome.fa

However, when I then try to call variants in this data I get an error:

java -jar ~/picard.jar MarkDuplicates -I ${srr}_sorted.bam -M metrics.txt -O ${srr}_sorted_dedup.bam
gatk --java-options "-Xms10G -Xmx10G -XX:ParallelGCThreads=2" BaseRecalibrator \
    -I ${srr}_sorted_dedup.bam \
    -R GRCh37.primary_assembly.genome.fa \
    -O ${srr}_sorted_dedup.report \
    --known-sites variants.vcf

The marking of the duplicates part seems to work fine but for the recalibrating the bases part the contigs do not seem to match up:

A USER ERROR has occurred: Input files reference and features have incompatible contigs: Found 

contigs with the same name but different lengths:
  contig reference = 10 / 135534747
  contig features = 10 / 135284542.
   reference contigs = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1, 20, 21, 22, 2, 3, 4, 5, 6, 7, 8, 9, MT, X, Y]
   features contigs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, XY, M]

I have read that the vcf file that I am using was made according to GRCh37: https://www.synapse.org/#!Synapse:syn10901595

I downloaded the plink files from https://www.synapse.org/#!Synapse:syn17008939 and made the vcf file as follows:

plink --bfile ROSMAP_data --recode vcf --output-chr M 

Hence I am wondering how the vcf file can mismatch the ref genome when they are the same- has anyone experienced this problem before?

Also just check that the ref genome should not be b37 does anyone know of a link to the b37 fasta file? The following link which I tried- leads to a NO PERMISSION page for me:

https://console.cloud.google.com/storage/browser/_details/broad-references/hg19/v0/Homo_sapiens_assembly19.fasta;tab=live_object 
bowtie2 GATK • 2.0k views
ADD COMMENT
1
Entering edit mode
ADD REPLY
2
Entering edit mode
2.4 years ago

I solved this problem by finding the correct liftover chain file to convert my vcf/known-sites file to a version that was compatible with my reference genome. The README that described the making of this vcf file states that it was created using the GRCh37 genome as the reference- when in fact it was the b37 genome- hence:

java -jar picard.jar LiftoverVcf \\
 I=variants.vcf \\
 O=lifted_over.vcf \\
 CHAIN=b37tohg19.chain \\
 REJECT=rejected_variants.vcf \\
 R=GRCh37.primary_assembly.genome.fa

And then the base recalibration step worked.

NOTE: The chain file is downloaded from https://github.com/broadgsa/gatk/tree/master/public/chainFiles

ADD COMMENT
1
Entering edit mode
2.4 years ago

the reference/bam file have a different sequence dictionary than the VCF file ( for example there is chromosome "M" vs "MT" ). Both files should use the very same dictionary/reference.

https://www.google.com/search?q=%22Input+files+reference+and+features+have+incompatible+contigs%22

ADD COMMENT
0
Entering edit mode
2.4 years ago

To avoid issues due to incompatibility between files when using GATK tools, download the resource files (e.g. genome FASTA/ dict) from https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle.

ADD COMMENT
0
Entering edit mode
2.4 years ago
Divon ▴ 230

You could use Genozip (of which I am the author) to update the VCF file's CHROMs to match those of the reference file:

genozip --match-chrom-to-reference --reference <ref-file> myfile.vcf
genocat myfile.vcf.genozip

See: https://genozip.com/genozip.html and https://genozip.com/vcf.html

ADD COMMENT

Login before adding your answer.

Traffic: 1425 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6