I am mapping a series of fastq files, containing ChiP-Seq data, with bowtie2 (my mapping software of choice) to the GRCh37 genome: https://www.gencodegenes.org/human/release_39lift37.html
. I have mapped as follows:
bwa mem GRCh37.primary_assembly.genome.fa ${srr}.fastq.gz | samtools view -S -b -q 10 - > ${srr}.bam
I then create the fasta file for the bt2 reference and index it:
samtools faidx GRCh37.primary_assembly.genome.fa
java -jar ~/picard.jar CreateSequenceDictionary -R GRCh37.primary_assembly.genome.fa
However, when I then try to call variants in this data I get an error:
java -jar ~/picard.jar MarkDuplicates -I ${srr}_sorted.bam -M metrics.txt -O ${srr}_sorted_dedup.bam
gatk --java-options "-Xms10G -Xmx10G -XX:ParallelGCThreads=2" BaseRecalibrator \
-I ${srr}_sorted_dedup.bam \
-R GRCh37.primary_assembly.genome.fa \
-O ${srr}_sorted_dedup.report \
--known-sites variants.vcf
The marking of the duplicates part seems to work fine but for the recalibrating the bases part the contigs do not seem to match up:
A USER ERROR has occurred: Input files reference and features have incompatible contigs: Found
contigs with the same name but different lengths:
contig reference = 10 / 135534747
contig features = 10 / 135284542.
reference contigs = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1, 20, 21, 22, 2, 3, 4, 5, 6, 7, 8, 9, MT, X, Y]
features contigs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, XY, M]
I have read that the vcf file that I am using was made according to GRCh37: https://www.synapse.org/#!Synapse:syn10901595
I downloaded the plink files from https://www.synapse.org/#!Synapse:syn17008939 and made the vcf file as follows:
plink --bfile ROSMAP_data --recode vcf --output-chr M
Hence I am wondering how the vcf file can mismatch the ref genome when they are the same- has anyone experienced this problem before?
Also just check that the ref genome should not be b37 does anyone know of a link to the b37 fasta file? The following link which I tried- leads to a NO PERMISSION page for me:
https://console.cloud.google.com/storage/browser/_details/broad-references/hg19/v0/Homo_sapiens_assembly19.fasta;tab=live_object
You can get GRCh37 files from: https://www.gencodegenes.org/human/release_39lift37.html