Question: GATK BaseRecalibrator Step
0
gravatar for Seigfried
6 months ago by
Seigfried70
Seigfried70 wrote:

Hello

I have been struggling with GATK 4 and feeling an acute lack of the command line pipelines available in previous versions. I took a look at the Tool Index and tried to make out the commands available from the https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels/blob/master/gatk4-rna-best-practices.wdl section

I have RNASeq data and I have used a GrCh38.p13 genomic reference not in the GATK resource bucket and currently I have processed around 600 samples with this same reference and followed till the Split n Cigar step with no problem

In the BaseRecalibrator step I get the following error :

a) GATK version

The Genome Analysis Toolkit (GATK) v4.1.7.0 HTSJDK Version: 2.21.2 Picard Version: 2.21.9

b) Exact GATK commands used

GATK BaseRecalibrator -I input.bam -R reference.fasta --known-sites resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf --known-sites resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf --known-sites resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf --known-sites resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf -O recal_data.table

c) The entire error log if applicable.

A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found. reference contigs = [NC_000001.11, NT_187361.1, NT_187362.1, NT_187363.1, NT_187364.1, NT_187365.1, NT_187366.1, NT_187367.1, NT_187368.1, NT_187369.1, NC_000002.12, NT_187370.1, NT_187371.1, NC_000003.12, NT_167215.1, NC_000004.12, NT_113793.3

I downloaded these files from the Resource Bundle. Could you please tell me why they aren't working. Since re-mapping at this stage is not preferable and I am sure I am using the same reference as I used for all steps before this.

Please help. Thank you very much

gatk • 566 views
ADD COMMENTlink written 6 months ago by Seigfried70

and I have used a GrCh38.p13 genomic reference not in the GATK resource bucket

Yep, there you have it. Your reference is NCBI nomenclature (chromosome names) and your bundle is not. Either remap the files with a different reference, or try to rename the chromosomes in every of the BAM files or try to find a way to convert the bundle resources to the nomenclature of the bundle.

ADD REPLYlink written 6 months ago by ATpoint41k

So when the error says that the feature contigs are different

features contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM, chr1_KI270706v1_random, chr1_KI270707v1_random, chr1_KI270708v1_random, chr1_KI270709v1_random, chr1_KI270710v1_random, .....

I thought dbSNP was from NCBI. So atleast that should have worked?

ADD REPLYlink written 6 months ago by Seigfried70

reference contigs = [NC_000001.11, NT_187361.1, NT_187362.1, NT_187363.1, NT_187364.1, NT_187365.1, NT_187366.1, NT_187367.1, NT_187368.1, NT_187369.1, NC_000002.12, NT_187370.1, NT_187371.1, NC_000003.12, NT_167215.1, NC_000004.12, NT_113793.3

These are your contig names. Don't ask me why NCBI thought that distributing reference genomes with these odd NT_... was a good idea, I personally find it terrible, but this is what you have.

ADD REPLYlink written 6 months ago by ATpoint41k

@ATpoint Thank you for the replies. I understand the issue and I am feeling hopeless. Is there any way at all without remapping? Any tool I can use. As of this moment I don't even know which contig in Reference == which contig in Features.

ADD REPLYlink written 6 months ago by Seigfried70

To be sure, can you from one of the BAM file post the output of samtools idxstats that.bam? Don't worry, we find a solution.

ADD REPLYlink modified 6 months ago • written 6 months ago by ATpoint41k
samtools idxstats SRR7621519_splitncigar.bam
NC_000001.11    248956422       164233  0
NT_187361.1     175055  9       0
NT_187362.1     32032   0       0
NT_187363.1     127682  0       0
...
NT_187686.1     215732  0       0
NT_187687.1     170537  0       0
NT_113949.2     177381  0       0
NC_012920.1     16569   44480   0
*       0       0       0
ADD REPLYlink written 6 months ago by Seigfried70
1

The conversion table you need is probably this one here https://github.com/dpryan79/ChromosomeMappings/blob/master/GRCh38_RefSeq2UCSC.txt but I am not sure there is a command line tool for this which does not involve custom awk or similar. I will ask in our Slack if someone has a good idea.

ADD REPLYlink written 6 months ago by ATpoint41k

Thank you for your help! I will try to find if I can find a way to replace in BAM files. Thanks a lot!

ADD REPLYlink written 6 months ago by Seigfried70
1

An "official" NCBI assembly report for hg38 is located here:

ftp://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt

There are two columns in this file that are probably relevant to your question: RefSeq-Accn and UCSC-style-name. You can use these as a key-value pairing for lookups.

For example, NC_000001.11 maps to chr1, NT_187361.1 maps to chr1_KI270706v1_random, and so on.

You could pass this file into a Python or Perl script to map these names, also passing in your BAM as SAM (text). Reheader or rewrite by a map lookup, and then convert back from SAM to BAM.

ADD REPLYlink modified 6 months ago • written 6 months ago by Alex Reynolds31k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1226 users visited in the last hour