How to use ExAC/gnomAD without chrN_random?
2
0
Entering edit mode
4.4 years ago
dodausp ▴ 180

Hi,

I am trying to estimate my sample contamination based on gatk's GetPileupSummaries / CalculateContamination best practices. This pipeline requires a .vcf file containing common germline variant sites, which here I am resourcing from the ExAC database (now migrated to gnomAD). However, when using GetPileupSummaries, I get the following error:

A USER ERROR has occurred: Input files reads and features have incompatible contigs: Found contigs with the same name but different lengths:
contig reads = chrM / 16569
contig features = chrM / 16571.

reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
features contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl000206_random, chrUn_gl000232, chrUn_gl000234, chr11_gl000202_random, chrUn_gl000238, chrUn_gl000244, chrUn_gl000248, chr8_gl000196_random, chrUn_gl000249, chrUn_gl000246, chr17_gl000203_random, chr8_gl000197_random, chrUn_gl000245, chrUn_gl000247, chr9_gl000201_random, chrUn_gl000235, chrUn_gl000239, chr21_gl000210_random, chrUn_gl000231, chrUn_gl000229, chrM, chrUn_gl000226, chr18_gl000207_random]

I expected this issues to appear, because the samples were aligned to the hg19 build, but without the chrN_random tables. So, my question is: is there any way that one can edit, or avoid, those chrN_random annotations on the ExAC file? (the file is quite large though - ~6.5GB, not feasible to open on a text editor).

Any help is greatly appreciated!

PS: one solution would be to realign my samples to the whole hg19 build. However, I would like to take this as a last resort, once this step is performed by the sequencing pipeline (IonTorrent), which already accounts for technical bias (i.e. in homopolymer regions) on their data pre-processing.

sequencing variant calling SNP ExAc • 1.6k views
ADD COMMENT
0
Entering edit mode

not feasible to open on a text editor

try excel !!! :-D

ADD REPLY
1
Entering edit mode
4.4 years ago
dodausp ▴ 180

Case solved.

Basically, I liftedOver the ExAC file to the hg19 build used on our sequecing platform (hg19 without those chrN_random annotations), and arranged it so gatk's GetPileupSummaries could interpret it. In case anybody comes across that, here are the steps:

1. LiftingOver ExAC file (picard)

java -jar picard.jar LiftoverVcf I=ExAC.r1.sites.vep.vcf.gz \
O=output_1.vcf.gz \
CHAIN=b37tohg19.chain \
REJECT=rejected_variants.vcf \
R=/genome/reference/file.fasta \
ALLOW_MISSING_FIELDS_IN_HEADER=true MAX_RECORDS_IN_RAM=100000

2. Fixing the header of the file (picard)

java -jar picard.jar FixVcfHeader \
I=output_1.vcf.gz \
O=output_2.vcf.gz

3. Extracting only biallelic entries (gatk)

gatk SelectVariants \
-R /genome/reference.fasta \
-V output_2.vcf.gz \
--restrict-alleles-to BIALLELIC \
-O output_3.vcf.gz

The output_3.vcf.gz is the one to be used on GetPileupSummaries.

In any case, thanks a lot for the suggestion @Pierre Lindenbaum!

And I hope it can help others. (:

ADD COMMENT
0
Entering edit mode
4.4 years ago

This is just a problem of incompatibles dictionaries (using gl*contigs and different version of the MT chromosome ) see problems in hg19 and b37 compatibility .

You could 'cheat' this by removing the 'non 1-22/XY' variants from your vcf file using bcftools and then inject the Gnomad/Exact dictionary into the VCF using : https://broadinstitute.github.io/picard/command-line-overview.html#UpdateVcfSequenceDictionary

ADD COMMENT
0
Entering edit mode

Thanks a lot, @Pierre Lindenbaum! (: Trying that right now. Will post an update later.

ADD REPLY
0
Entering edit mode

I tried, and unfortunately it did not work @Pierre Lindenbaum. I did come up with something else that solved the problem, though (explained below). Thanks a lot for the suggestion! (:

ADD REPLY

Login before adding your answer.

Traffic: 2510 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6