Question: How to use ExAC/gnomAD without chrN_random?
0
gravatar for dodausp
5 months ago by
dodausp140
Denmark/Copenhagen/BRIC
dodausp140 wrote:

Hi,

I am trying to estimate my sample contamination based on gatk's GetPileupSummaries / CalculateContamination best practices. This pipeline requires a .vcf file containing common germline variant sites, which here I am resourcing from the ExAC database (now migrated to gnomAD). However, when using GetPileupSummaries, I get the following error:

A USER ERROR has occurred: Input files reads and features have incompatible contigs: Found contigs with the same name but different lengths:
contig reads = chrM / 16569
contig features = chrM / 16571.

reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
features contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl000206_random, chrUn_gl000232, chrUn_gl000234, chr11_gl000202_random, chrUn_gl000238, chrUn_gl000244, chrUn_gl000248, chr8_gl000196_random, chrUn_gl000249, chrUn_gl000246, chr17_gl000203_random, chr8_gl000197_random, chrUn_gl000245, chrUn_gl000247, chr9_gl000201_random, chrUn_gl000235, chrUn_gl000239, chr21_gl000210_random, chrUn_gl000231, chrUn_gl000229, chrM, chrUn_gl000226, chr18_gl000207_random]

I expected this issues to appear, because the samples were aligned to the hg19 build, but without the chrN_random tables. So, my question is: is there any way that one can edit, or avoid, those chrN_random annotations on the ExAC file? (the file is quite large though - ~6.5GB, not feasible to open on a text editor).

Any help is greatly appreciated!

PS: one solution would be to realign my samples to the whole hg19 build. However, I would like to take this as a last resort, once this step is performed by the sequencing pipeline (IonTorrent), which already accounts for technical bias (i.e. in homopolymer regions) on their data pre-processing.

ADD COMMENTlink modified 5 months ago • written 5 months ago by dodausp140

not feasible to open on a text editor

try excel !!! :-D

ADD REPLYlink modified 5 months ago • written 5 months ago by Pierre Lindenbaum128k
1
gravatar for dodausp
5 months ago by
dodausp140
Denmark/Copenhagen/BRIC
dodausp140 wrote:

Case solved.

Basically, I liftedOver the ExAC file to the hg19 build used on our sequecing platform (hg19 without those chrN_random annotations), and arranged it so gatk's GetPileupSummaries could interpret it. In case anybody comes across that, here are the steps:

1. LiftingOver ExAC file (picard)

java -jar picard.jar LiftoverVcf I=ExAC.r1.sites.vep.vcf.gz \
O=output_1.vcf.gz \
CHAIN=b37tohg19.chain \
REJECT=rejected_variants.vcf \
R=/genome/reference/file.fasta \
ALLOW_MISSING_FIELDS_IN_HEADER=true MAX_RECORDS_IN_RAM=100000

2. Fixing the header of the file (picard)

java -jar picard.jar FixVcfHeader \
I=output_1.vcf.gz \
O=output_2.vcf.gz

3. Extracting only biallelic entries (gatk)

gatk SelectVariants \
-R /genome/reference.fasta \
-V output_2.vcf.gz \
--restrict-alleles-to BIALLELIC \
-O output_3.vcf.gz

The output_3.vcf.gz is the one to be used on GetPileupSummaries.

In any case, thanks a lot for the suggestion @Pierre Lindenbaum!

And I hope it can help others. (:

ADD COMMENTlink written 5 months ago by dodausp140
0
gravatar for Pierre Lindenbaum
5 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum128k wrote:

This is just a problem of incompatibles dictionaries (using gl*contigs and different version of the MT chromosome ) see problems in hg19 and b37 compatibility .

You could 'cheat' this by removing the 'non 1-22/XY' variants from your vcf file using bcftools and then inject the Gnomad/Exact dictionary into the VCF using : https://broadinstitute.github.io/picard/command-line-overview.html#UpdateVcfSequenceDictionary

ADD COMMENTlink written 5 months ago by Pierre Lindenbaum128k

Thanks a lot, @Pierre Lindenbaum! (: Trying that right now. Will post an update later.

ADD REPLYlink written 5 months ago by dodausp140

I tried, and unfortunately it did not work @Pierre Lindenbaum. I did come up with something else that solved the problem, though (explained below). Thanks a lot for the suggestion! (:

ADD REPLYlink written 5 months ago by dodausp140
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1150 users visited in the last hour