SNP array plink to VCF validation issue
1
0
Entering edit mode
20 months ago

Hi

I am trying to validate my vcf using gatk validateVariants:

gatk ValidateVariants\
   -R Hg19.fasta \
   -V VCF.vcf \
   --validation-type-to-exclude ALL

But I am receiving the following error message:

A USER ERROR has occurred: Input files reference and features have incompatible contigs: Found contigs with the same name but different lengths:

  contig reference = 1 / 249250621

  contig features = 1 / 249212879.

  reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1, NC_007605, hs37d5]

  features contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y]

I am using Hg19 downloaded from gatk as a reference. this is from a test dataset I was using in plink and used --recode flag to output to vcf after performing some QC

the workflow: Export SNP array data from Genome studio into plink format --> Perform QC within plink --> export to vcf format --> Validate VCF for further downstream analysis.

my plink code was as follows: Note: This is just testing the commands

make bed files

plink --file Plink\
 --pheno Phenot --pheno-name Surgery\
 --make-bed --out PlinkP

QC

plink --bfile PlinkP\
 --geno 0.05 --mind 0.05 --maf 0.01 --hwe 0.001\
 --make-bed out/Pass_QC

recode to VCF

 plink --bfile out/Pass_QC\
 --snps-only just-acgt --recode vcf\
 --make-bed --output-chr MT --out out/VCF

view

cat VCF.vcf | grep "^#"

gives this:

fileformat=VCFv4.2

fileDate=

source=PLINKv1.90

contig=<ID=1,length=249212879>

contig=<ID=2,length=243041412>

... other chromosomes ...

contig=<ID=22,length=51214797>

contig=<ID=X,length=154854339>

contig=<ID=Y,length=28650344>

INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">

FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

If I don't perform the QC and convert straight to VCF:

plink --bfile PlinkP\
 --snps-only just-acgt --recode vcf\
 --make-bed --output-chr MT --out NoQC/VCF

gives:

fileformat=VCFv4.2

fileDate=

source=PLINKv1.90

contig=<ID=1,length=249222528>

contig=<ID=2,length=243041412>

... other chromosomes ...
contig=<ID=22,length=51214797>

contig=<ID=X,length=154854339>

contig=<ID=Y,length=58856970>

INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">

FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

Even with the immediate vcf conversion all my contigs are still lower than the reference Genome used (Hg19) for the gatk Validation - does anyone know the reason for this? Even if the reference wasn't Hg19 the contigs still do not match up with others References.

Also is there a way during vcf conversion in plink to have the reference genome so it is included instead of this statement: "Provisional reference allele, may not be based on real reference genome"?

Any advice would be greatly appreciated.

Thanks!

plink vcf gatk Validatevariants • 1.4k views
ADD COMMENT
3
Entering edit mode
20 months ago

The issue is that .bim files do not store reference contig lengths, and plink doesn't have this information hardcoded.

With a recent (April 2022 or newer) plink2 build, you can produce a VCF header with correct contig lengths with

plink2 --bfile PlinkP --fa hs37d5.fa --export vcf

You can download hs37d5.fa from the plink2 Resources page, or many other locations; or if you aligned to a slightly different reference, provide that FASTA to --fa instead.

ADD COMMENT
0
Entering edit mode

Hi

Thanks for your reply.

Using --fa as you said:

plink2 --bfile PlinkP\
 --fa hs37d5.fa.zst\
 --snps-only just-acgt\
 --export vcf\
 --out VCF

gives:

x samples (xx females, x males, x ambiguous; x founders) loaded from PlinkP.fam.

615002 out of 618540 variants loaded from PlinkP.bim.

x binary phenotype loaded (27 cases, 69 controls).

615002 variants remaining after main filters.

--export vcf to VCF.vcf ... done.

doesn't seem to change any output log information and have very similar contig lengths (1 less for each chromosome):

contig=<ID=1,length=249222527>

contig=<ID=2,length=243041411>

... other chromosomes ...

contig=<ID=Y,length=58856969>

contig=<ID=XY,length=155234707>

INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">

FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

I foud that if I use --ref-from-fa after --fa:

plink2 --bfile PlinkP\
 --fa hs37d5.fa.zst --ref-from-fa\
 --snps-only just-acgt\
 --export vcf\
 --out VCF

gives

x samples (x females, x males, x ambiguous; x founders) loaded from

PlinkP.fam.

615002 out of 618540 variants loaded from PlinkP.bim.

x binary phenotype loaded (x cases, x controls).

615002 variants remaining after main filters.

--ref-from-fa: 42847 variants changed, 265698 validated.

--export vcf to VCF.vcf ... done.

but again no change to contig lengths from above:

contig=<ID=1,length=249222527>

contig=<ID=2,length=243041411>

... other Chromosomes ...

contig=<ID=Y,length=58856969>

contig=<ID=XY,length=155234707>

INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">

FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

Do you have any idea why the contigs are still not changing?

The files we are using don't appear to be corrupt or anything.

Agian thanks for your help!

ADD REPLY
0
Entering edit mode

Please post the full .log file, including the very first line with the version string.

ADD REPLY
0
Entering edit mode

Hi Here is is below:

PLINK v2.00a3 SSE4.2 (18 Feb 2022)

Options in effect:

--bfile PlinkP

--export vcf

--fa hs37d5.fa.zst

--out VCF

--ref-from-fa

--snps-only just-acgt

Hostname: richard

Working directory: /x/y/z/PLINK

Start time: Thu Aug 25 16:59:59 2022

Random number seed: 1661443199

3924 MiB RAM detected; reserving 1962 MiB for main workspace.

Using up to 2 compute threads.

x samples (x females, x males, x ambiguous; x founders) loaded from

PlinkP.fam.

615002 out of 618540 variants loaded from PlinkP.bim.

x binary phenotype loaded (x cases, x controls).

615002 variants remaining after main filters.

--ref-from-fa: 42847 variants changed, 265698 validated.

--export vcf to VCF.vcf ... done.

End time: Thu Aug 25 17:00:17 2022

ADD REPLY
1
Entering edit mode

Note that my first response said "April 2022 or newer".

ADD REPLY
0
Entering edit mode

Hi

thanks for your help this seems to work now.

I hadn't realised:

sudo apt -y install plink2

hadn't given me the most up to date version so took it straight from the cog-genomics site.

Thanks :)

ADD REPLY

Login before adding your answer.

Traffic: 2006 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6