plink vcf recodeA error ALT allele duplicates REF allele
7.1 years ago

I am using plink to export sequence data in vcf format to a raw file, but I keep getting the following error at a specific line, which halts the entire process:

"ALT allele duplicates REF allele on line 166993 of .vcf file"

using the following command

plink --vcf "infile.vcf.gz" --extract exomeids.txt --out "outfile" --recodeA

When I examine this line I see that all subjects appear to be homozygous (listed as either 0/0 or 1/1). I can remove this exact chromosome or position but then get the same error on a different line. Is there a more succinct way to remove this error than listing all the positions that result in this error?

Note that this position is not one that I want to extract for my results, but it is still processed in the initial part of the --out command. I have also used similar commands for other vcf files with no errors, so I am curious whether something is wrong with the coding of this file, since I cannot find documentation of this error elsewhere.

Thanks for your help in advance.

7.1 years ago

How was this VCF file generated? The REF and ALT alleles should never be identical; monomorphic variants should have either "." or some unobserved alternate allele in the ALT column.

With that said, you can recover from this situation with a script that detects lines where the REF and ALT columns are identical, and (i) sets ALT to "." and (ii) replaces all instances of "1/1" with "0/0" on those lines.
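A minimal sketch of such a script, assuming an uncompressed, tab-delimited VCF on standard input with GT as the first subfield of each sample column. The function name fix_dup_alt is hypothetical, and the phased form "1|1" is handled the same way as "1/1" as an added assumption:

```shell
# Hypothetical helper (not part of plink): reads a VCF on stdin, writes the
# repaired VCF to stdout.
fix_dup_alt() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }              # header lines pass through untouched
       $4 == $5 {                        # REF and ALT are identical
         $5 = "."                        # (i) mark the site as monomorphic
         for (i = 10; i <= NF; i++) {    # sample columns start at field 10
           sub(/^1\/1/, "0/0", $i)       # (ii) unphased hom-ALT becomes hom-REF
           sub(/^1\|1/, "0|0", $i)       # assumption: treat phased calls alike
         }
       }
       { print }'
}

# Example: fix_dup_alt < infile.vcf > infile_fixed.vcf
```

Only the leading genotype of each sample field is rewritten, so other subfields such as depth or quality are left alone.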

If that's too much of a hassle, you can also just delete those lines with

cat infile.vcf | grep '^#' > infile_filtered.vcf

followed by

cat infile.vcf | grep -v '^#' | awk '{if ($4 != $5) print $0}' >> infile_filtered.vcf


Yes, thanks for the code. I was about to go down that path but was hoping there was a simpler way to exclude them from the --out command.

Note: using --maf to restrict minor allele frequencies to larger values hasn't excluded these lines either, which doesn't fully make sense to me.


This is because VCF import happens before everything else plink does: the entire file is imported, and only then are flags like --maf applied. Otherwise, it would be necessary to add --maf logic to every single import routine, etc.

(The full order of operations is at https://www.cog-genomics.org/plink/1.9/order .)


Thanks. Now I understand why --maf is not working while --not-chr is effective: --not-chr is processed before --vcf turns the file into binary format. Based on this order of operations, it appears that I would not be able to exclude an exact position within a chromosome using the --from and --to flags, since those are applied after --vcf processing. I guess I'm back to deleting the lines from the file, as was originally suggested.

Appreciate all the help!

7.1 years ago
awk '($0 ~/^#/ || $4!=$5)' input.vcf > out.vcf
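Since the file in the question is gzipped (infile.vcf.gz), the same filter can be run as a stream without unpacking to disk. The following is only a sketch: the function name drop_dup_alt_gz and the output file name are hypothetical, and it assumes a tab-delimited VCF and an awk that can write to /dev/stderr:

```shell
# Hypothetical helper: drop_dup_alt_gz IN.vcf.gz OUT.vcf.gz
# Keeps header lines and records whose ALT differs from REF, and reports
# how many records were dropped on stderr.
drop_dup_alt_gz() {
  gzip -dc "$1" |
    awk 'BEGIN { FS = "\t" }
         /^#/ { print; next }        # keep all header lines
         $4 != $5 { print; next }    # keep records where ALT differs from REF
         { dropped++ }               # everything else has ALT equal to REF
         END { printf "dropped %d record(s)\n", dropped > "/dev/stderr" }' |
    gzip -c > "$2"
}

# Example, reusing the plink command from the question on the filtered file:
#   drop_dup_alt_gz infile.vcf.gz infile_filtered.vcf.gz
#   plink --vcf infile_filtered.vcf.gz --extract exomeids.txt --out outfile --recodeA
```

The count on stderr gives a quick check of how many sites had duplicated alleles before the filtered file goes to plink.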