Question: plink vcf recodeA error ALT allele duplicates REF allele
2.5 years ago, cynthiamonster wrote:

I am using plink to export sequence data in VCF format to a raw file, but I keep getting the following error at a specific line, which halts the entire process:

"ALT allele duplicates REF allele on line 166993 of .vcf file"

using the following command

plink --vcf "infile.vcf.gz" --extract exomeids.txt --out "outfile" --recodeA

When I examine this line I see that all subjects appear to be homozygous (listed as either 0/0 or 1/1). I can remove this exact chromosome or position but then get the same error on a different line. Is there a more succinct way to remove this error than listing all the positions that result in this error?

Note that this position is not one that I want to extract for my results, but it is still processed in the initial part of the --out command. I have also used similar commands for other VCF files with no errors, so I am curious whether something is wrong with the coding of this file, since I cannot find documentation of this error elsewhere.

Thanks for your help in advance.

plink • 1.4k views
modified 2.5 years ago by Pierre Lindenbaum • written 2.5 years ago by cynthiamonster
2.5 years ago, chrchang523 (United States) wrote:

How was this VCF file generated? The REF and ALT alleles should never be identical; monomorphic variants should have either "." or some unobserved alternate allele in the ALT column.

With that said, you can recover from this situation with a script that detects lines where the REF and ALT columns are identical, and (i) sets ALT to "." and (ii) replaces all instances of "1/1" with "0/0" on those lines.
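
A rough sketch of such a repair script, in case it is useful. The function name fix_dup_alt is made up for illustration, and the script assumes a tab-delimited VCF whose sample columns start at field 10 with GT as the first FORMAT entry. It also rewrites phased "1|1" calls, which goes slightly beyond the suggestion above, and it leaves any other genotype (for example "0/1") untouched.

```shell
# Hypothetical helper (name is illustrative): repair VCF records whose ALT
# duplicates REF. Reads VCF text on stdin, writes the repaired VCF to stdout.
fix_dup_alt() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }              # header lines pass through unchanged
       $4 == $5 {
         $5 = "."                         # (i) no alternate allele at this site
         for (i = 10; i <= NF; i++) {     # assumption: samples start at field 10
           gt = substr($i, 1, 3)
           # (ii) rewrite a leading 1/1 (or phased 1|1) genotype as 0/0 (or 0|0),
           # keeping the separator and any trailing ":..." subfields
           if ((gt == "1/1" || gt == "1|1") &&
               (length($i) == 3 || substr($i, 4, 1) == ":"))
             $i = "0" substr($i, 2, 1) "0" substr($i, 4)
         }
       }
       { print }'
}
```

For the gzipped file in the question this could be run as gzip -dc infile.vcf.gz | fix_dup_alt | gzip > infile_fixed.vcf.gz, where the output name is just a placeholder.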

If that's too much of a hassle, you can also just delete those lines with

cat infile.vcf | grep '^#' > infile_filtered.vcf

followed by

cat infile.vcf | grep -v '^#' | awk '{if ($4 != $5) print $0}' >> infile_filtered.vcf
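
Before deleting anything, it may be worth checking how many records are affected and where they are. A minimal sketch, assuming a tab-delimited VCF; list_dup_alt is a made-up name. The first output column is the line number within the stream, header lines included, which should correspond to the line number in the plink error message.

```shell
# Hypothetical helper (name is illustrative): report VCF records whose REF and
# ALT columns are identical. Prints line number, CHROM, POS and the duplicated
# allele, tab-separated, one record per line.
list_dup_alt() {
  awk 'BEGIN { FS = OFS = "\t" }
       !/^#/ && $4 == $5 { print NR, $1, $2, $4 }'
}
```

Typical use would be list_dup_alt < infile.vcf, or list_dup_alt < infile.vcf | wc -l to get just the count.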

written 2.5 years ago by chrchang523

Yes, thanks for the code. I was about to go down that path but was hoping there was a simpler way to exclude them from the --out command.

Note: using --maf to restrict minor allele frequencies to larger values hasn't excluded these lines either, which doesn't fully make sense to me.

Reply written 2.5 years ago by cynthiamonster

This is because VCF import happens before everything else plink does: the entire file is imported, and only then are flags like --maf applied. Otherwise, it would be necessary to add --maf logic to every single import routine, etc.

(The full order of operations is at https://www.cog-genomics.org/plink/1.9/order .)

Reply written 2.5 years ago by chrchang523

Thanks. Now I understand why --maf is not working while --not-chr is effective: --not-chr is processed before --vcf turns the file into binary format. Based on this order of operations, it appears that I would not be able to exclude an exact position within a chromosome using the --from and --to flags, since they are applied after --vcf processing. I guess I'm back to deleting the lines from the file, as was originally suggested.

Appreciate all the help!

Reply written 2.5 years ago by cynthiamonster

2.5 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:
awk '($0 ~/^#/ || $4!=$5)' input.vcf > out.vcf
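
Since the file in the question is gzipped (infile.vcf.gz), the same filter can be applied to the decompressed stream and the result passed to the original plink command. A sketch under those assumptions: the function names and the "_filtered" output name are made up, and the plink flags are copied unchanged from the question.

```shell
# Keep header lines and every record whose REF ($4) differs from ALT ($5).
drop_dup_alt() {
  awk -F '\t' '/^#/ || $4 != $5'
}

# Hypothetical wrapper: filter a gzipped VCF, then run the plink command from
# the question on the filtered copy. plink is only started if filtering succeeds.
filter_then_recode() {
  in_vcf=$1        # e.g. infile.vcf.gz
  out_prefix=$2    # e.g. outfile
  filtered="${in_vcf%.vcf.gz}_filtered.vcf.gz"   # illustrative output name
  gzip -dc "$in_vcf" | drop_dup_alt | gzip > "$filtered" &&
    plink --vcf "$filtered" --extract exomeids.txt --out "$out_prefix" --recodeA
}
```

It would be called as filter_then_recode infile.vcf.gz outfile. Filtering first matters because, as noted in this thread, plink imports the whole VCF before applying flags such as --maf.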
written 2.5 years ago by Pierre Lindenbaum