Filter out overlapping deletions from a VCF file.
14 months ago

I am trying to filter out the spanning or overlapping deletions in a GVCF file, which are denoted by an asterisk (*) in the ALT field of the VCF format: https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele-

I have tried different bcftools commands for this:

bcftools view -f '%ALT != *' -O z -o GVCF_SNPs_output.vcf.gz GVCF_input.vcf.gz

bcftools filter -i 'FORMAT/ALT="*"' -O z -o GVCF_SNPs_output.vcf.gz   GVCF_input.vcf.gz

But neither seems to work for loci that have several alternative alleles (for example "A,*,C" in the ALT field).
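One way to check whether an output file still carries the * allele at multiallelic loci is to count the affected records directly. The following is a minimal sketch using only awk, assuming a standard tab-separated VCF on standard input with ALT in the fifth column; `count_star_sites` is a hypothetical helper name, not a bcftools or GATK command.

```shell
# Count data records whose ALT column (field 5) lists "*" among its
# comma-separated alleles. Header lines (starting with "#") are skipped.
# count_star_sites is a hypothetical helper name used for illustration.
count_star_sites() {
    awk -F'\t' '
        /^#/ { next }
        {
            n = split($5, alleles, ",")
            for (i = 1; i <= n; i++) {
                if (alleles[i] == "*") { hits++; break }
            }
        }
        END { print hits + 0 }
    '
}

# Example use on a compressed file (decompress to stdout first):
# gzip -dc GVCF_SNPs_output.vcf.gz | count_star_sites
```

A non-zero count on the filtered file would confirm that records such as "A,*,C" are passing through the filter.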

Has anyone successfully filtered the deletions (*) out of a GVCF file?

vcf bcftools gvcf • 1.2k views

If you're using GATK, you could use SelectVariants after joint genotyping with GenotypeGVCFs. For example:

gatk SelectVariants \
     -R ref.fasta \
     -V input_cohort.vcf \
     --select-type-to-include SNP \
     -O cohort_SNP.vcf

I only say after GenotypeGVCFs because I don't know the implications of removing variants from a GVCF file. Alternatively, you could use the --select-type-to-exclude parameter if you want more than just SNPs, though I can't see in the docs what variant type * is.


Thank you for the answer, dthorbur. Interestingly, I already used this command to make the GVCF file, which apparently means that overlapping deletions are not removed by it.

~/gatk-4.2.0.0/gatk SelectVariants \
 --exclude-filtered true \
 --select-type-to-include SNP \
 --variant ./GenotypeGVCFs_filt.vcf.gz \
 --output ./GVCF_SNPs.vcf.gz

Damn. I remember I had this problem too a while ago as MSMC wouldn't accept * annotations, but I believe I just removed all sites where they were present.
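Removing whole sites in that way can be sketched with awk alone. This is only an illustration, assuming a standard tab-separated VCF on standard input with ALT in the fifth column; `drop_star_sites` is a hypothetical helper name and the file names in the usage comment are placeholders.

```shell
# Drop every record whose ALT column (field 5) contains "*" as one of its
# comma-separated alleles. Header lines pass through unchanged.
# drop_star_sites is a hypothetical helper name used for illustration.
drop_star_sites() {
    awk -F'\t' '
        /^#/ { print; next }
        {
            n = split($5, alleles, ",")
            for (i = 1; i <= n; i++) {
                if (alleles[i] == "*") next   # skip the whole site
            }
            print
        }
    '
}

# Example use (placeholder file names):
# gzip -dc cohort_SNP.vcf.gz | drop_star_sites > cohort_SNP.nostar.vcf
```

Note that this discards the entire record, so a site such as "A,*,C" loses its A and C alleles as well. If the result needs to be indexed, it would have to be recompressed with bgzip rather than left as plain text.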

I also found this previous forum post, which may offer a solution:

bcftools norm --check-ref w -f reference.fasta -m -any cohort_SNP.vcf.gz > cohort_SNP.norm.vcf

This appears to give each allele of a multiallelic site its own line. Whilst this would then permit removal of the * entries, it may also leave multiple lines for multiallelic sites that you otherwise want to keep.

Why do you need to remove the * annotations anyway?


Thank you for the script dthorbur, I will give it a go.

I want to use the population genetic software angsd to analyse my dataset. http://popgen.dk/angsd/index.php/ANGSD

Unfortunately, it seems that the * sites are not currently recognised by angsd. https://github.com/ANGSD/angsd/issues/557#issuecomment-1435521926

If anyone has a simple solution for this, I would be interested.
