Question

Get alternate / total allele counts for 1000 Genomes super-populations

0

Entering edit mode

9.4 years ago

Sean ▴ 390

I would like to get the alternate allele counts (AC) and the total allele counts (AN) for any variant in each of the five 1000 Genomes super-populations (AFR, AMR, EAS, EUR, SAS) as well as the global population (ALL).

1000 Genomes offers its Allele Frequency Calculator which gives an output for the global population (ALL) and each sub-population (ACB, ACW, BEB, etc.) like the following:

CHR   POS    ID  REF  ALT   ALL_POP_TOTAL_CNT  ALL_POP_ALT_CNT  ALL_POP_FRQ  ...
1     10177  .   A    AC    5008               2130             0.43         ...

This gives me exactly what I need, but ideally I would like to have a solution that I can implement in a pipeline (aka independent of the online interface), perhaps using vcftools or bcftools. I know I can sum the values for the sub-populations to get the values for each respective super-population, but I also wonder if there is a simpler/faster way that I'm missing.

What I've tried already:

I can easily get AF for the global and super-populations using ANNOVAR, but I still need AC and AN.
I can get AC and AF from dbNSFP 2, but this limits the variants to non-synonymous SNPs only. Technically, I could calculate AN by dividing AC by AF, but this introduces rounding errors because AF has been truncated. Additionally, if AC and AF are zero, then I won't be able to calculate AN at all.
I have tried using the fill-an-ac script for VCF files using the technique suggested here. This will update the AN and AC fields just fine, but it doesn't update the AF field for some reason.
I've dabbled in the idea of adding up the genotypes (e.g. 0|0, 0|1, 1|1, etc.) in the VCF/BCF files, but I was hoping to avoid this if possible.

Question:

How can I get the AC, AN, and AF of any variant for each of the five 1000 Genomes super-populations as well as the global population? Can I do this without first calculating the sub-populations?

NOTE: I know AF is included in the 1000 Genomes VCF/BCF files, but if someone knows how to get AC, AN, and AF in one fell swoop (similar to Allele Frequency Calculator) then it would be greatly appreciated.

1000Genomes vcf allele-counts bcf • 5.0k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by Sean ▴ 390

0

Entering edit mode

9.4 years ago

Jorge Amigo 14k

I would probably have tried bcftools before, but if I understood it correctly and if you're ok getting AF from ANNOVAR you'll be able to get AN by multiplying Nsamples * 2 and AC by multiplying Nsamples * AF for each super-population. it would be a 2-step process, but that second process could be done with a simple awk command.

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by Jorge Amigo 14k

0

Entering edit mode

Jorge, I like this answer, but I'm still afraid that multiplying Nsamples * AF to get AC will introduce rounding errors because AF is truncated. As an additional note, if I was to do it your way, I could actually skip using ANNOVAR and pull the super-population frequencies directory from the VCF/BCF file. I came up with a solution that worked for my purposes if you'd like to check it out. Thanks for your input.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Sean ▴ 390

Ram · Accepted Answer · 2014-12-22

For this, I used a combination of bcftools (to calculate AC and AN) and awk (to calculate AF).

First, download the panel file:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel

Next, download the BCF file:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/bcf_files/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.bcf

Then save the following awk script into a file called fill-AF.awk (you can copy and paste the code as-is; you don't need to edit anything):

fill-AF.awk

# Print header
/^#/{ print }

# Calculate AF(s)
/^[^#]/{
  # Get AC(s)
  match($0, /[\t ;]AC=[^\t ;]*/)
  num_acs = split(substr($0, RSTART+4, RLENGTH-4), acs, ",")

  # Get AN
  match($0, /[\t ;]AN=[^\t ;]*/)
  an = substr($0, RSTART+4, RLENGTH-4)

  # Print everything before original AF
  match($0, /[\t ;]AF=[^\t ;]*/)
  printf("%s", substr($0, 0, RSTART+3))

  # Calculate AF(s)
  for (i = 1; i <= num_acs; i++) {
    printf("%f", acs[i]/an)
    if (i < num_acs) {
      printf(",")
    }
  }

  # Print everything after original AF
  print(substr($0, RSTART+RLENGTH))
}

Next, save the following code into a bash script called fill-AN-AC-AF.sh (copy and paste as-is, except you may need to change your bash path at the top):

fill-AN-AC-AF.sh

#!/bin/bash
# Define population
POP=$1

# Get a list of samples
grep $POP integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > $POP.samples.list

# This part has several steps as follows:
# STEP 1: bcftools recalculates the INFO/AC and INFO/AN fields
#           -G Drops genotype info in output file
#           -S File of sample names to include
#           -O Output as VCF
#            NOTE: This will NOT recalculate the INFO/AF field yet
# STEP 2: awk program recalculates INFO/AF by dividing INFO/AN by INFO/AC
# STEP 3: bgzip compresses the output into a vcf.gz file
bcftools view \
  -G \
  -S $POP.samples.list \
  -O v \
  ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.bcf.gz \
  | awk -f fill-AF.awk \
  | bgzip > $POP.vcf.gz

# Create tabix index
tabix -p vcf $POP.vcf.gz

Run the bash script, passing in the target population as an argument. This can either be a super-population (e.g. AFR, AMR, EAS, EUR, or SAS) or a sub-population (ACB, ASW, BEB, CDX, CEU, etc.). For example, if I wanted to recalculate AC, AN, and AF for the African super-population, I would run:

./fill-AN-AC-AF.sh AFR

After this is done running, a file called (in this case) AFR.vcf.gz will be created with the correct AC, AN, and AF for the African super-population. An example query to this file would be:

bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%INFO/AC\t%INFO/AN\t%INFO/AF\n' AFR.vcf.gz

This would give an output formatted as follows:

1       536816  C       T       34      694     0.048991
1       536866  A       G       1       694     0.001441
1       536895  T       C,G     10,2    694     0.014409,0.00288