Question: Get alternate / total allele counts for 1000 Genomes super-populations
0
gravatar for Sean
4.8 years ago by
Sean180
United States
Sean180 wrote:

I would like to get the alternate allele counts (AC) and the total allele counts (AN) for any variant in each of the five 1000 Genomes super-populations (AFR, AMR, EAS, EUR, SAS) as well as the global population (ALL).

1000 Genomes offers its Allele Frequency Calculator which gives an output for the global population (ALL) and each sub-population (ACB, ACW, BEB, etc.) like the following:

CHR   POS    ID  REF  ALT   ALL_POP_TOTAL_CNT  ALL_POP_ALT_CNT  ALL_POP_FRQ  ...
1     10177  .   A    AC    5008               2130             0.43         ...

This gives me exactly what I need, but ideally I would like to have a solution that I can implement in a pipeline (aka independent of the online interface), perhaps using vcftools or bcftools. I know I can sum the values for the sub-populations to get the values for each respective super-population, but I also wonder if there is a simpler/faster way that I'm missing.

What I've tried already:

- I can easily get AF for the global and super-populations using ANNOVAR, but I still need AC and AN.

- I can get AC and AF from dbNSFP 2, but this limits the variants to non-synonymous SNPs only. Technically, I could calculate AN by dividing AC by AF, but this introduces rounding errors because AF has been truncated. Additionally, if AC and AF are zero, then I won't be able to calculate AN at all.

- I have tried using the fill-an-ac script for VCF files using the technique suggested here. This will update the AN and AC fields just fine, but it doesn't update the AF field for some reason.

- I've dabbled in the idea of adding up the genotypes (e.g. 0|0, 0|1, 1|1, etc.) in the VCF/BCF files, but I was hoping to avoid this if possible.

Question:

How can I get the AC, AN, and AF of any variant for each of the five 1000 Genomes super-populations as well as the global population? Can I do this without first calculating the sub-populations?

NOTE: I know AF is included in the 1000 Genomes VCF/BCF files, but if someone knows how to get AC, AN, and AF in one fell swoop (similar to Allele Frequency Calculator) then it would be greatly appreciated.

ADD COMMENTlink modified 4.7 years ago • written 4.8 years ago by Sean180
1
gravatar for Sean
4.7 years ago by
Sean180
United States
Sean180 wrote:

For this, I used a combination of bcftools (to calculate AC and AN) and awk (to calculate AF).

First, download the panel file:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel

Next, download the BCF file:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/bcf_files/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.bcf

Then save the following awk script into a file called fill-AF.awk (you can copy and paste the code as-is; you don't need to edit anything):

fill-AF.awk

# Print header
/^#/{ print }

# Calculate AF(s)
/^[^#]/{
  # Get AC(s)
  match($0, /[\t ;]AC=[^\t ;]*/)
  num_acs = split(substr($0, RSTART+4, RLENGTH-4), acs, ",")

  # Get AN
  match($0, /[\t ;]AN=[^\t ;]*/)
  an = substr($0, RSTART+4, RLENGTH-4)

  # Print everything before original AF
  match($0, /[\t ;]AF=[^\t ;]*/)
  printf("%s", substr($0, 0, RSTART+3))

  # Calculate AF(s)
  for (i = 1; i <= num_acs; i++) {
    printf("%f", acs[i]/an)
    if (i < num_acs) {
      printf(",")
    }
  }

  # Print everything after original AF
  print(substr($0, RSTART+RLENGTH))
}

Next, save the following code into a bash script called fill-AN-AC-AF.sh (copy and paste as-is, except you may need to change your bash path at the top):

fill-AN-AC-AF.sh

#!/bin/bash
# Define population
POP=$1

# Get a list of samples
grep $POP integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > $POP.samples.list

# This part has several steps as follows:
# STEP 1: bcftools recalculates the INFO/AC and INFO/AN fields
#           -G Drops genotype info in output file
#           -S File of sample names to include
#           -O Output as VCF
#            NOTE: This will NOT recalculate the INFO/AF field yet
# STEP 2: awk program recalculates INFO/AF by dividing INFO/AN by INFO/AC
# STEP 3: bgzip compresses the output into a vcf.gz file
bcftools view \
  -G \
  -S $POP.samples.list \
  -O v \
  ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.bcf.gz \
  | awk -f fill-AF.awk \
  | bgzip > $POP.vcf.gz

# Create tabix index
tabix -p vcf $POP.vcf.gz

Run the bash script, passing in the target population as an argument. This can either be a super-population (e.g. AFR, AMR, EAS, EUR, or SAS) or a sub-population (ACB, ASW, BEB, CDX, CEU, etc.). For example, if I wanted to recalculate AC, AN, and AF for the African super-population, I would run:

./fill-AN-AC-AF.sh AFR

After this is done running, a file called (in this case) AFR.vcf.gz will be created with the correct AC, AN, and AF for the African super-population. An example query to this file would be:

bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%INFO/AC\t%INFO/AN\t%INFO/AF\n' AFR.vcf.gz

This would give an output formatted as follows:

1       536816  C       T       34      694     0.048991
1       536866  A       G       1       694     0.001441
1       536895  T       C,G     10,2    694     0.014409,0.00288
ADD COMMENTlink modified 11 months ago by RamRS24k • written 4.7 years ago by Sean180
0
gravatar for Jorge Amigo
4.8 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

I would probably have tried bcftools before, but if I understood it correctly and if you're ok getting AF from ANNOVAR you'll be able to get AN by multiplying Nsamples*2 and AC by multiplying Nsamples*AF for each super-population. it would be a 2-step process, but that second process could be done with a simple awk command.

ADD COMMENTlink written 4.8 years ago by Jorge Amigo11k

Jorge, I like this answer, but I'm still afraid that multiplying Nsamples*AF to get AC will introduce rounding errors because AF is truncated. As an additional note, if I was to do it your way, I could actually skip using ANNOVAR and pull the super-population frequencies directory from the VCF/BCF file. I came up with a solution that worked for my purposes if you'd like to check it out. Thanks for your input.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Sean180
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1771 users visited in the last hour