Question: Extracting Ancestral allele from INFO column of vcf file
gravatar for NS
22 months ago by
NS10 wrote:

Hi From my vcf file, I need to generate a text file with the following headers (shown just an example):
1 886817 rs11174805 C G C
1 886817 rs111111111 C T .
1 886817 rs11144444 A T A

Here AA column is the Ancestral allele column that can be obtained from the INFO column of the vcf file which has "AA=allele_name". In second row, "." implies missing "AA=." in INFO column and thus the ancestral allele is unknown. i have extracted the columns from my vcf file using: awk 'BEGIN {OFS ="\t" ; FS = "\t"};{print $1, $2, $3, $4, $5, $8}' chr22.vcf

This gives me columns - CHROM, POS, ID, REF, ALT, INFO.
INFO column looks like this: GAC=1;AF=0.000199681;AN=5008;NS=2504;DP=8012;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0.001;AA=.|||;VT=SNP

Now, using shell script, how can I extract AA from INFO and create a new column with AA alleles in it ?


ADD COMMENTlink modified 22 months ago • written 22 months ago by NS10
gravatar for Kevin Blighe
22 months ago by
Kevin Blighe71k
Republic of Ireland
Kevin Blighe71k wrote:

Assuming that AA is defined in your header, like this:

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele. Format: AA|REF|ALT|IndelType. AA: Ancestral allele, REF:Reference Allele, ALT:Alternate Allele, IndelType:Type of Indel (REF, ALT and IndelType are only defined for indels)">

Then, you can just do this:

bcftools query --format '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%AA\n' 1000Genomes.Norm.bcf | head -20
1   10177   rs367896724 A   AC  |||unknown(NO_COVERAGE)
1   10235   rs540431307 T   TA  |||unknown(NO_COVERAGE)
1   10352   rs555500075 T   TA  |||unknown(NO_COVERAGE)
1   10505   rs548419688 A   T   .|||
1   10506   rs568405545 C   G   .|||
1   10511   rs534229142 G   A   .|||
1   10539   rs537182016 C   A   .|||
1   10542   rs572818783 C   T   .|||
1   10579   rs538322974 C   A   .|||
1   10616   rs376342519 CCGCCGTTGCAAAGGCGCGCCG  C   .
1   10642   rs558604819 G   A   .|||
1   11008   rs575272151 C   G   .|||
1   11012   rs544419019 C   G   .|||
1   11063   rs561109771 T   G   .|||
1   13011   rs574746232 T   G   t|||
1   13110   rs540538026 G   A   g|||
1   13116   rs62635286  T   G   t|||
1   13118   rs200579949 A   G   a|||
1   13156   rs552314247 G   C   g|||
1   13259   rs562993331 G   A   g|||
ADD COMMENTlink written 22 months ago by Kevin Blighe71k

Thanks for your help ! Is there also a way to keep only bi-allelic SNPs in this output using bcf tools ?

ADD REPLYlink written 22 months ago by NS10
gravatar for jared.andrews07
22 months ago by
Memphis, TN
jared.andrews078.6k wrote:

In addition to Kevin's answer, GATK has a very nice VariantsToTable tool for extracting data like this that is sometimes easier and quicker when your queries are more complicated.

ADD COMMENTlink written 22 months ago by jared.andrews078.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1486 users visited in the last hour