Question: Extracting Ancestral allele from INFO column of vcf file
0
gravatar for NS
5 weeks ago by
NS0
NS0 wrote:

Hi From my vcf file, I need to generate a text file with the following headers (shown just an example):
CHROM POS ID REF ALT AA
1 886817 rs11174805 C G C
1 886817 rs111111111 C T .
1 886817 rs11144444 A T A

Here AA column is the Ancestral allele column that can be obtained from the INFO column of the vcf file which has "AA=allele_name". In second row, "." implies missing "AA=." in INFO column and thus the ancestral allele is unknown. i have extracted the columns from my vcf file using: awk 'BEGIN {OFS ="\t" ; FS = "\t"};{print $1, $2, $3, $4, $5, $8}' chr22.vcf

This gives me columns - CHROM, POS, ID, REF, ALT, INFO.
INFO column looks like this: GAC=1;AF=0.000199681;AN=5008;NS=2504;DP=8012;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0.001;AA=.|||;VT=SNP

Now, using shell script, how can I extract AA from INFO and create a new column with AA alleles in it ?

Thanks

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by NS0
1
gravatar for Kevin Blighe
5 weeks ago by
Kevin Blighe42k
Republic of Ireland
Kevin Blighe42k wrote:

Assuming that AA is defined in your header, like this:

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele. Format: AA|REF|ALT|IndelType. AA: Ancestral allele, REF:Reference Allele, ALT:Alternate Allele, IndelType:Type of Indel (REF, ALT and IndelType are only defined for indels)">

Then, you can just do this:

bcftools query --format '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%AA\n' 1000Genomes.Norm.bcf | head -20
1   10177   rs367896724 A   AC  |||unknown(NO_COVERAGE)
1   10235   rs540431307 T   TA  |||unknown(NO_COVERAGE)
1   10352   rs555500075 T   TA  |||unknown(NO_COVERAGE)
1   10505   rs548419688 A   T   .|||
1   10506   rs568405545 C   G   .|||
1   10511   rs534229142 G   A   .|||
1   10539   rs537182016 C   A   .|||
1   10542   rs572818783 C   T   .|||
1   10579   rs538322974 C   A   .|||
1   10616   rs376342519 CCGCCGTTGCAAAGGCGCGCCG  C   .
1   10642   rs558604819 G   A   .|||
1   11008   rs575272151 C   G   .|||
1   11012   rs544419019 C   G   .|||
1   11063   rs561109771 T   G   .|||
1   13011   rs574746232 T   G   t|||
1   13110   rs540538026 G   A   g|||
1   13116   rs62635286  T   G   t|||
1   13118   rs200579949 A   G   a|||
1   13156   rs552314247 G   C   g|||
1   13259   rs562993331 G   A   g|||
ADD COMMENTlink written 5 weeks ago by Kevin Blighe42k

Thanks for your help ! Is there also a way to keep only bi-allelic SNPs in this output using bcf tools ?

ADD REPLYlink written 5 weeks ago by NS0
0
gravatar for jared.andrews07
5 weeks ago by
jared.andrews072.3k
St. Louis, MO
jared.andrews072.3k wrote:

In addition to Kevin's answer, GATK has a very nice VariantsToTable tool for extracting data like this that is sometimes easier and quicker when your queries are more complicated.

ADD COMMENTlink written 5 weeks ago by jared.andrews072.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1141 users visited in the last hour