Question

Split multiple annotations from SnpEff

0

Entering edit mode

19 months ago

biomonte ▴ 220

The INFO field (8th column) of SnpEff can have multiple annotations for a particular SNP. For example:

1   52100121    .   T   A   2605.89 .   AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034394.1|protein_coding|26/27|c.7207+1962A>T||||||,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034408.1|protein_coding|25/26|c.6514+1962A>T||||||,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034418.1|protein_coding|23/24|c.5716+813A>T||||||,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034451.1|protein_coding|23/24|c.5698+1962A>T||||||,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034483.1|protein_coding|24/25|c.5638+1962A>T||||||    GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0

I would like to put each annotation into a different row but keeping the rest of the information untouched, just like so:

1   52100121    .   T   A   2605.89 .   AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034394.1|protein_coding|26/27|c.7207+1962A>T||||||   GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034408.1|protein_coding|25/26|c.6514+1962A>T||||||   GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034418.1|protein_coding|23/24|c.5716+813A>T||||||    GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034451.1|protein_coding|23/24|c.5698+1962A>T||||||   GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;ANN=A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034483.1|protein_coding|24/25|c.5638+1962A>T||||||   GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0

I wrote the following script that can successfully achieve this task using bash and awk, but it is crazy inefficient on large files with millions of SNPs:

while read SNP ; \
do \
awk -F"\t" -v snp_pos="$SNP" '$2==snp_pos{print $8}' snpeff_annotation_results.vcf | \
awk -F"ANN=" '{print $2}' | \
sed 's/,/\n/g' | \
sed "s/^/$(awk -F"\t" -v snp_pos="$SNP" '$2==snp_pos{print $8}' \
snpeff_annotation_results.vcf | \
awk -F"|" '{print $1}' | \
awk '{print substr($0, 1, length($0)-1)}')/g" | \
sed "s/^/$(awk -F"\t" -v snp_pos="$SNP" '$2==snp_pos{print $1,$2,$3,$4,$5,$6,$7}' \
snpeff_annotation_results.vcf) /g" | \
sed "s%\$% $(awk -F"\t" -v snp_pos="$SNP" '$2==snp_pos{print $9,$10}' \
snpeff_annotation_results.vcf)%g" \
; done < snps.list >> snpeff_annotation_results.vcf.splitted

Does anyone know an alternative way to do this task faster?

Additional comment 1: the files 'snpeff_annotation_results.vcf' and 'snps.list' (not shown) contain annotation results from SnpEff and the position of millions of snps (one per row), respectively. Here I am just posting one SNP to make my question more straightforward.

Additional comment 2: besides bash/awk, I have experience in Python, Perl and R. So, I would be grateful if the solution could be with any of these languages.

SnpEff AWK Bash VCF Annotation • 829 views

ADD COMMENT • link updated 19 months ago by iraun 6.2k • written 19 months ago by biomonte ▴ 220

score 3 · Accepted Answer · 2022-09-23

3

Entering edit mode

19 months ago

iraun 6.2k

Hi! Would something like this work for you?

Let me know if you need any explanation of the code :)

awk -F'\t' '{split($8,info,"ANN="); split(info[2],ann,",");  for (i in ann) {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"info[1]","ann[i]"\t"$9"\t"$10}}' test.vcf

ADD COMMENT • link 19 months ago by iraun 6.2k

1

Entering edit mode

Hi iraun ! Your code works great and it seems to run super fast compared to mine, so thank you very much for your help :) there is only one minor issue which is the "ANN=" is not being printed, but I think I can easily fix this with 'sed' after running the script, just like so: sed 's/;,/;ANN=/g'

1   52100121    .   T   A   2605.89 .   AC=2;AN=2;,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034451.1|protein_coding|23/24|c.5698+1962A>T||||||  GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034483.1|protein_coding|24/25|c.5638+1962A>T||||||  GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034394.1|protein_coding|26/27|c.7207+1962A>T||||||  GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034408.1|protein_coding|25/26|c.6514+1962A>T||||||  GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0
1   52100121    .   T   A   2605.89 .   AC=2;AN=2;,A|intron_variant|MODIFIER|WNK1|ENSCJPG00005019704|transcript|ENSCJPT00005034418.1|protein_coding|23/24|c.5716+813A>T||||||   GT:DP:AD:RO:QR:AO:QA:GL 1/1:11:0,11:0:0:11:451:-40.9457,-3.31133,0

ADD REPLY • link 19 months ago by biomonte ▴ 220

1

Entering edit mode

Glad it worked! And you are totally right, ANN= is missing. You can probably use sed as you mentioned, or slightly modify the awk command as follows:

awk -F'\t' '{split($8,info,"ANN=");split(info[2],ann,","); for (i in ann) {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"info[1]"ANN="ann[i]"\t"$9"\t"$10}}'

ADD REPLY • link 19 months ago by iraun 6.2k