Question

Annovar: how to update sequence length based on deletion size

0

10.4 years ago

drautuna ▴ 60

Hi, I'm using Annovar for its gene annotation capabilities, and I need some help changing how the input is set up.

The complete command I run is:

./annotate_variation.pl -geneanno -buildver hg19 MyData.avinput humandb

Here is a small example portion of my data:

1       101814       101814       G       T      rs1231
1       1018940       1018940       T       C       rs546754564
1       1020131       1020131       A       -       rs234324
1       1032136       1032136       -       T       rs21313
1       1020514       1020514       T       G      rs645654
1       1022394       1022394       C       G      rs4356354
1       9023126      9023126       TA       -       rs4542342
1       10270690       10270690       CTCA     -       rs3275676

Here the first two variants are simple base substitutions, the third is a deletion of a single base, and the last two are deletions of several bases. All of the variants get annotated except the last two, which produce an invalid_input error. This is because the range given in columns 2 and 3 must reflect the length of the DNA being deleted, in this case 2 bp and 4 bp respectively.

In order to fix it, we'd have to make the last two lines say

1       9023126      9023127       TA       -       
1       10270690       10270693       CTCA     -

This properly reflects the length of the sequence being deleted.

The problem is that my mentor gave me the data in this erroneous format, with many, many variants in this form, so I cannot fix it manually. How might I do this computationally?

I know the pseudocode for such a problem would first

1) check if it was a deletion by

a) checking if the 5th column is a minus "-" character for that row, and then

b) checking the 4th column in that same row, if (a) was true, and seeing if it was a string of letters

if the latter is true, then

2) check how many letters the 4th column is, call that value "n"

3) add n-1 to the value in column 3.
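
The steps above can be sketched in Python as follows. This is only a minimal sketch, assuming whitespace-separated columns in the order shown in the example data (chromosome, start, end, reference, alternate, ID). The function name `fix_deletion_end` is hypothetical and is not part of Annovar.

```python
def fix_deletion_end(line):
    """Return the row tab-separated, with column 3 extended if it is a deletion."""
    fields = line.split()
    # Only inspect rows that have at least the five columns the steps refer to.
    if len(fields) >= 5:
        ref, alt = fields[3], fields[4]
        # Step 1: a deletion has "-" in column 5 and a string of letters in column 4.
        if alt == "-" and ref.isalpha():
            # Step 2: n is the number of deleted bases.
            n = len(ref)
            # Step 3: add n - 1 to the value in column 3.
            fields[2] = str(int(fields[2]) + n - 1)
    return "\t".join(fields)


if __name__ == "__main__":
    # One of the failing rows from the example data.
    example = "1       9023126      9023126       TA       -       rs4542342"
    print(fix_deletion_end(example))
```

Because this adds to the existing end coordinate, it assumes column 3 still equals column 2 for every deletion. Running it a second time on already corrected rows would extend the range again.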

How might I carry this out computationally on UNIX? I'm still kind of a novice at bioinformatics, but I'm pretty decent with OOP in my coursework. Thanks.

sequence Annovar • 2.2k views

updated 3.2 years ago by Ram 45k • written 10.4 years ago by drautuna ▴ 60

Ram · Accepted Answer · 2015-03-02

3

10.4 years ago

Sam ★ 4.8k

Personally, I would just use the following awk command:

awk '{print $1"\t"$2"\t"($2+length($4)-1)"\t"$4"\t"$5"\t"$6}' <input file>

All this code does is replace the third column of the input file with (number in the second column + length of the string in the fourth column - 1). I am not sure whether this will solve all of your problems, but you can definitely give it a go.
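
For anyone who would rather not use awk, roughly the same calculation can be written in Python. This is a sketch, assuming the same whitespace-separated columns as in the question; the name `recompute_end` is made up for illustration. Like the awk command, it recomputes the end from the start for every row rather than testing for deletions first. That leaves single-base rows unchanged, and also insertion rows, where column 4 is "-" and so has length 1.

```python
def recompute_end(line):
    """Set column 3 to start + len(column 4) - 1 and return a tab-separated row."""
    fields = line.split()
    if len(fields) < 4:
        # Too few columns to compute anything; pass the row through.
        return "\t".join(fields)
    start = int(fields[1])
    fields[2] = str(start + len(fields[3]) - 1)
    # Unlike the awk command, any columns after the sixth are kept as well.
    return "\t".join(fields)


if __name__ == "__main__":
    # A few rows taken from the example data in the question.
    rows = [
        "1       101814       101814       G       T      rs1231",
        "1       1032136       1032136       -       T       rs21313",
        "1       10270690       10270690       CTCA     -       rs3275676",
    ]
    for row in rows:
        print(recompute_end(row))
```

To process a whole file, the same function could be applied to each line of the opened input and the results written to a new file, which can then be passed to annotate_variation.pl in place of the original.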

updated 3.2 years ago by Ram 45k • written 10.4 years ago by Sam ★ 4.8k

0

Just what I needed, thanks.

updated 3.2 years ago by Ram 45k • written 10.4 years ago by drautuna ▴ 60