Question: (Closed) Annovar: how to update sequence length based on deletion size
4.1 years ago by drautuna (United States)
drautuna wrote:

Hi, I'm using Annovar for its gene annotation capabilities, and I need some help changing how the input is set up.

 

The complete command I run for gene annotation is:

     ./annotate_variation.pl -geneanno -buildver hg19 MyData.avinput humandb

Here is a small example portion of my data:

1       101814       101814       G       T      rs1231
1       1018940       1018940       T       C       rs546754564
1       1020131       1020131       A       -       rs234324
1       1032136       1032136       -       T       rs21313
1       1020514       1020514       T       G      rs645654
1       1022394       1022394       C       G      rs4356354
1       9023126      9023126       TA       -       rs4542342
1       10270690       10270690       CTCA     -       rs3275676

The first two variants are simple base substitutions, the third is a single-base deletion, the fourth is an insertion, the fifth and sixth are again substitutions, and the last two are multi-base deletions. All of the variants get annotated except the last two, which produce an invalid_input error. This is because the end position (column 3) of those two rows must reflect the length of the DNA being deleted, in this case 2 bp and 4 bp respectively.

In order to fix it, we'd have to make the last two lines say

1       9023126      9023127       TA       -       
1       10270690       10270693       CTCA     - 

To properly reflect the length of the sequence being deleted.

The problem is that my mentor gave me the data in this erroneous format, with a great many variants in this form, so I cannot fix them manually. How might I do this computationally?

 

I know the pseudocode for such a problem would be:

1) check whether the row is a deletion by

a) checking whether the 5th column of that row is a minus "-" character, and then

b) if (a) is true, checking whether the 4th column of the same row is a string of letters

If both are true, then

2) count how many letters the 4th column contains, and call that value "n"

3) add n-1 to the value in column 3.

How might I carry this out computationally on UNIX? I'm still kind of a novice at bioinformatics, but I'm pretty decent with OOP in my coursework. Thanks.
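
The pseudocode above could be sketched in Python roughly as follows. This is only a minimal illustration under a few assumptions that are not stated in the question: fields are separated by whitespace and contain no embedded spaces, the output is written tab-separated, and column 3 is only widened when it still equals column 2 (a guard added here so that rows that are already correct are not shifted twice). The names fix_deletion_end and fix_file are hypothetical.

```python
def fix_deletion_end(line: str) -> str:
    """Return one avinput row, widening column 3 for multi-base deletions."""
    fields = line.split()  # assumption: whitespace-separated fields
    if len(fields) < 5:
        return line  # not a variant row; leave it untouched
    start, end, ref, alt = fields[1], fields[2], fields[3], fields[4]
    # Step 1: a deletion has "-" in column 5 and letters in column 4.
    if alt == "-" and ref.isalpha():
        n = len(ref)  # Step 2: number of deleted bases
        # Step 3: add n-1 to column 3. The end == start check is an added
        # assumption so that already-corrected rows are left alone.
        if end == start:
            fields[2] = str(int(end) + n - 1)
    return "\t".join(fields)  # assumption: tab-separated output


def fix_file(in_path: str, out_path: str) -> None:
    """Apply fix_deletion_end to every line of in_path, writing out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(fix_deletion_end(line.rstrip("\n")) + "\n")


# Small demonstration on two rows from the example data.
for row in ("1 101814 101814 G T rs1231",
            "1 9023126 9023126 TA - rs4542342"):
    print(fix_deletion_end(row))
```

For a whole file, something like fix_file("MyData.avinput", "MyData.fixed.avinput") would write a corrected copy; the output file name is made up for this example.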

 

 

 

modified 4.1 years ago by Sam • written 4.1 years ago by drautuna

Hello drautuna!

We believe that this post does not fit the main topic of this site.

Problem solved

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

written 4.1 years ago by drautuna
4.1 years ago by Sam (New York)
Sam wrote:

I would just use the following awk command:

awk 'BEGIN { OFS = "\t" } { print $1, $2, $2 + length($4) - 1, $4, $5, $6 }' MyData.avinput

All this code does is replace the third column of the input file with (number in the 2nd column + length of the string in the 4th column - 1). I am not sure whether this will solve all of your problems, but you can definitely give it a go.
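
Unlike the pseudocode in the question, this formula is applied to every row without first checking for a deletion. The short Python sketch below, using only the eight sample rows from the question, suggests why that is harmless for this data: a one-character 4th column (a single base, or the "-" of an insertion) gives start + 1 - 1, so only the multi-base deletions change. The function name recompute_end is hypothetical, and whether the same holds for other kinds of rows in a real file is an assumption worth checking.

```python
def recompute_end(start: int, ref: str) -> int:
    """End position as computed by the formula: start + len(ref) - 1."""
    return start + len(ref) - 1


# (start, end, ref) taken from the sample rows in the question.
sample = [
    (101814, 101814, "G"),
    (1018940, 1018940, "T"),
    (1020131, 1020131, "A"),
    (1032136, 1032136, "-"),
    (1020514, 1020514, "T"),
    (1022394, 1022394, "C"),
    (9023126, 9023126, "TA"),
    (10270690, 10270690, "CTCA"),
]

for start, end, ref in sample:
    new_end = recompute_end(start, ref)
    status = "unchanged" if new_end == end else f"{end} -> {new_end}"
    print(start, ref, status)
```

Note also that the awk command prints only the first six columns, so any additional columns in the input would be dropped.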

written 4.1 years ago by Sam

Just what I needed, thanks.

written 4.1 years ago by drautuna
The thread is closed. No new answers may be added.
