Question: Convert indel list with [-/A] notation to VCF with adjacent base
1
gravatar for Chris Miller
6 months ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

I couldn't find a previous answer to this question, but have to believe something exists!

I'm trying to take a list of mutations that looks like this:

chr1    120995  120996  -/TAT
chr1    1090526 1090526 A/C
chr1    2856178 2856178 G/C
chr1    3975090 3975090 G/A
chr1    4420314 4420315 TA/-

And convert it to a bare-bones VCF so that I can run some annotation tools. Doing SNVs is easy enough - it's just shuffling some fields, adding a 0/1 GT column, etc.

Indels are more tricky, since proper VCF needs the adjacent base:

-/TAT   ->    A      ATAT
TA/-    ->    GTA    G

I'm sure I could whip something up hacky with samtools faidx, but someone has to have solved this problem already, right?

convert faidx bed vcf • 303 views
ADD COMMENTlink modified 3 months ago by or.yaacov0 • written 6 months ago by Chris Miller21k

Indels are more tricky, since proper VCF needs the adjacent base

I think bcftools norm should do the trick with something like (not tested at all):

awk 'code to shuffle columns into vcf' $infile \
| bcftools norm -c s -f ref.fa > outfile.vcf
ADD REPLYlink modified 6 months ago • written 6 months ago by dariober10k
0
gravatar for Chris Miller
6 months ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

One option seems to be converting to VEP format, with something like this:

cut -f 1-5 $YOURFILE | perl -nae 'if($F[3] eq "-"){$F[2]=$F[1];$F[1]=$F[1]+1;};print join("\t",(@F[0..2],$F[3] . "/" . $F[4])) . "\n"'

(assumes input is tsv like chr1 123 124 - AA)

Then annotating with VEP to get a nicely formed VCF. Still interested in other answers that don't require a full annotation step.

ADD COMMENTlink modified 6 months ago • written 6 months ago by Chris Miller21k
0
gravatar for finswimmer
6 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

My first thought was also bcftools norm like dariober suggested. But I couldn't make it work, to handle - or . as REF/ALT.

So here is a solution with awk, samtools faidx and parallel.

First prepare the header data for a valid vcf file:

$ echo "##fileformat=VCFv4.2" > output.vcf  
$ awk '{print "##contig=<ID="$1",length="$2">" >> "output.vcf"}' genome.fa.fai
$ echo "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"  >> output.vcf

Now let's do some magic with awk, samtools faidx and parallel. What has to be done, is to get the reference base before the position reported in your file, to convert -/TAT to A/ATAT or TA/- to GTA/G

$ awk -v OFS="\t" '{print $1, $2-1, $2, $4}' mutations.txt | parallel --colsep "\t" 'samtools faidx genome.fa {1}:{2}-{2} | awk -v OFS="\t" -v chr={1} -v start0={2} -v start1={3} -v genotype={4} -f makevcf.awk'

The makevcf.awk looks like this:

NR==2 { 
    split(genotype, gt, "/");
    if (gt[1] == "-") {
        print chr, start0, ".", $1, $1gt[2], ".", ".", ".";
    } else if (gt[2] == "-") {
        print chr, start0, ".", $1gt[1], $1, ".", ".", ".";
    } else {
        print chr, start1, ".", gt[1], gt[2], ".", ".", ".";
    }
}

I would recommend to run bcftools sortand bcftools norm on the new vcf file:

$ bcftools sort output.vcf|bcftools norm -f genome.fa > final.vcf

fin swimmer

EDIT: I assume that the first value of the given genotype is REF and the second ALT. But is this correct?

ADD COMMENTlink modified 6 months ago • written 6 months ago by finswimmer11k
0
gravatar for or.yaacov
3 months ago by
or.yaacov0
or.yaacov0 wrote:

Hi, I couldn't find one so I wrote a simple R script that does exactly that- converting from a (A -> "-") notation to proper VCF notation (GA -> G) with a leading base:

https://github.com/orr161/VCF-indel-converter (available as a notebook w instructions too)

Basically, using BSgenome (of Bioconductor) to create a function that can fetch the leading base/sequence from hg19 (can be changed):

findbase <- function(chr, pos, len=0) { letter <- toString(Hsapiens[[chr]][pos:(pos+len)]) return(letter) }

and then iterating over the file as a dataframe.

Or.

ADD COMMENTlink written 3 months ago by or.yaacov0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2164 users visited in the last hour