Question: Convert indel list with [-/A] notation to VCF with adjacent base
1
gravatar for Chris Miller
10 weeks ago by
Chris Miller20k
Washington University in St. Louis, MO
Chris Miller20k wrote:

I couldn't find a previous answer to this question, but have to believe something exists!

I'm trying to take a list of mutations that looks like this:

chr1    120995  120996  -/TAT
chr1    1090526 1090526 A/C
chr1    2856178 2856178 G/C
chr1    3975090 3975090 G/A
chr1    4420314 4420315 TA/-

And convert it to a bare-bones VCF so that I can run some annotation tools. Doing SNVs is easy enough - it's just shuffling some fields, adding a 0/1 GT column, etc.

Indels are more tricky, since proper VCF needs the adjacent base:

-/TAT   ->    A      ATAT
TA/-    ->    GTA    G

I'm sure I could whip something up hacky with samtools faidx, but someone has to have solved this problem already, right?

convert faidx bed vcf • 173 views
ADD COMMENTlink modified 10 weeks ago by finswimmer11k • written 10 weeks ago by Chris Miller20k

Indels are more tricky, since proper VCF needs the adjacent base

I think bcftools norm should do the trick with something like (not tested at all):

awk 'code to shuffle columns into vcf' $infile \
| bcftools norm -c s -f ref.fa > outfile.vcf
ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by dariober9.9k
0
gravatar for Chris Miller
10 weeks ago by
Chris Miller20k
Washington University in St. Louis, MO
Chris Miller20k wrote:

One option seems to be converting to VEP format, with something like this:

cut -f 1-5 $YOURFILE | perl -nae 'if($F[3] eq "-"){$F[2]=$F[1];$F[1]=$F[1]+1;};print join("\t",(@F[0..2],$F[3] . "/" . $F[4])) . "\n"'

(assumes input is tsv like chr1 123 124 - AA)

Then annotating with VEP to get a nicely formed VCF. Still interested in other answers that don't require a full annotation step.

ADD COMMENTlink modified 10 weeks ago • written 10 weeks ago by Chris Miller20k
0
gravatar for finswimmer
10 weeks ago by
finswimmer11k
Germany
finswimmer11k wrote:

My first thought was also bcftools norm like dariober suggested. But I couldn't make it work, to handle - or . as REF/ALT.

So here is a solution with awk, samtools faidx and parallel.

First prepare the header data for a valid vcf file:

$ echo "##fileformat=VCFv4.2" > output.vcf  
$ awk '{print "##contig=<ID="$1",length="$2">" >> "output.vcf"}' genome.fa.fai
$ echo "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"  >> output.vcf

Now let's do some magic with awk, samtools faidx and parallel. What has to be done, is to get the reference base before the position reported in your file, to convert -/TAT to A/ATAT or TA/- to GTA/G

$ awk -v OFS="\t" '{print $1, $2-1, $2, $4}' mutations.txt | parallel --colsep "\t" 'samtools faidx genome.fa {1}:{2}-{2} | awk -v OFS="\t" -v chr={1} -v start0={2} -v start1={3} -v genotype={4} -f makevcf.awk'

The makevcf.awk looks like this:

NR==2 { 
    split(genotype, gt, "/");
    if (gt[1] == "-") {
        print chr, start0, ".", $1, $1gt[2], ".", ".", ".";
    } else if (gt[2] == "-") {
        print chr, start0, ".", $1gt[1], $1, ".", ".", ".";
    } else {
        print chr, start1, ".", gt[1], gt[2], ".", ".", ".";
    }
}

I would recommend to run bcftools sortand bcftools norm on the new vcf file:

$ bcftools sort output.vcf|bcftools norm -f genome.fa > final.vcf

fin swimmer

EDIT: I assume that the first value of the given genotype is REF and the second ALT. But is this correct?

ADD COMMENTlink modified 10 weeks ago • written 10 weeks ago by finswimmer11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2120 users visited in the last hour