Question: Convert indel list with [-/A] notation to VCF with adjacent base
1
6 months ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

I couldn't find a previous answer to this question, but have to believe something exists!

I'm trying to take a list of mutations that looks like this:

``````chr1    120995  120996  -/TAT
chr1    1090526 1090526 A/C
chr1    2856178 2856178 G/C
chr1    3975090 3975090 G/A
chr1    4420314 4420315 TA/-
``````

And convert it to a bare-bones VCF so that I can run some annotation tools. Doing SNVs is easy enough - it's just shuffling some fields, adding a 0/1 GT column, etc.

Indels are more tricky, since proper VCF needs the adjacent base:

``````-/TAT   ->    A      ATAT
TA/-    ->    GTA    G
``````

I'm sure I could whip something up hacky with samtools faidx, but someone has to have solved this problem already, right?

convert faidx bed vcf • 303 views
modified 3 months ago by or.yaacov0 • written 6 months ago by Chris Miller21k

Indels are more tricky, since proper VCF needs the adjacent base

I think bcftools norm should do the trick with something like (not tested at all):

``````awk 'code to shuffle columns into vcf' \$infile \
| bcftools norm -c s -f ref.fa > outfile.vcf
``````
0
6 months ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

One option seems to be converting to VEP format, with something like this:

``````cut -f 1-5 \$YOURFILE | perl -nae 'if(\$F[3] eq "-"){\$F[2]=\$F[1];\$F[1]=\$F[1]+1;};print join("\t",(@F[0..2],\$F[3] . "/" . \$F[4])) . "\n"'
``````

(assumes input is tsv like `chr1 123 124 - AA`)

Then annotating with VEP to get a nicely formed VCF. Still interested in other answers that don't require a full annotation step.

0
6 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

My first thought was also `bcftools norm` like dariober suggested. But I couldn't make it work, to handle `-` or `.` as REF/ALT.

So here is a solution with `awk`, `samtools faidx` and `parallel`.

First prepare the header data for a valid vcf file:

``````\$ echo "##fileformat=VCFv4.2" > output.vcf
\$ awk '{print "##contig=<ID="\$1",length="\$2">" >> "output.vcf"}' genome.fa.fai
\$ echo "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"  >> output.vcf
``````

Now let's do some magic with `awk`, `samtools faidx` and `parallel`. What has to be done, is to get the reference base before the position reported in your file, to convert `-/TAT` to `A/ATAT` or `TA/-` to `GTA/G`

``````\$ awk -v OFS="\t" '{print \$1, \$2-1, \$2, \$4}' mutations.txt | parallel --colsep "\t" 'samtools faidx genome.fa {1}:{2}-{2} | awk -v OFS="\t" -v chr={1} -v start0={2} -v start1={3} -v genotype={4} -f makevcf.awk'
``````

The `makevcf.awk` looks like this:

``````NR==2 {
split(genotype, gt, "/");
if (gt[1] == "-") {
print chr, start0, ".", \$1, \$1gt[2], ".", ".", ".";
} else if (gt[2] == "-") {
print chr, start0, ".", \$1gt[1], \$1, ".", ".", ".";
} else {
print chr, start1, ".", gt[1], gt[2], ".", ".", ".";
}
}
``````

I would recommend to run `bcftools sort`and `bcftools norm` on the new vcf file:

``````\$ bcftools sort output.vcf|bcftools norm -f genome.fa > final.vcf
``````

fin swimmer

EDIT: I assume that the first value of the given genotype is REF and the second ALT. But is this correct?

0
3 months ago by
or.yaacov0 wrote:

Hi, I couldn't find one so I wrote a simple R script that does exactly that- converting from a (A -> "-") notation to proper VCF notation (GA -> G) with a leading base:

https://github.com/orr161/VCF-indel-converter (available as a notebook w instructions too)

Basically, using BSgenome (of Bioconductor) to create a function that can fetch the leading base/sequence from hg19 (can be changed):

``````findbase <- function(chr, pos, len=0) { letter <- toString(Hsapiens[[chr]][pos:(pos+len)]) return(letter) }
``````

and then iterating over the file as a dataframe.

Or.