Question: How to efficiently get reference bases (hg19) for a list of SNV loci (e.g., from a MAF file)?
gravatar for mrz132435
16 months ago by
mrz13243520 wrote:

I have an MAF file with about a couple million SNVs in it from various samples, and I'd like to verify that the file has the right reference bases. Since I'm not really looking for a 'sequence' it seems like querying Ensembl for each one wouldn't be vey efficient. Is there any straightforward way to just submit a table of chromosomes and positions and get the hg19 reference bases at those loci?

A somewhat related question (as it may be why I'm getting 'wrong reference base' errors for a tool I'm trying to use, motivating this post): when an MAF file has a 'start' and 'end' position for an SNV that differ by 1, which one is the actual position of the SNV? Thanks.

snv maf • 375 views
ADD COMMENTlink modified 16 months ago by manuel.belmadani1.2k • written 16 months ago by mrz13243520

Have no script ready now but the concept is actually quite simply:

Get the variants in BED format and then use bedtools getfasta to get the nucleotide for each position. I would simply make a two-column table with $1 being the nucleotide from the MAF and $2 the one from getfasta. A simple one-liner, e.g. awk 'OFS="\t" {if ($1 != $2) print NR}' file.txt would tell you then if there were any mismatches and in which row of the file so that you know where in the MAF file you have to clean up.

ADD REPLYlink modified 16 months ago • written 16 months ago by ATpoint42k
gravatar for manuel.belmadani
16 months ago by
manuel.belmadani1.2k wrote:

I've used twoBitToFa:

You can use bash scripting with xargs to check them in batch, otherwise I have a python snippet that calls it. You could loop through a list of variants and check them with the get_ref function below.

from subprocess import check_output
import os
import sys

class Fixer(object):

    def __init__(self):
        self.exe = "twoBitToFa"
        self.genome = "hg19.2bit"

    def get_ref(self, chromosome, start, stop):
        # Adjust coordinates from VCF input
        start_str = "-start={0}".format(str(start - 1))  # 0 based
        stop_str = "-end={0}".format(str(stop))  # Don't remove 1 here because it is [start,stop)
        chromosome_str = "-seq=chr{0}".format(chromosome)

        # Handle case where X,Y is reported as 23,24
        if chromosome_str == "-seq=chr23":
            chromosome_str = "-seq=chrX"
        elif chromosome_str == "-seq=chr24":
            chromosome_str = "-seq=chrY"

        # Run command
        output =  check_output(  [ self.exe, self.genome, 'stdout', chromosome_str, start_str, stop_str ]  )
        return "".join( output.split('\n')[1:] )
ADD COMMENTlink written 16 months ago by manuel.belmadani1.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1024 users visited in the last hour