Question

How to efficiently get reference bases (hg19) for a list of SNV loci (e.g., from a MAF file)?

0

Entering edit mode

4.7 years ago

mrz132435 ▴ 20

I have an MAF file with about a couple million SNVs in it from various samples, and I'd like to verify that the file has the right reference bases. Since I'm not really looking for a 'sequence' it seems like querying Ensembl for each one wouldn't be vey efficient. Is there any straightforward way to just submit a table of chromosomes and positions and get the hg19 reference bases at those loci?

A somewhat related question (as it may be why I'm getting 'wrong reference base' errors for a tool I'm trying to use, motivating this post): when an MAF file has a 'start' and 'end' position for an SNV that differ by 1, which one is the actual position of the SNV? Thanks.

MAF SNV • 1.4k views

ADD COMMENT • link updated 4.7 years ago by manuel.belmadani ★ 1.3k • written 4.7 years ago by mrz132435 ▴ 20

0

Entering edit mode

duplicate: R: How to loop to extract all sequences within specific positions? ; Extract User Defined Region From An Fasta File ; ...

ADD REPLY • link 4.7 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Have no script ready now but the concept is actually quite simply:

Get the variants in BED format and then use bedtools getfasta to get the nucleotide for each position. I would simply make a two-column table with $1 being the nucleotide from the MAF and $2 the one from getfasta. A simple one-liner, e.g. awk 'OFS="\t" {if ($1 != $2) print NR}' file.txt would tell you then if there were any mismatches and in which row of the file so that you know where in the MAF file you have to clean up.

ADD REPLY • link 4.7 years ago by ATpoint 81k

score 0 · Answer 1 · 2019-07-25

I've used twoBitToFa: https://genome.ucsc.edu/goldenpath/help/twoBit.html

You can use bash scripting with xargs to check them in batch, otherwise I have a python snippet that calls it. You could loop through a list of variants and check them with the get_ref function below.

from subprocess import check_output
import os
import sys


class Fixer(object):

    def __init__(self):
        self.exe = "twoBitToFa"
        self.genome = "hg19.2bit"

    def get_ref(self, chromosome, start, stop):
        # Adjust coordinates from VCF input
        start_str = "-start={0}".format(str(start - 1))  # 0 based
        stop_str = "-end={0}".format(str(stop))  # Don't remove 1 here because it is [start,stop)
        chromosome_str = "-seq=chr{0}".format(chromosome)

        # Handle case where X,Y is reported as 23,24
        if chromosome_str == "-seq=chr23":
            chromosome_str = "-seq=chrX"
        elif chromosome_str == "-seq=chr24":
            chromosome_str = "-seq=chrY"

        # Run command
        output =  check_output(  [ self.exe, self.genome, 'stdout', chromosome_str, start_str, stop_str ]  )
        return "".join( output.split('\n')[1:] )