How to efficiently get reference bases (hg19) for a list of SNV loci (e.g., from a MAF file)?
1
0
Entering edit mode
3.2 years ago
mrz132435 ▴ 20

I have an MAF file with about a couple million SNVs in it from various samples, and I'd like to verify that the file has the right reference bases. Since I'm not really looking for a 'sequence' it seems like querying Ensembl for each one wouldn't be vey efficient. Is there any straightforward way to just submit a table of chromosomes and positions and get the hg19 reference bases at those loci?

A somewhat related question (as it may be why I'm getting 'wrong reference base' errors for a tool I'm trying to use, motivating this post): when an MAF file has a 'start' and 'end' position for an SNV that differ by 1, which one is the actual position of the SNV? Thanks.

MAF SNV • 1.0k views
0
Entering edit mode
0
Entering edit mode

Have no script ready now but the concept is actually quite simply:

Get the variants in BED format and then use bedtools getfasta to get the nucleotide for each position. I would simply make a two-column table with $1 being the nucleotide from the MAF and$2 the one from getfasta. A simple one-liner, e.g. awk 'OFS="\t" {if ($1 !=$2) print NR}' file.txt would tell you then if there were any mismatches and in which row of the file so that you know where in the MAF file you have to clean up.

0
Entering edit mode
3.2 years ago

I've used twoBitToFa: https://genome.ucsc.edu/goldenpath/help/twoBit.html

You can use bash scripting with xargs to check them in batch, otherwise I have a python snippet that calls it. You could loop through a list of variants and check them with the get_ref function below.

from subprocess import check_output
import os
import sys

class Fixer(object):

def __init__(self):
self.exe = "twoBitToFa"
self.genome = "hg19.2bit"

def get_ref(self, chromosome, start, stop):
# Adjust coordinates from VCF input
start_str = "-start={0}".format(str(start - 1))  # 0 based
stop_str = "-end={0}".format(str(stop))  # Don't remove 1 here because it is [start,stop)
chromosome_str = "-seq=chr{0}".format(chromosome)

# Handle case where X,Y is reported as 23,24
if chromosome_str == "-seq=chr23":
chromosome_str = "-seq=chrX"
elif chromosome_str == "-seq=chr24":
chromosome_str = "-seq=chrY"

# Run command
output =  check_output(  [ self.exe, self.genome, 'stdout', chromosome_str, start_str, stop_str ]  )
return "".join( output.split('\n')[1:] )