Get gVCF file from genome coordinates
1
0
Entering edit mode
8.8 years ago
Paul ★ 1.5k

Dear all,

I have bed file like:

Chr \t start \t stop

And I woud like to add to fourth column information about reference allele from hg19, so output for example should look like:

chr \t start \t stop \t A/A
chr \t start \t stop \t C/C
chr \t start \t stop \t T/T
chr \t start \t stop \t A/A

So create something very similar to gVCF. Reason is, that I need to annotate this output file in VEP.

gVCF coorinates hg19 • 2.2k views
ADD COMMENT
0
Entering edit mode

Do you know a bit of Python? I have a script which almost does what you want.

ADD REPLY
0
Entering edit mode

Thank you for reply. I am scripting more in bash and awk.. But maybe I would understand..

ADD REPLY
0
Entering edit mode

How big is your file? Wondering whether to do everything in memory or stream.

ADD REPLY
0
Entering edit mode

It could be about 500000 rows and about 20 MB.

ADD REPLY
1
Entering edit mode
8.7 years ago

This should do the trick, let me know if it doesn't :) The script uses as arguments i) the file to add reference nucleotides to ii) a directory containing single chromosome fasta files named like chr1.fa. Obviously you can change this

Let me know if something is not as it should be.

#wdecoster
import sys, os
from Bio import SeqIO
def createDict(tsvfile):
'''Input is a tsv file and returned is a dictionary with key=chromosome and value=list of tuples with positions'''
if os.path.isfile(tsvfile):
chrdict = { 'chr1' : [], 'chr2' : [], 'chr3' : [], 'chr4' : [], 'chr5' : [], 'chr6' : [],
'chr7' : [], 'chr8' : [], 'chr9' : [], 'chr10' : [],'chr11' : [], 'chr12' : [],
'chr13' : [], 'chr14' : [], 'chr15' : [], 'chr16' : [], 'chr17' : [], 'chr18' : [],
'chr19' :[], 'chr20' : [], 'chr21' : [], 'chr22' : [], 'chrX' : [], 'chrY' : [], 'chrM' : []}
for key in chrdict.keys():
if not os.path.isfile(dbdir + key + '.fa'):
sys.exit("Couldn't find fasta file for {}, expected {} in {}".format(dictkey, dictkey + '.fa', dbdir))
with open(tsvfile, 'r') as input:
for line in input.readlines():
linelist = line.strip().split('\t')
try:
chrdict['chr' + linelist[0]].append((int(linelist[1]), int(linelist[2])))
except KeyError:
try:
chrdict[linelist[0]].append((int(linelist[1]), int(linelist[2])))
except KeyError:
if line.startswith('#'):
continue
if line.startswith('chromosome\tbegin\tend\ttype\tref'):
continue
else:
sys.exit("Unexpected input line: {}".format(' '.join(linelist[:4])))
except ValueError:
sys.exit("Unexpected input line: {}".format('\t'.join(linelist[:4])))
for key, value in chrdict.items():
if len(value) == 0:
del chrdict[key]
return(chrdict)
else:
sys.exit("The path to {} is incorrect or the file doesn't exist".format(tsvfile))
def natural_sort(l):
import re
'''Performs natural sorting of the input list, suitable for chromosomes'''
convert = lambda text: int(text) if text.isdigit() else text.lower()
alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ]
return sorted(l, key = alphanum_key)
def addrefnucl(dictkey):
fasta = SeqIO.read(dbdir + dictkey + '.fa', "fasta")
for position in tsvdict[dictkey]:
referencenucleotide = fasta.seq[position[0]:position[1]].upper()
print("{}".format('\t'.join([dictkey, str(position[0]), str(position[1]), str(referencenucleotide) + '/' + str(referencenucleotide)])))
dbdir = sys.argv[2]
tsvdict = createDict(sys.argv[1])
for key in natural_sort(tsvdict.keys()):
addrefnucl(key)
view raw addrefnucl.py hosted with ❤ by GitHub

ADD COMMENT
0
Entering edit mode

WouterDeCoster Thank you for very nice code. I am trying to run thi code but still it failed. My syntax:

python anotate.py input.tsv  path/to/chr/directory

And I have error message:

File "anotate.py", line 57 addrefnucl(key) ^ IndentationError: expected an indented block

Please if you have any idea what I am doing wrong. Thank you so much.

ADD REPLY
0
Entering edit mode

There is a tab error, maybe what I wrote or copy paste. I'm at a conference and will look into this asap.

ADD REPLY
0
Entering edit mode

If that's the complete error message, you did something weird while copying the file. Could you check that the final line is properly indented with a tab (as in my post above). If there is a longer error message, I would like to see that.

ADD REPLY

Login before adding your answer.

Traffic: 10737 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6