Get gVCF file from genome coordinates

0

Entering edit mode

8.9 years ago

Paul ★ 1.5k

Dear all,

I have bed file like:

Chr \t start \t stop

And I woud like to add to fourth column information about reference allele from hg19, so output for example should look like:

chr \t start \t stop \t A/A
chr \t start \t stop \t C/C
chr \t start \t stop \t T/T
chr \t start \t stop \t A/A

So create something very similar to gVCF. Reason is, that I need to annotate this output file in VEP.

gVCF coorinates hg19 • 2.2k views

ADD COMMENT • link updated 8.9 years ago by WouterDeCoster 48k • written 8.9 years ago by Paul ★ 1.5k

0

Entering edit mode

Do you know a bit of Python? I have a script which almost does what you want.

ADD REPLY • link 8.9 years ago by WouterDeCoster 48k

0

Entering edit mode

Thank you for reply. I am scripting more in bash and awk.. But maybe I would understand..

ADD REPLY • link 8.9 years ago by Paul ★ 1.5k

0

Entering edit mode

How big is your file? Wondering whether to do everything in memory or stream.

ADD REPLY • link 8.9 years ago by WouterDeCoster 48k

0

Entering edit mode

It could be about 500000 rows and about 20 MB.

ADD REPLY • link 8.9 years ago by Paul ★ 1.5k

1

Entering edit mode

8.9 years ago

WouterDeCoster 48k

This should do the trick, let me know if it doesn't :) The script uses as arguments i) the file to add reference nucleotides to ii) a directory containing single chromosome fasta files named like chr1.fa. Obviously you can change this

Let me know if something is not as it should be.

	#wdecoster

	import sys, os
	from Bio import SeqIO

	def createDict(tsvfile):
	'''Input is a tsv file and returned is a dictionary with key=chromosome and value=list of tuples with positions'''
	if os.path.isfile(tsvfile):
	chrdict = { 'chr1' : [], 'chr2' : [], 'chr3' : [], 'chr4' : [], 'chr5' : [], 'chr6' : [],
	'chr7' : [], 'chr8' : [], 'chr9' : [], 'chr10' : [],'chr11' : [], 'chr12' : [],
	'chr13' : [], 'chr14' : [], 'chr15' : [], 'chr16' : [], 'chr17' : [], 'chr18' : [],
	'chr19' :[], 'chr20' : [], 'chr21' : [], 'chr22' : [], 'chrX' : [], 'chrY' : [], 'chrM' : []}
	for key in chrdict.keys():
	if not os.path.isfile(dbdir + key + '.fa'):
	sys.exit("Couldn't find fasta file for {}, expected {} in {}".format(dictkey, dictkey + '.fa', dbdir))
	with open(tsvfile, 'r') as input:
	for line in input.readlines():
	linelist = line.strip().split('\t')
	try:
	chrdict['chr' + linelist[0]].append((int(linelist[1]), int(linelist[2])))
	except KeyError:
	try:
	chrdict[linelist[0]].append((int(linelist[1]), int(linelist[2])))
	except KeyError:
	if line.startswith('#'):
	continue
	if line.startswith('chromosome\tbegin\tend\ttype\tref'):
	continue
	else:
	sys.exit("Unexpected input line: {}".format(' '.join(linelist[:4])))
	except ValueError:
	sys.exit("Unexpected input line: {}".format('\t'.join(linelist[:4])))
	for key, value in chrdict.items():
	if len(value) == 0:
	del chrdict[key]
	return(chrdict)
	else:
	sys.exit("The path to {} is incorrect or the file doesn't exist".format(tsvfile))

	def natural_sort(l):
	import re
	'''Performs natural sorting of the input list, suitable for chromosomes'''
	convert = lambda text: int(text) if text.isdigit() else text.lower()
	alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ]
	return sorted(l, key = alphanum_key)

	def addrefnucl(dictkey):
	fasta = SeqIO.read(dbdir + dictkey + '.fa', "fasta")
	for position in tsvdict[dictkey]:
	referencenucleotide = fasta.seq[position[0]:position[1]].upper()
	print("{}".format('\t'.join([dictkey, str(position[0]), str(position[1]), str(referencenucleotide) + '/' + str(referencenucleotide)])))

	dbdir = sys.argv[2]
	tsvdict = createDict(sys.argv[1])

	for key in natural_sort(tsvdict.keys()):
	addrefnucl(key)

view raw addrefnucl.py hosted with ❤ by GitHub

ADD COMMENT • link 8.9 years ago by WouterDeCoster 48k

0

Entering edit mode

WouterDeCoster Thank you for very nice code. I am trying to run thi code but still it failed. My syntax:

python anotate.py input.tsv  path/to/chr/directory

And I have error message:

File "anotate.py", line 57 addrefnucl(key) ^ IndentationError: expected an indented block

Please if you have any idea what I am doing wrong. Thank you so much.

ADD REPLY • link 8.9 years ago by Paul ★ 1.5k

0

Entering edit mode

There is a tab error, maybe what I wrote or copy paste. I'm at a conference and will look into this asap.

ADD REPLY • link 8.9 years ago by WouterDeCoster 48k

0

Entering edit mode

If that's the complete error message, you did something weird while copying the file. Could you check that the final line is properly indented with a tab (as in my post above). If there is a longer error message, I would like to see that.

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

Login before adding your answer.