4 months ago
Kristina

I'm new to programming and I'm currently working on my thesis.

I'm working with multiple csv files and a json file containing genes with amino acid changes involved in antibiotic resistance. The csv files are formatted like this:

Gene_Aminoacids Filename
gyrA_S95T   SRR9851427
tlyA_L11L   SRR9851427
katG_R463L  SRR9851427


In the JSON file the genes are the keys, and the antibiotics each one affects are the values.

Example (a small part of the JSON file):

"gyrA_A74S" : ["Quinolones"],
"gyrA_D89X" : ["Quinolones"],
"tlyA_C-83T" : ["Capreomycin"],
"katG_R104Q" : ["Isoniazid"],
"katG_S315I" : ["Isoniazid"],
"katG_S315N" : ["Isoniazid"],
etc....


I want to find the keys (genes) that appear in both the JSON file and the CSV files. The new output should contain those shared genes, the corresponding antibiotic (the value), and the filename.

Example of the desired output:

 Gene_Aminoacids Antibiotic  Filename
"katG_R104Q" : ["Isoniazid"], SRR9851427


This is the code I have written so far. I have looked into similar questions, but the solutions didn't work on my data.

def retrive_rest_mutations(jsonfile):
    with open(jsonfile) as data_file:
        return(data.keys())


mutation_keys = retrive_rest_mutations("tb_TEST.json")

##Read & set path to folder containing a.a changes

path = "Replaced_P_G.ann.vcf"
samp = glob.glob(path + "/*_G.P.vcf_replaced.txt")

result = []

with open(file_path, 'r') as f:

##iterate through all files
def all_files():
    for file in os.listdir():
        if file.endswith(".txt"):
            file_path = f"{samp}/{file}"
            print("\n")


The code might be wrongly indented because I copied it. I'm uncertain how to do the matching between the JSON file and the multiple CSV files; there might be a simple solution to my issue.

Does anyone have a suggestion, or an idea of what I should look into to get the new output containing the Genes + Antibiotic + Filename?

Best regards

4 months ago
Shred

Based on what you've asked, this might work.

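import glob
import json

# load the gene -> antibiotics mapping (file name taken from the question)
with open('tb_TEST.json') as handle:
    jsonfile = json.load(handle)

files = glob.glob('*.csv')
for n in files:
    with open(n, 'r') as iput:
        for line in iput:
            # skip blank lines, which cannot be split into two columns
            if not line.strip():
                continue
            # assumes exactly two tab-separated columns per line
            gene, filename = line.rstrip('\n').split('\t')
            # use a try to handle KeyError
            try:
                antibiotic = jsonfile[gene]
                # found, now print
                print(f"{gene}:{antibiotic},{filename}")
            except KeyError:
                continue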
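If you want a file with the three columns from the question rather than printed lines, the same lookup can be wrapped in two small functions. This is only a sketch under a few assumptions: the input files are tab-separated with the gene in the first column and the sample name in the second, and the names `collect_matches`, `write_matches` and the output file `matches.tsv` are made up for illustration. Multiple antibiotics for one gene are joined with `;`.

```python
import csv
import glob
import json


def collect_matches(json_path, csv_pattern):
    """Return (gene, antibiotics, sample) tuples for genes present in the JSON file."""
    with open(json_path) as handle:
        resistance = json.load(handle)  # e.g. {"katG_S315I": ["Isoniazid"], ...}

    rows = []
    for path in sorted(glob.glob(csv_pattern)):
        with open(path, newline="") as handle:
            # assumption: columns are separated by a single tab
            for fields in csv.reader(handle, delimiter="\t"):
                if len(fields) < 2:
                    continue  # skip blank or malformed lines
                gene, sample = fields[0].strip(), fields[1].strip()
                # the header row is skipped too, since "Gene_Aminoacids" is not a key
                if gene in resistance:
                    rows.append((gene, ";".join(resistance[gene]), sample))
    return rows


def write_matches(rows, out_path):
    """Write the matches as a tab-separated table with a header row."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["Gene_Aminoacids", "Antibiotic", "Filename"])
        writer.writerows(rows)
```

With the file names from the question, `write_matches(collect_matches("tb_TEST.json", "*.csv"), "matches.tsv")` should produce one line per gene that occurs in both the JSON file and a CSV file. If your columns are separated by spaces rather than tabs, the `delimiter` would need to change.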