Hello hive mind,
I have been trying to automate a process that I use for flu sequencing. I currently use SPAdes to do a de novo assembly and then BLAST the contigs remotely using the NCBI command-line tools. I get back a tab-delimited file that has the contig, the sequence ID, the NCBI accession number and the raw score.
I have a small working script that pulls out just the flu contigs:
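    import re
    import csv
    import sys

    # The input file path is the first command-line argument
    with open(sys.argv[1], newline="") as csv_file, open("flu_hits.csv", "w", newline="") as justFlu:
        reader = csv.reader(csv_file, delimiter="\t")
        writer = csv.writer(justFlu, delimiter="\t")
        for row in reader:
            # row[1] is taken to be the sequence ID column; re.match only matches at the start of the string
            if re.match(r'influenza', row[1], re.I) is not None:
                writer.writerow(row)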
This works just fine. Now I want to take the flu file and select the best match for each contig, so I can build a list of accession numbers to pull from NCBI later and make a reference.
My current issue is pulling the highest score. The code below runs, but not the way I want: it returns only the single highest score across all the contigs, not the highest score for each contig.
    with open(sys.argv[1], newline="") as csv_file, open("temp.csv", "w", newline="") as subset:
        reader = csv.reader(csv_file, delimiter="\t")
        writer = csv.writer(subset, delimiter="\t")
        for row in reader:
            # NodeList (not shown here) holds the contig names; row[0] is taken to be the contig column
            if row[0] in NodeList:
                writer.writerow(row)

    with open("temp.csv", "r", newline="") as temp:
        subset_reader = csv.reader(temp, delimiter="\t")
        # max() runs over every row in the file, so this gives one best hit overall
        BestHit = max(subset_reader, key=lambda column: int(column[-1].replace(',', '')))
        print(BestHit)
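For reference, here is a rough sketch of what I am after, not a tested solution: keep a dictionary keyed on the contig name and replace the stored row only when a higher score turns up. It assumes the contig is in column 0, the accession in column 2 and the score in the last column. The function name `best_hit_per_contig` and the sample rows are made up purely for illustration.

```python
import csv
import io


def best_hit_per_contig(rows, contig_col=0, score_col=-1):
    """Return {contig: row}, keeping the highest-scoring row for each contig."""
    best = {}
    for row in rows:
        if not row:
            continue  # skip blank lines
        contig = row[contig_col]
        # Scores may contain thousands separators such as "1,200"
        score = float(row[score_col].replace(",", ""))
        if contig not in best or score > best[contig][0]:
            best[contig] = (score, row)
    return {contig: row for contig, (score, row) in best.items()}


# Made-up placeholder rows, only to show the assumed column layout:
# contig, sequence ID, accession, raw score
sample = (
    "NODE_1\tInfluenza A virus segment 4\tACC0001\t850\n"
    "NODE_1\tInfluenza A virus segment 4\tACC0002\t1,200\n"
    "NODE_2\tInfluenza A virus segment 6\tACC0003\t640\n"
    "NODE_2\tInfluenza A virus segment 6\tACC0004\t310\n"
)

hits = best_hit_per_contig(csv.reader(io.StringIO(sample), delimiter="\t"))
# Column 2 is assumed to hold the accession number
accessions = [row[2] for row in hits.values()]
```

With a real file, the `io.StringIO(sample)` part would be replaced by the opened flu_hits.csv handle. Ties keep the first row seen, which may or may not be what is wanted.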
I do this manually now, but it takes a lot longer than it should, so I would really appreciate any direction with this.
Thank you in advance, Sean