Question: python script to sort blast results and fetch best matches from NCBI
0
gravatar for skbrimer
11 days ago by
skbrimer440
United States
skbrimer440 wrote:

Hello hive mind,

I have been trying to automate a process that I do for flu sequencing. I currently use SPAdes to do a de novo build and then blast the contigs using the ncbi command line, remotely. I get back a tab delimited csv file that has the contig, the sequence id, the ncbi acc # and the raw score.

I have working a small script to pull just flu contigs

import re
import csv
import sys

with open(sys.argv[1], newline="") as csv_file, open("flu_hits.csv","w") as justFlu:
    reader = csv.reader(csv_file, delimiter="\t")
    writer = csv.writer(justFlu, delimiter="\t")
    for row in reader:
        if re.match(r'influenza',row[1], re.I) != None:
           writer.writerow(row)

this works just fine, however now I want to take the flu file and select the best match for each contig so I can make a list of the accession numbers to later then use to pull from ncbi to make a reference.

My current issue is trying to pull the highest score. I have some code (below) that is working but not the way I want. It is only returning the highest score from all the contigs not the highest score for each contig.

with open(sys.argv[1], newline="") as csv_file, open("temp.csv","w") as subset:
    reader = csv.reader(csv_file, delimiter="\t")
    writer = csv.writer(subset, delimiter="\t")
    for row in reader:
        if row[0] in NodeList:
            writer.writerow(row)
    with open("temp.csv", "r") as temp:
        subset_reader = csv.reader(temp, delimiter="\t")
        BestHit = max(subset_reader, key=lambda column: int(column[-1].replace(',','')))
        print(BestHit)

I do this manually now but it take a lot longer than it should, so I would really appreciate any direction with this.

Thank you in advance, Sean

ADD COMMENTlink modified 11 days ago • written 11 days ago by skbrimer440
1

Blast parser in biopython?

ADD REPLYlink written 11 days ago by genomax40k

see, this is why I come here. I may have been doing this the hard way

ADD REPLYlink written 11 days ago by skbrimer440
2
gravatar for skbrimer
11 days ago by
skbrimer440
United States
skbrimer440 wrote:

here is my solution!

import re
import csv
import sys
import pandas as pd

with open(sys.argv[1], newline="") as csv_file, open("flu_hits.csv","w") as justFlu:
    reader = csv.reader(csv_file, delimiter="\t")
    writer = csv.writer(justFlu, delimiter="\t")
    for row in reader:
        if re.match(r'influenza',row[1], re.I) != None:
           writer.writerow(row)

flu_matches = pd.read_csv("flu_hits.csv", sep="\t", header=None)

NodeList = {node for node in list(flu_matches[0])}

for node in NodeList:
    contig = flu_matches[0].str.contains(node)
    hits = list(flu_matches[contig].max())
    print(hits[-2])

I need to change the print function to a file to write to for further processing but it is doing want I need!

ADD COMMENTlink written 11 days ago by skbrimer440
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1265 users visited in the last hour