Question: python script to sort blast results and fetch best matches from NCBI
0
gravatar for skbrimer
5 months ago by
skbrimer500
United States
skbrimer500 wrote:

Hello hive mind,

I have been trying to automate a process that I do for flu sequencing. I currently use SPAdes to do a de novo build and then blast the contigs using the ncbi command line, remotely. I get back a tab delimited csv file that has the contig, the sequence id, the ncbi acc # and the raw score.

I have working a small script to pull just flu contigs

import re
import csv
import sys

with open(sys.argv[1], newline="") as csv_file, open("flu_hits.csv","w") as justFlu:
    reader = csv.reader(csv_file, delimiter="\t")
    writer = csv.writer(justFlu, delimiter="\t")
    for row in reader:
        if re.match(r'influenza',row[1], re.I) != None:
           writer.writerow(row)

this works just fine, however now I want to take the flu file and select the best match for each contig so I can make a list of the accession numbers to later then use to pull from ncbi to make a reference.

My current issue is trying to pull the highest score. I have some code (below) that is working but not the way I want. It is only returning the highest score from all the contigs not the highest score for each contig.

with open(sys.argv[1], newline="") as csv_file, open("temp.csv","w") as subset:
    reader = csv.reader(csv_file, delimiter="\t")
    writer = csv.writer(subset, delimiter="\t")
    for row in reader:
        if row[0] in NodeList:
            writer.writerow(row)
    with open("temp.csv", "r") as temp:
        subset_reader = csv.reader(temp, delimiter="\t")
        BestHit = max(subset_reader, key=lambda column: int(column[-1].replace(',','')))
        print(BestHit)

I do this manually now but it take a lot longer than it should, so I would really appreciate any direction with this.

Thank you in advance, Sean

ADD COMMENTlink modified 5 months ago • written 5 months ago by skbrimer500
1

Blast parser in biopython?

ADD REPLYlink written 5 months ago by genomax49k

see, this is why I come here. I may have been doing this the hard way

ADD REPLYlink written 5 months ago by skbrimer500
2
gravatar for skbrimer
5 months ago by
skbrimer500
United States
skbrimer500 wrote:

here is my solution!

import re
import csv
import sys
import pandas as pd

with open(sys.argv[1], newline="") as csv_file, open("flu_hits.csv","w") as justFlu:
    reader = csv.reader(csv_file, delimiter="\t")
    writer = csv.writer(justFlu, delimiter="\t")
    for row in reader:
        if re.match(r'influenza',row[1], re.I) != None:
           writer.writerow(row)

flu_matches = pd.read_csv("flu_hits.csv", sep="\t", header=None)

NodeList = {node for node in list(flu_matches[0])}

for node in NodeList:
    contig = flu_matches[0].str.contains(node)
    hits = list(flu_matches[contig].max())
    print(hits[-2])

I need to change the print function to a file to write to for further processing but it is doing want I need!

ADD COMMENTlink written 5 months ago by skbrimer500
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1003 users visited in the last hour