Question

Python script to sort BLAST results and fetch best matches from NCBI

6.3 years ago

skbrimer ▴ 740

Hello hive mind,

I have been trying to automate a process that I use for flu sequencing. I currently use SPAdes to do a de novo assembly and then BLAST the contigs remotely using the NCBI command-line tools. I get back a tab-delimited file that has the contig, the sequence ID, the NCBI accession number, and the raw score.

I have a small working script that pulls out just the flu contigs:

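import re
import csv
import sys

# Keep only the rows whose sequence ID (second column) starts with "influenza".
with open(sys.argv[1], newline="") as csv_file, open("flu_hits.csv", "w", newline="") as justFlu:
    reader = csv.reader(csv_file, delimiter="\t")
    writer = csv.writer(justFlu, delimiter="\t")
    for row in reader:
        # re.match anchors at the start of the string; re.I ignores case
        if re.match(r'influenza', row[1], re.I) is not None:
            writer.writerow(row)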

This works just fine. However, I now want to take the flu file and select the best match for each contig, so I can build a list of accession numbers to pull from NCBI later and make a reference.

My current issue is pulling the highest score. I have some code (below) that runs, but not the way I want: it returns only the highest score across all the contigs, not the highest score for each contig.

with open(sys.argv[1], newline="") as csv_file, open("temp.csv","w") as subset:
    reader = csv.reader(csv_file, delimiter="\t")
    writer = csv.writer(subset, delimiter="\t")
    for row in reader:
        if row[0] in NodeList:
            writer.writerow(row)
    with open("temp.csv", "r") as temp:
        subset_reader = csv.reader(temp, delimiter="\t")
        BestHit = max(subset_reader, key=lambda column: int(column[-1].replace(',','')))
        print(BestHit)

I do this manually now, but it takes a lot longer than it should, so I would really appreciate any direction with this.

Thank you in advance, Sean

Assembly alignment sequence python • 2.3k views

Blast parser in biopython?

6.3 years ago by GenoMax 141k

See, this is why I come here. I may have been doing this the hard way.

6.3 years ago by skbrimer ▴ 740

score 2 · Accepted Answer · 2018-01-05

Here is my solution!

import re
import csv
import sys
import pandas as pd

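# Keep only the rows whose sequence ID (second column) starts with "influenza".
with open(sys.argv[1], newline="") as csv_file, open("flu_hits.csv", "w", newline="") as justFlu:
    reader = csv.reader(csv_file, delimiter="\t")
    writer = csv.writer(justFlu, delimiter="\t")
    for row in reader:
        # re.match anchors at the start of the string; re.I ignores case
        if re.match(r'influenza', row[1], re.I) is not None:
            writer.writerow(row)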

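# thousands="," lets scores written like "1,234" parse as numbers
flu_matches = pd.read_csv("flu_hits.csv", sep="\t", header=None, thousands=",")

# Column 0 is the contig, the last column is the raw score,
# and the second-to-last column is the accession number.
score_col = flu_matches.columns[-1]
acc_col = flu_matches.columns[-2]

# For each contig, find the row with the highest score and keep that whole row.
# (Taking .max() of the whole subset would maximise each column independently,
# and str.contains("NODE_1") would also match "NODE_10".)
best_rows = flu_matches.loc[flu_matches.groupby(0)[score_col].idxmax()]

for accession in best_rows[acc_col]:
    print(accession)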

I still need to change the print call to write to a file for further processing, but it is doing what I need!
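For the remaining step of writing the accessions to a file, here is a minimal sketch that uses only the standard library. It assumes the column order described in the question (contig, sequence ID, accession, raw score) and that scores may contain thousands separators. The function names `best_hits_per_contig` and `write_accessions` and the output file name are made up for illustration.

```python
import csv


def best_hits_per_contig(rows):
    """Return {contig: row}, keeping the highest-scoring row for each contig."""
    best = {}
    for row in rows:
        if not row:  # skip blank lines
            continue
        contig = row[0]
        # Assumption: the raw score is the last column and may look like "1,234".
        score = float(row[-1].replace(",", ""))
        if contig not in best or score > best[contig][0]:
            best[contig] = (score, row)
    return {contig: row for contig, (score, row) in best.items()}


def write_accessions(hits_path, out_path):
    """Read a tab-delimited hits file and write one accession per contig."""
    with open(hits_path, newline="") as hits_file:
        best = best_hits_per_contig(csv.reader(hits_file, delimiter="\t"))
    with open(out_path, "w", newline="") as out_file:
        for row in best.values():
            # Assumption: the accession is the second-to-last column.
            out_file.write(row[-2] + "\n")
    return best
```

Something like `write_accessions("flu_hits.csv", "accessions.txt")` would then produce the list for the later NCBI download. Comparing contig names for exact equality means NODE_1 and NODE_10 are kept apart.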