I'm trying to extract nucleotide sequences from a draft genome based on alignment coordinates from a SAM file. Right now, I'm opening the SAM file as a csv, then matching sequence name and finally grabbing the sequence based on the coordinates.
import csv import re from Bio import SeqIO bam_file=open('orfH9_offtarget.sam', 'rb') bam_reader=csv.reader(bam_file,delimiter='\t') offtarget_results = open('offtarget_orfH9.txt', 'w') unique_seq_names= parsed_data= num= for row in bam_reader: if row: genome_seq=SeqIO.parse(open('genome.fasta', 'rU'), 'fasta') for contig in genome_seq: if row in contig.id: if int(row) == 256: i = int(row) num.append(contig.seq[i:i+20]) elif int(row) == 272: i = int(row) num.append(contig.seq.reverse_complement()[i:i+20]) elif int(row) == 0: i = int(row) num.append(contig.seq[i:i+20]) elif int(row) == 16: i = int(row) num.append(contig.seq.reverse_complement()[i:i+20]) print("length%s" % len(num))
row has flags that dictate whether I pull the reverse complement. 0 and 16 mean exact match while the other two flags mean non exact match. I'm only appending to a list right now to make sure that I successfully get each sequence. It will write to a file when it works.
This code works exactly as intended, but it's SLOW. I suspect the problem is re-opening the draft genome sequence file for every row, but I can't open it outside of the for loop.
Any suggestions? (Keeping in mind that I am a wet-lab scientist that knows a little python)