Question: Help: do I always have to re-open sequence file within a for loop in python?
1
gravatar for pawlowac
5.1 years ago by
pawlowac70
Boston, Harvard Medical School
pawlowac70 wrote:

I'm trying to extract nucleotide sequences from a draft genome based on alignment coordinates from a SAM file. Right now, I'm opening the SAM file as a csv, then matching sequence name and finally grabbing the sequence based on the coordinates.

import csv
import re
from Bio import SeqIO

bam_file=open('orfH9_offtarget.sam', 'rb')
bam_reader=csv.reader(bam_file,delimiter='\t')
offtarget_results = open('offtarget_orfH9.txt', 'w')

unique_seq_names=[]
parsed_data=[]
num=[]

for row in bam_reader:
    if row[2]:
        genome_seq=SeqIO.parse(open('genome.fasta', 'rU'), 'fasta')
        for contig in genome_seq:
            if row[2] in contig.id:
                    if int(row[1]) == 256:
                        i = int(row[3])
                        num.append(contig.seq[i:i+20])
                    elif int(row[1]) == 272:
                        i = int(row[3])
                        num.append(contig.seq.reverse_complement()[i:i+20])
                    elif int(row[1]) == 0:
                        i = int(row[3])
                        num.append(contig.seq[i:i+20])
                    elif int(row[1]) == 16:
                        i = int(row[3])
                        num.append(contig.seq.reverse_complement()[i:i+20])

     print("length%s" % len(num))    

row[1] has flags that dictate whether I pull the reverse complement. 0 and 16 mean exact match while the other two flags mean non exact match. I'm only appending to a list right now to make sure that I successfully get each sequence. It will write to a file when it works.

 

This code works exactly as intended, but it's SLOW. I suspect the problem is re-opening the draft genome sequence file for every row, but I can't open it outside of the for loop.

Any suggestions? (Keeping in mind that I am a wet-lab scientist that knows a little python)

 

biopython sam bowtie2 python • 1.9k views
ADD COMMENTlink modified 5.0 years ago by Peter5.9k • written 5.1 years ago by pawlowac70
2

ok, I know close to nothing of Python, but I understand you are opening the same genome over and over, right?

genome_seq=SeqIO.parse(open('genome.fasta', 'rU'), 'fasta')

If this is the case, of course you can move reading the genome to outside the loop. Then you will read it only once and keep it in memory.

ADD REPLYlink written 5.1 years ago by h.mon32k
1

I think reading the whole reference sequence and keeping it in memory is the way to go if you want to query many random positions. Even if fasta files are indexed, querying millions of positions will take much much longer then accessing the genome stored memory.

This simple function will read the reference fasta and make a dictionary of it (not as tested and robust as biopython but you drop a dependency):

def fastaToDict(faName):
    """Read fasta file faName and put it in dict with as {seqname:sequence}
    """
    fasta= open(faName)
    genome= {}
    for line in fasta:
        line= line.strip()
        if line.startswith('>'):
            chrom= line.lstrip('>').split()[0]
            genome[chrom]= []
            continue
        genome[chrom].append(line)
    fasta.close()
    for chrom in genome:
        genome[chrom]= ''.join(genome[chrom])
    return genome
ADD REPLYlink modified 13 months ago by Ram32k • written 5.1 years ago by dariober11k
2
gravatar for Matt Shirley
5.1 years ago by
Matt Shirley9.5k
Cambridge, MA
Matt Shirley9.5k wrote:

I can see that you're still developing this script, so I'll point out some code that might help you along the way. This snippet uses two Python modules I maintain:

Your current code is slow because each time you open the FASTA file you read through from top to bottom to find the sequence you want. This is unnecessary since FASTA files can be indexed and randomly accessed using pyfaidx. The "simplesam" module is an alternative to pysam and doesn't require any compiler - it's all Python - but you can go ahead and parse the SAM format if you are doing this as a learning exercise.

ADD COMMENTlink modified 13 months ago by Ram32k • written 5.1 years ago by Matt Shirley9.5k
1
gravatar for James Ashmore
5.1 years ago by
James Ashmore3.1k
UK/Edinburgh/MRC Centre for Regenerative Medicine
James Ashmore3.1k wrote:

A non-python solution would be to use bedtools:

# Create BED file of alignment coordinates, name field is sequence name, score field is flag number
bedtools bamtobed -tag -i orfH9_offtarget.sam > orfH9_offtarget.bed
# FASTA file of sequences corresponding to the alignment coordinates
bedtools getfasta -s -fi genome.fasta -bed orfH9_offtarget.bed > orfH9_offtarget.fasta
ADD COMMENTlink modified 13 months ago by Ram32k • written 5.1 years ago by James Ashmore3.1k

It's a good idea but as I said in my comment above if you have a bam file of millions of reads, bedtools getfasta will take ages.

ADD REPLYlink written 5.1 years ago by dariober11k
1
gravatar for Peter
5.0 years ago by
Peter5.9k
Scotland, UK
Peter5.9k wrote:

Rather than using SeqIO.parse to loop over the file every time, use SeqIO.index or SeqIO.index_db which can jump directly to the FASTA entry of interest and parse it on demand. There are example in the Biopython Tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html

ADD COMMENTlink written 5.0 years ago by Peter5.9k

I don't know about SeqIO.index, but if it uses the index to query the reference fasta in disk, it will take a long time to query millions of positions. Better to read the reference in memory, if possible.

(This is essentially the same comment I posted above.)

ADD REPLYlink written 5.0 years ago by dariober11k
0
gravatar for John
5.1 years ago by
John12k
Germany
John12k wrote:

Personally, I think it would be better to reduce the size/complexity of the input data first, by extracting from the SAM file all the data you need (and only the data you need) first, then once thats done using it to get DNA sequences:

import csv
import re
from Bio import SeqIO
import string
import collections

bam_file=open('orfH9_offtarget.sam', 'rb')
bam_reader=csv.reader(bam_file,delimiter='\t')
offtarget_results = open('offtarget_orfH9.txt', 'w')

contigs = set()
genome_seq = SeqIO.parse(open('genome.fasta', 'rU'), 'fasta')
for contig in genome_seq: contigs.add(contig)

forward = collections.defaultdict(list)
reverse = collections.defaultdict(list)

revcomp = string.maketrans('ACGT','TGCA')
def rc(DNA): return DNA.translate(revcomp)[::-1]

for row in bam_reader:
  if row[2] in contigs:
    if row[1] in ('256','0'):         # no need to int(row[1]) :)
      forward[row[2]].append(row[3])
    else:
      reverse[row[2]].append(row[3])

num = []
genome_seq = SeqIO.parse(open('genome.fasta', 'rU'), 'fasta')
for contig in genome_seq:
  seq = contig.seq
  for x in forward[contig]: num.append(seq[x:x+20])
  for x in reverse[contig]: num.append(rc(seq[x:x+20]))

print("length%s" % len(num))

Something like that. If i could have made some assumptions about your data (like, your SAM file is sorted by contig and pos) then I would have made it so we read X amount of data from the SAM file, until we come across a read that is in a different contig - then we do the DNA for the previous contig first, before moving on with new data. This would reduce the RAM used, but not by a whole lot.

Furthermore, if you have multiple reads starting at the same place, then I would have used a dict of {position:counter} rather than a list of positions (with repetition).

ADD COMMENTlink modified 5.0 years ago • written 5.1 years ago by John12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2230 users visited in the last hour
_