Hi all,
I need to find, in a set of fasta sequences, sequence that has a region of 30 or more Gln and Asn residues per 80-residue window (Gln/Asn-Rich Tracts). Any programs I can use out there?
There is a mention about patmatch at this forum, can it be adapted to my purpose?
Just for completeness, a BioRuby solution to count Q + N in 80-residue moving window.
require 'rubygems'
require 'bio'
Bio::FlatFile.foreach("myseqs.fa") do |sequence|
sequence.seq.split("").each_cons(80) do |window|
puts window.count("Q") + window.count("N")
end
end
Here, each sequence is split into an array and the each_cons() array method provides the moving window - it returns nil if the sequence length is < 80.
Isn't that going to print the count for every possible 80 bp window even if it's zero?
ADD REPLY
• link
updated 6.1 years ago by
Ram
44k
•
written 14.4 years ago by
brentp
24k
0
Entering edit mode
Yes, this snippet would print the count for each window. You'd need to add a conditional (just another couple of lines), to print only if at least one of the window counts is >= 30.
If you use numpy, it's easy and fast to take an 80bp window using convolve
The script below assumes you're giving it a protein fasta and prints out sequences that are "enriched", but if you need the exact position where the enrichment occurs within each sequence, you'll need to add a couple lines (lookup numpy.where). runs in a few seconds for a 4.5MB file.
from pyfasta import Fasta
import numpy as np
fasta = 'orf_trans.fasta'
window = 80
cutoff = 30
f = Fasta(fasta)
kern = np.ones((window,))
enriched = 0
for k in f.keys():
seq = np.array(f[k], dtype='c')
# get a boolean array with True where AA is one of 'Q', 'N'
a = (seq == 'N') | (seq == 'Q')
# take an 80bp moving window.
mw = np.convolve(a, kern, mode='same')
# if it doesn't have any windows > 30, skip it.
if mw.max() < cutoff: continue
enriched += 1
# can also add something here to find the exact loc of the enrichement(s)
print k.split()[0]
print "\n%s out of %i sequences are enriched" % (enriched , len(f.keys()))
import re
seq = 'ACGGTCGIWYDEAJCFSHDKANNNNNQNQNQNQNQNDGASDGADJHDAGJDAQNNAQHKJHSHDDAAHHQNNQNQ'
score_by_position = {}
gln_or_asn_matcher = re.compile('[QN]')
step = 10 # replace it with 80 on the real seq
cutoff = 5 # replace it with 30
# note that this ignores all windows smaller than $step at the left and right of the sequence
for pos in range(0, len(seq)-step):
score = len(gln_or_asn_matcher.findall(seq[pos:pos+step]))
score_by_position[pos] = score
if score > cutoff:
print 'possible enriched region at position %s; %s' % (pos, seq[pos:pos+step])
# requires matplotlib
from matplotlib.pylab import bar, show
bar(score_by_position.keys(), score_by_position.values())
xticks(range(len(seq)), seq)
show()
of course, there it should be other tools that do this and in a more secure way.
you should test this script with other values before using it.
It is generally not good to import individual functions from a module.
You can do:
import matplotlib.pyplot as plt
plt.bar(score_by_position.keys(), score_by_position.values())
plt.xticks(range(len(seq)), seq)
plt.show()
ADD REPLY
• link
updated 5.1 years ago by
Ram
44k
•
written 14.4 years ago by
Zhaorong
★
1.4k
0
Entering edit mode
I wonder whether getting all the match positions into a list first (for the whole sequence), then looking at 80 bp windows across these indices would be faster on large scale (handily these are now in sorted order), that would avoid repeatedly invoking the findall. I am on the road so I can't try this out, will see if I get to it.
you are not looking for a motif, you are looking for enrichment in regions of a sequence.
@noyk Care to change the title according to @giovanni's suggestion? Great question :)
thanks.. will try out your suggestions
remember to select the answer that you want to accept, to thank its author.