Question: Enriched Regions Search Program
6
gravatar for Noyk
7.4 years ago by
Noyk70
Noyk70 wrote:

Hi all, I need to find, in a set of fasta sequences, sequence that has a region of 30 or more Gln and Asn residues per 80-residue window (Gln/Asn-Rich Tracts). Any programs I can use out there? There is a mention about patmatch at this forum, can it be adapted to my purpose?

ADD COMMENTlink modified 7.4 years ago by Neilfws47k • written 7.4 years ago by Noyk70
3

you are not looking for a motif, you are looking for enrichment in regions of a sequence.

ADD REPLYlink written 7.4 years ago by Giovanni M Dall'Olio25k

@noyk Care to change the title according to @giovanni's suggestion? Great question :)

ADD REPLYlink written 7.4 years ago by Eric Normandeau9.5k

thanks.. will try out your suggestions

ADD REPLYlink written 7.4 years ago by Noyk70

remember to select the answer that you want to accept, to thank its author.

ADD REPLYlink written 7.4 years ago by Giovanni M Dall'Olio25k
8
gravatar for Neilfws
7.4 years ago by
Neilfws47k
Sydney, Australia
Neilfws47k wrote:

Just for completeness, a BioRuby solution to count Q + N in 80-residue moving window.

require 'rubygems'
require 'bio'

Bio::FlatFile.foreach("myseqs.fa") do |sequence|
  sequence.seq.split("").each_cons(80) do |window|
    puts window.count("Q") + window.count("N")
  end
end

Here, each sequence is split into an array and the each_cons() array method provides the moving window - it returns nil if the sequence length is < 80.

ADD COMMENTlink written 7.4 years ago by Neilfws47k
1

neat! I love looking at small code snippets

ADD REPLYlink written 7.4 years ago by Istvan Albert ♦♦ 73k
1

cool. i dont use much ruby, and had never heard of each_cons. a decent summary: http://www.themomorohoax.com/2009/01/29/ruby-each-cons

isn't that going to print the count for every possible 80 bp window even if it's zero?

ADD REPLYlink written 7.4 years ago by brentp22k

Yes, this snippet would print the count for each window. You'd need to add a conditional (just another couple of lines), to print only if at least one of the window counts is >= 30.

ADD REPLYlink written 7.4 years ago by Neilfws47k
6
gravatar for brentp
7.4 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

if you use numpy, it's easy and fast to take an 80bp window using convolve

the script below assumes you're giving it a protein fasta and prints out sequences that are "enriched", but if you need the exact position where the enrichment occurs within each sequence, you'll need to add a couple lines (lookup numpy.where). runs in a few seconds for a 4.5MB file.

from pyfasta import Fasta
import numpy as np

fasta = 'orf_trans.fasta'
window = 80
cutoff = 30

f = Fasta(fasta)
kern = np.ones((window,))

enriched = 0
for k in f.keys():
    seq = np.array(f[k], dtype='c')
    # get a boolean array with True where AA is one of 'Q', 'N'
    a = (seq == 'N') | (seq == 'Q')

    # take an 80bp moving window.
    mw = np.convolve(a, kern, mode='same')
    # if it doesn't have any windows > 30, skip it.
    if mw.max() < cutoff: continue
    enriched += 1
    # can also add something here to find the exact loc of the enrichement(s)
    print k.split()[0]

print "\n%s out of %i sequences are enriched" % (enriched , len(f.keys()))
ADD COMMENTlink written 7.4 years ago by brentp22k

beautiful solution - numpy is amazing

ADD REPLYlink written 7.4 years ago by Istvan Albert ♦♦ 73k
4
gravatar for Giovanni M Dall'Olio
7.4 years ago by
London, UK
Giovanni M Dall'Olio25k wrote:

Well, a very didactic python solution would be:

import re
seq = 'ACGGTCGIWYDEAJCFSHDKANNNNNQNQNQNQNQNDGASDGADJHDAGJDAQNNAQHKJHSHDDAAHHQNNQNQ'
score_by_position = {}
gln_or_asn_matcher = re.compile('[QN]')
step = 10 # replace it with 80 on the real seq
cutoff = 5 # replace it with 30

# note that this ignores all windows smaller than $step at the left and right of the sequence    
for pos in range(0, len(seq)-step):
    score = len(gln_or_asn_matcher.findall(seq[pos:pos+step]))
    score_by_position[pos] = score
    if score > cutoff:
        print 'possible enriched region at position %s; %s' % (pos, seq[pos:pos+step])

# requires matplotlib
from matplotlib.pylab import bar, show
bar(score_by_position.keys(), score_by_position.values())
xticks(range(len(seq)), seq)
show()

of course, there it should be other tools that do this and in a more secure way. you should test this script with other values before using it.

ADD COMMENTlink written 7.4 years ago by Giovanni M Dall'Olio25k

Just a minor correction: from matplotlib.pylab import bar, show, xticks

It is generally not good to import individual functions from a module. You can do: import matplotlib.pyplot as plt plt.bar(score_by_position.keys(), score_by_position.values()) plt.xticks(range(len(seq)), seq) plt.show()

ADD REPLYlink written 7.4 years ago by Zhaorong1.1k

I wonder whether getting all the match positions into a list first (for the whole sequence), then looking at 80 bp windows across these indices would be faster on large scale (handily these are now in sorted order), that would avoid repeatedly invoking the findall. I am on the road so I can't try this out, will see if I get to it.

ADD REPLYlink written 7.4 years ago by Istvan Albert ♦♦ 73k
3
gravatar for Pierre Lindenbaum
7.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum98k wrote:

Here is my java solution:

File 'SearchGlnAsn.java':

import java.util.LinkedList;

public class SearchGlnAsn
    {
    public static void main(String[] args) throws Exception
        {
        int c;
        int pos=0;
        int count_N_Q=0;
        LinkedList<Boolean> window=new LinkedList<Boolean>();
        StringBuilder name=new StringBuilder();
        while((c=System.in.read())!=-1)
            {
            if(c=='>')
                {
                name.setLength(0);
                while((c=System.in.read())!='\n') name.append((char)c);
                pos=0;
                count_N_Q=0;
                window.clear();
                }
            else if(!Character.isWhitespace(c))
                {
                pos++;
                boolean isAA=(c=='N' || c=='Q' || c=='n' || c=='q');
                if(isAA) count_N_Q++;
                window.add(isAA);
                if(window.size()==80)
                    {
                    if(count_N_Q>30)
                        {
                        System.out.println(name+"\t"+count_N_Q+"\t"+(pos-80));
                        }
                    if(window.remove()) count_N_Q--;
                    }

                }
            }

        }

    }

Compilation:

javac SearchGlnAsn.java

Execution:

curl -s ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz |\
gunzip -c |\
java SearchGlnAsn |\
awk -F '    ' '{if($2>75) print $0; }'

sp|Q156A1|ATX8_HUMAN Ataxin-8 OS=Homo sapiens GN=ATXN8 PE=1 SV=1    79  0
sp|Q54UG3|Y1095_DICDI UPF0746 protein DDB_G0281095 OS=Dictyostelium discoideum GN=DDB_G0281095 PE=3 SV=1    76  415

 curl "http://www.uniprot.org/uniprot/Q156A1.fasta"
>sp|Q156A1|ATX8_HUMAN Ataxin-8 OS=Homo sapiens GN=ATXN8 PE=1 SV=1
MQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
QQQQQQQQQQQQQQQQQQQQ
ADD COMMENTlink modified 7.4 years ago • written 7.4 years ago by Pierre Lindenbaum98k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1259 users visited in the last hour