How can I construct a polyA (e.g AATAAA)signal coordinate file and view them in genome browser?
Thanks
How can I construct a polyA (e.g AATAAA)signal coordinate file and view them in genome browser?
Thanks
OK, this is not really a answer to your question as I will use a perfect match algorithm (the signal will be either full of 'A' OR 'T'), but I found it fun to solve this problem using GNU Flex, the "Fast Lexical Analyzer". This following simple lex file just define a regex of for a polyA/T signal of at least 20 bases.
%{
#include <ctype.h>
#include <stdio.h>
#include <string.h>
int offset=0;
char name[BUFSIZ];
%}
%option noyywrap
POLYA ([aA][aA\n ]{20,}|[tT][tT\n]{20,})
%%
>.*\n {
offset=0;
strncpy(name,&yytext[1],yyleng-2);
}
{POLYA} {
int i;
int j=0;
int len=0;
for(i=0;i< yyleng;++i)
{
if(!isspace(yytext[i])) ++len;
}
printf("%s\t%d\t%d\t%d\n",name,offset,offset+len,len);
offset+=len;
}
. {
offset++;
}
\n
%%
int main(int argc,char** argv)
{
name[0]=0;
yylex();
return 0;
}
Compilation:
flex polya.l
gcc -O3 -Wall lex.yy.c
Example:
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz" |\
gunzip -c | ./a.out | head -n 20
chr1 67037 67058 21
chr1 71996 72017 21
chr1 84687 84710 23
chr1 88955 88979 24
chr1 123928 123957 29
chr1 154536 154558 22
chr1 155150 155173 23
chr1 164765 164796 31
chr1 244723 244745 22
chr1 319122 319147 25
chr1 412423 412451 28
chr1 650294 650323 29
chr1 701083 701103 20
chr1 785421 785454 33
chr1 824267 824288 21
chr1 832781 832805 24
chr1 868769 868791 22
chr1 932989 933023 34
chr1 974010 974032 22
chr1 1027314 1027339 25
Here is another possible solution in python with the pyfasta module. It will find starting locations that initiate a sequence of As or Ts, but unlike Pierre's solution it will not print the maximal such length. A little more fiddling in the while loop would do that; for now let's keep it simple:
import re
from pyfasta import Fasta
fast = Fasta('chr1.fa')
chr1 = fast['chr1']
patt = re.compile('A{20,}' + '|' + 'T{20,}')
size = 20
stream = enumerate(chr1)
for index, base in stream:
sub = chr1[index:index+size]
if patt.match(sub.upper()):
print 'chr1\t%s\t%s' % (index, index+size)
while base.upper() in 'AT':
index, base = stream.next()
PS: you can also modify the regular expression to match more cases
ialbert@borg ~/temp
$ python polya.py
chr1 67037 67057
chr1 71996 72016
chr1 84687 84707
chr1 88955 88975
chr1 123928 123948
chr1 154536 154556
chr1 155150 155170
chr1 164765 164785
chr1 244723 244743
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Do you already have the coordinates for the polyAs? or are you asking for a way to calculate them?
No, I just want to create a PolyA track using those polyA signals in quick and dirty way.