Question

Perl Script To Get Upstream And Downstream Regions Of Start Codons

0

Entering edit mode

10.7 years ago

eddie.im ▴ 140

Hi,

I'm trying to get upstream and downstream regions of start codons in a genome (20nt upstream and 4nr downstream) but my script gets only the first start codon even if I use the "g" modifier on the regex. How can I get it to read all start codons (ATG)?

use strict;
use warnings;
my @regions;
my $term="ATG";
my $seqsample="CCCCATAGAGATAGAGATAGAGAACCCCGCGCGCTCGCATGGGGATGCATGATTCGG";
while ( $seq =~ m/(\S{20})$term(\S{4})/g ) {
        my $xx = $1.$term.$2;
    push (@regions, $xx);
    }
print "@regions\n"

Here's the script that I wrote.

Thanks in advance.

perl genomics • 5.7k views

ADD COMMENT • link updated 2.0 years ago by Ram 44k • written 10.7 years ago by eddie.im ▴ 140

0

Entering edit mode

Hello I am also trying this, but instead of ATG I need to extract region from sequence position suppose in long sequence of 5000 nucl I want to extract +/- region from position

1------------------------------------------2946_GENE_SEQUENCE_2995-------------------------------------5000
                                                  200 residues here                                                200 residues here

ADD REPLY • link updated 2.0 years ago by Ram 44k • written 9.3 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

>gi|546867469
TAGCTACATCAGTTGAAGATGGGGTTGCTGGGTAGCTGTCAGGCATAGGAGTCCCATCTTTCTCGGAAAG
AATGACAATAAGGTTGTTAAAGGGCAATCAGTCAAGGACAGGGGGAGCCACTCAAATCAGCACATGTGGC
ATGACAGTGATGCAGCATTCCCCTCAACCACCAATTGCTCCCAGGAATAGCACCTGGCTGGTCTGGCATC
TCGGGATGAGGGCTAAGGCACAGTGCCATTAGTGCCAAGAACCAAAGATCTGAGTCTTTGCAGACCCTAA
AAATTAAATCCTTTCTGTGCCCCGGGTGGAGGGGGAGAGTTCAAGTCACTTTATAATATTTGCCTCATAT
ACATATCACCTGTCCCCTTCCACCTCCTTTTTCTTGCAGTCCTAATTATCCAAGTAGCATCTACAACTCT

region to extract from

150 to 160 20-20 up/down stream

Kindly help me to extract this region with perl script or awk command.

Thank you

ADD REPLY • link updated 2.0 years ago by Ram 44k • written 9.3 years ago by harpreetmanku04 ▴ 10

score 1 · Answer 1 · 2014-03-18

1

Entering edit mode

10.7 years ago

swbarnes2 14k

First

while ( $seqsample =~ m/(\S{20})$term(\S{4})/g ) {

Second, this construction of the variable flanked by other stuff just isn't working. I'd use pos to get the positions of ATG in the string, then use substr to pull out the flanking sequence.

ADD COMMENT • link 10.7 years ago by swbarnes2 14k

1

Entering edit mode

The OP only needs to use zero-width positive lookbehind and positive lookahead assertions, and capture their contents:

while ( $seqsample =~ /(?<=(\S{20}))$term(?=(\S{4}))/g ) { ...

ADD REPLY • link 10.6 years ago by Kenosis ★ 1.3k

score 1 · Answer 2 · 2014-03-18

The errror is because you are extending the ATG 20 bases before and 4 after, the regex is matching non-overlapping hits. Its better to find the ATG positions and get the string with substr as swbarnes2 suggested:

use strict;
use warnings;
my @regions;
my $term = "ATG";
my $len  = length $term;
my $seq  = "CCCCATAGAGATAGAGATAGAGAACCCCGCGCGCTCGCATGGGGATGCATGATTCGG";
while ( $seq =~ m/$term/g ) {
    my $pos = $+[0];
    my $xx = substr ($seq, $pos - 20 - $len, 20 + $len + 4);
    push (@regions, $xx);
}
print join "\n", @regions, "";

Output:

$ perl atg.pl 
AGAGAACCCCGCGCGCTCGCATGGGGA
CCCCGCGCGCTCGCATGGGGATGCATG
GCGCGCTCGCATGGGGATGCATGATTC

score 1 · Answer 3 · 2014-03-20

1

Entering edit mode

10.6 years ago

Kenosis ★ 1.3k

You're so very close! Just capture the contents of positive lookbehind and positive lookahead, zero-width assertions, so the search position isn't impacted by them:

use strict;
use warnings;

my ( $term, @regions ) = "ATG";
my $seq = "CCCCATAGAGATAGAGATAGAGAACCCCGCGCGCTCGCATGGGGATGCATGATTCGG";

while ( $seq =~ /(?<=(.{20}))$term(?=(.{4}))/g ) {
    push @regions, $1 . $term . $2;
}

print "$_\n" for @regions;

Output:

AGAGAACCCCGCGCGCTCGCATGGGGA
CCCCGCGCGCTCGCATGGGGATGCATG
GCGCGCTCGCATGGGGATGCATGATTC

Hope this helps!

ADD COMMENT • link 10.6 years ago by Kenosis ★ 1.3k

0

Entering edit mode

Thanks for your help! it worked very well!

ADD REPLY • link 10.6 years ago by eddie.im ▴ 140

0

Entering edit mode

You're most welcome!

ADD REPLY • link 10.6 years ago by Kenosis ★ 1.3k

Ram · Answer 4 · 2015-07-27

0

Entering edit mode

9.3 years ago

cyril-cros ▴ 950

Another solution, without Perl:

grab the official annotation in gtf format, it should contains some lines with start_codon. Use grep to extract them, gtf2bed (from BEDOPS) and bedtools slop -l 20 -r 4 -s (from bedtools).

ADD COMMENT • link updated 2.0 years ago by Ram 44k • written 9.3 years ago by cyril-cros ▴ 950