Perl Script To Get Upstream And Downstream Regions Of Start Codons
4
0
Entering edit mode
10.1 years ago
eddie.im ▴ 140

Hi,

I'm trying to get upstream and downstream regions of start codons in a genome (20nt upstream and 4nr downstream) but my script gets only the first start codon even if I use the "g" modifier on the regex. How can I get it to read all start codons (ATG)?

use strict;
use warnings;
my @regions;
my $term="ATG";
my $seqsample="CCCCATAGAGATAGAGATAGAGAACCCCGCGCGCTCGCATGGGGATGCATGATTCGG";
while ( $seq =~ m/(\S{20})$term(\S{4})/g ) {
        my $xx = $1.$term.$2;
    push (@regions, $xx);
    }
print "@regions\n"

Here's the script that I wrote.

Thanks in advance.

perl genomics • 5.4k views
ADD COMMENT
0
Entering edit mode

Hello I am also trying this, but instead of ATG I need to extract region from sequence position suppose in long sequence of 5000 nucl I want to extract +/- region from position

1------------------------------------------2946_GENE_SEQUENCE_2995-------------------------------------5000
                                                  200 residues here                                                200 residues here
ADD REPLY
0
Entering edit mode
>gi|546867469
TAGCTACATCAGTTGAAGATGGGGTTGCTGGGTAGCTGTCAGGCATAGGAGTCCCATCTTTCTCGGAAAG
AATGACAATAAGGTTGTTAAAGGGCAATCAGTCAAGGACAGGGGGAGCCACTCAAATCAGCACATGTGGC
ATGACAGTGATGCAGCATTCCCCTCAACCACCAATTGCTCCCAGGAATAGCACCTGGCTGGTCTGGCATC
TCGGGATGAGGGCTAAGGCACAGTGCCATTAGTGCCAAGAACCAAAGATCTGAGTCTTTGCAGACCCTAA
AAATTAAATCCTTTCTGTGCCCCGGGTGGAGGGGGAGAGTTCAAGTCACTTTATAATATTTGCCTCATAT
ACATATCACCTGTCCCCTTCCACCTCCTTTTTCTTGCAGTCCTAATTATCCAAGTAGCATCTACAACTCT

region to extract from

150 to 160 20-20 up/down stream

Kindly help me to extract this region with perl script or awk command.

Thank you

ADD REPLY
1
Entering edit mode
10.1 years ago

First

while ( $seqsample =~ m/(\S{20})$term(\S{4})/g ) {

Second, this construction of the variable flanked by other stuff just isn't working. I'd use pos to get the positions of ATG in the string, then use substr to pull out the flanking sequence.

ADD COMMENT
1
Entering edit mode

The OP only needs to use zero-width positive lookbehind and positive lookahead assertions, and capture their contents:

while ( $seqsample =~ /(?<=(\S{20}))$term(?=(\S{4}))/g ) { ...
ADD REPLY
1
Entering edit mode
10.1 years ago
JC 13k

The errror is because you are extending the ATG 20 bases before and 4 after, the regex is matching non-overlapping hits. Its better to find the ATG positions and get the string with substr as swbarnes2 suggested:

use strict;
use warnings;
my @regions;
my $term = "ATG";
my $len  = length $term;
my $seq  = "CCCCATAGAGATAGAGATAGAGAACCCCGCGCGCTCGCATGGGGATGCATGATTCGG";
while ( $seq =~ m/$term/g ) {
    my $pos = $+[0];
    my $xx = substr ($seq, $pos - 20 - $len, 20 + $len + 4);
    push (@regions, $xx);
}
print join "\n", @regions, "";

Output:

$ perl atg.pl 
AGAGAACCCCGCGCGCTCGCATGGGGA
CCCCGCGCGCTCGCATGGGGATGCATG
GCGCGCTCGCATGGGGATGCATGATTC
ADD COMMENT
0
Entering edit mode

The OP could just capture the contents of term-flanking zero-width assertions.

ADD REPLY
1
Entering edit mode
10.1 years ago
Kenosis ★ 1.3k

You're so very close! Just capture the contents of positive lookbehind and positive lookahead, zero-width assertions, so the search position isn't impacted by them:

use strict;
use warnings;

my ( $term, @regions ) = "ATG";
my $seq = "CCCCATAGAGATAGAGATAGAGAACCCCGCGCGCTCGCATGGGGATGCATGATTCGG";

while ( $seq =~ /(?<=(.{20}))$term(?=(.{4}))/g ) {
    push @regions, $1 . $term . $2;
}

print "$_\n" for @regions;

Output:

AGAGAACCCCGCGCGCTCGCATGGGGA
CCCCGCGCGCTCGCATGGGGATGCATG
GCGCGCTCGCATGGGGATGCATGATTC

Hope this helps!

ADD COMMENT
0
Entering edit mode

Thanks for your help! it worked very well!

ADD REPLY
0
Entering edit mode

You're most welcome!

ADD REPLY
0
Entering edit mode
8.7 years ago
cyril-cros ▴ 950

Another solution, without Perl:

grab the official annotation in gtf format, it should contains some lines with start_codon. Use grep to extract them, gtf2bed (from BEDOPS) and bedtools slop -l 20 -r 4 -s (from bedtools).

ADD COMMENT

Login before adding your answer.

Traffic: 2002 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6