Perl script on RNA G-quadruplex
5 weeks ago
isha.lily20 ▴ 10

Hello researchers,

I am stuck on a Perl script for RNA G-quadruplexes. I want to find and count the unique sequences in which the G-runs have length 3 and the loops have length 7 only. I used the grep function with a regular expression on an array, with a FASTA file as input, and counted the matches at the end. The script runs, but the answers are not correct: the count comes out the same for every regular expression I try. Can anyone please help me with this?

Can anyone share a Perl script that gives the total count of unique sequences with G3 and L7 only?

I used the most common regular expression: ([gG]{3,}\w{1,7}){3,}[gG]{3,}

I tried the simple code syntax: grep_function(regular_expression, @array)

The full script I used:

#!/usr/bin/perl -w

#To count total transcripts containing G-Quadruplexes
#Input filename

print "Please enter file name: ";

$name = <>;
chomp $name;

open OUT, ">.$name.OUTPUT";
open(FASTA, $name) or die;

@data = <FASTA>;
$data = join('', @data); #Convert to string
@data2 = split('\n', $data); #Explode on newline into array elements
@unique = grep(!$seen{$_}++, @data2); #Extract unique elements from @data2

$unique = join('', @unique); #Convert to string
@uniqueid = split('', $unique); #Explode string back into individual array elements.

#Initialize count

$countid = 0;
foreach $id (@uniqueid) {
    if ($id eq "N") {
        ++$countid;
    }
}

#Print

print "\n\nNumber of transcripts is : $countid";
print OUT "Number of unique transcripts is :$countid";

#Exit
exit;

Thank you.

perl G-quadruplex rna • 623 views
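
For readers following along, the grep-plus-regex idea described in the question can be sketched roughly as below. This is only an illustration under assumptions: the sequences are made-up toy strings, `count_matching` is a hypothetical helper name, and the second pattern is just one possible reading of "G3, L7 only" (runs of exactly three G separated by loops of exactly seven nucleotides). The point is that the count only changes with the pattern once the regex is actually applied to whole sequences.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Permissive G4 pattern from the question: G-runs of 3 or more, loops of 1-7.
my $common = qr/(?:G{3,}\w{1,7}){3,}G{3,}/i;

# Assumed reading of "G3, L7 only": four runs of three G,
# separated by three loops of exactly seven nucleotides.
my $g3l7 = qr/(?:G{3}[ACGTUN]{7}){3}G{3}/i;

# Toy sequences (made up for illustration), one string per record.
my @seqs = (
    'AAGGGACUACUAGGGCAUCAUCGGGUUACUUAGGGAA',  # G3 runs, loops of 7
    'AAGGGAGGGAGGGAGGGAA',                    # G3 runs, loops of 1
    'ACUACUACUACU',                           # no G-runs
);

# Returns how many elements of the list match the compiled regex.
# grep in scalar context yields the number of matching elements.
sub count_matching {
    my ($regex, @sequences) = @_;
    return scalar grep { $_ =~ $regex } @sequences;
}

printf "common pattern: %d\n", count_matching($common, @seqs);
printf "G3-L7 pattern:  %d\n", count_matching($g3l7,   @seqs);
```

With these toy strings the permissive pattern matches two sequences and the stricter one matches one. Note that the stricter pattern as written does not forbid extra Gs flanking the motif or Gs inside the loops; ruling those out would probably need look-arounds or a narrower loop character class.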

Hi Isha,

Could you please revise the question to include more information about the background of the problem? It might, for example, not be clear to all readers what a G4 (a non-canonical nucleotide secondary structure) is, or why and where (RNA, DNA) you are looking for them. It is also unclear which approach you have tried and specifically which regular expression was used. You have to post the code you used for us to be able to spot any errors.

Also, see here: Quadruplex sequence batch prediction


Hello Michael Dondrup, I constructed the transcriptome sequences from reference-based RNA-seq data. As you said, a G4 is a non-canonical nucleotide secondary structure.

Code syntax: grep_function(regular_expression, @array)

I have uploaded my full script. Can you please help me out?


Hi, it is still not clear. Your script doesn't contain a pattern search, it is not even doing anything sensible now, except trying to make lines unique which is likely not what you want and does not work in the way you expect either. First, make your script into a proper strict syntax perl script. A proper perl script should start like this (works anywhere, except maybe for old SunOS/Solaris versions that have /bin/env instead of /usr/bin/env):

#!/usr/bin/env perl
use strict;
use warnings;

__END__

__END__ is optional, but you don't need to call exit at the end of the script.

Then properly declare all variables using my (required under strict) and go from there. I also recommend using the BioPerl FASTA parser in your code, because your code as is doesn't parse the format properly.
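
To illustrate the two points above (declaring variables with my under strict, and parsing FASTA by record rather than by line), here is a small plain-Perl sketch. It is not the BioPerl parser recommended above, only a stand-in that needs no extra modules; the toy FASTA text and the names `read_fasta` and `%seq_for` are made up for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Toy FASTA text (made up); tx2 and tx3 carry the same sequence,
# wrapped differently across lines.
my $fasta = <<'FASTA';
>tx1
ACGU
ACGU
>tx2
GGGA
>tx3
GG
GA
FASTA

# Reads FASTA records from a filehandle and returns a hash: id => full sequence.
sub read_fasta {
    my ($fh) = @_;
    my (%seq_for, $id);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\S+)/) {
            # Header line: start a new record.
            $id = $1;
            $seq_for{$id} = '';
        }
        elsif (defined $id) {
            # Sequence line: append to the current record.
            $seq_for{$id} .= $line;
        }
    }
    return %seq_for;
}

# Lexical filehandle opened on an in-memory string, so no input file is needed.
open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";
my %seq_for = read_fasta($fh);
close $fh;

# Deduplicate whole sequences, not individual lines.
my %seen;
my @unique = grep { !$seen{$_}++ } values %seq_for;

printf "%d records, %d unique sequences\n", scalar(keys %seq_for), scalar(@unique);
```

Deduplicating line by line, as in the original script, would instead collapse the two identical ACGU lines of tx1 and treat the differently wrapped tx2 and tx3 as distinct, which is probably not what is wanted.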


Hi Michael, I have written only half so far and will do the other half in a day or two. Do you mean like this:

#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;

## #Initialize the variables

my $gpat = '[G]{3}';
my $npat = '[A-Z]{1,7}';

my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;

sed -i '1s/^/>/' genome.fa;
my $seqin = Bio::SeqIO->new(-file => "genomeome.fa", -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">genome_100.fa", -format => "fasta");
while (my $seq = $seqin->next_seq) {
    if ($seq->length <= 100) {
        $seqout->write_seq($seq);
    }
}

#Open output file and input file respectively
open OUT, ">Quadwhatposn.refMrna.fa" or die;
open my $fh, '<', 'transcriptome.fa' or die $!;

#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO;

## #Initialize the variables
my $gpat = '[G]{3,}'; # I think more than 3 Gs could also yield a G4
my $npat = '[A-Z]{1,7}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i; # you let perl compile the pattern here, ok, this might have mild speed advantage

# I don't understand why you would call sed for inplace editing here, don't conflate the tasks
# if you run the script twice, the sed command will be called again and again
# make a proper input file in the first go

my $seqin  = Bio::SeqIO->new(-file => $ARGV[0], -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => '>'.$ARGV[0].'_G4.fa', -format => "fasta");

while (my $seq = $seqin->next_seq) {
    if ($seq->seq =~ /$regex/) {
        # write the sequences that match the G4 pattern only to the output file
        # add the length filter too, if you need to
        $seqout->write_seq($seq);
    }
}
__END__

This should be pretty straightforward to filter sequences which contain the G4 motif. (I didn't understand why you opened more files at the end.) Call the script with the input file as parameter and see if that works.

Hi Michael, I think the script can also be written directly in Perl without the BioPerl package. What do you say? I ask because I am getting confused. The other half will take time; I will add the loops next.

#!/usr/bin/perl
use strict;
use warnings;

#Initialize the variables
my $gpat = '[G]{4}';
my $npat = '[A-Z]{1,7}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;

#Open output file and input file respectively
open OUT, ">Quadwhatposn.refMrna.fa" or die;
open my $fh, '<', 'transcriptome.fa' or die $!;

my ($id, $seq);
my @totals = (0); #initialize the @counts arrays

I am not sure what you mean here due to language problems. Sure, you can parse FASTA files without BioPerl, but why, if it works? And it works fine for me. The most difficult and lengthy part is to install BioPerl, and with conda that is not even a big thing any more. I never had problems with speed when processing a genome or transcriptome. You will find plenty of examples for parsing a FASTA file directly in Perl though.

Hi Michael, I tried this script, but it still needs improvement. What are your suggestions?

#!/usr/bin/perl
use strict;
use warnings;

#Initialize the variables
my $gpat = '[G]{4}';
my $npat = '[A-Z]{1,7}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;

#Open output file and input file respectively
open OUT, ">Quadwhatposn.refMrna.fa" or die;
open my $fh, '<', 'transcriptome.fa' or die $!;

my ($id, $seq);
my @totals = (0); #initialize the @counts arrays

#Loop
while (<$fh>) {
    chomp;
    if (/^>(\w+)/) {
        process_seq($seq) if $id;

        $id = $1;
        $seq = '';
        print "$id\n";
        print OUT "$id\n";
    } elsif ($id) {
        $seq .= $_;
        process_seq(\$seq) if eof;
    }
}
#Print
print "@totals\n";
print OUT "Total: \n@totals\n";
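
The loop above calls process_seq without defining it, once with a string and once with a reference. One way the missing piece could look in plain Perl is sketched below. This is a guess at the intent, not the poster's actual code: `process_seq` here simply reports whether a sequence contains the motif, `count_hits` is a hypothetical wrapper, and the input is a made-up in-memory string rather than transcriptome.fa.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Pattern as built in the script above: four G-runs of four, loops of 1-7.
my $gpat  = 'G{4}';
my $npat  = '[A-Z]{1,7}';
my $regex = qr/$gpat$npat$gpat$npat$gpat$npat$gpat/i;

# Hypothetical helper: returns 1 if the sequence contains the motif, else 0.
sub process_seq {
    my ($seq) = @_;
    return $seq =~ $regex ? 1 : 0;
}

# Walks a FASTA filehandle and returns (records seen, records with a motif hit).
sub count_hits {
    my ($fh) = @_;
    my ($id, $seq, $total, $hits) = (undef, '', 0, 0);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\S+)/) {
            # A new header closes the previous record, if there was one.
            if (defined $id) { $total++; $hits += process_seq($seq); }
            ($id, $seq) = ($1, '');
        }
        elsif (defined $id) {
            $seq .= $line;
        }
    }
    # The last record has no following header, so handle it after the loop.
    if (defined $id) { $total++; $hits += process_seq($seq); }
    return ($total, $hits);
}

# Toy input (made up); the motif in record "a" spans a line break.
my $fasta = ">a\nAAGGGGAGGGGA\nGGGGAGGGGAA\n>b\nACUACUACU\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";
my ($total, $hits) = count_hits($fh);
close $fh;

print "Records: $total, with motif: $hits\n";
```

Handling the last record after the loop avoids the eof check inside it, and passing the sequence as a plain string throughout avoids the string/reference mismatch.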

I suggest you use BioPerl and the script I posted :)


Hello Michael, can you show me your complete script so that I can compare it with mine? Or, if you can't, can you tell me what algorithm you use for writing such scripts?