Question: Adapt perl script
0
gravatar for weissert
3.3 years ago by
weissert0
weissert0 wrote:

I am not a Perl coder and hope to find help in this forum. I aim to run the script.pl, shown below on the following input but do not get the desired output:

here is my desired output:

>Pseudo_TTA 33-35
AGATCTGTAAGCTCAAAATGGTAGAGCTCTCGTTATGAGAGTTTGGTGGTTCGAATCCACCCAGATCTGcca
>His_GTG 34-36
GTGGCTGTAGTTTAGTGGTGAGAATTCCACGTTGTGGCCGTGGAGACCTGGGCTCGAATCCCAGCAGCCACAcca
>Trp_CCA 33-35
GGATCCGTGGCGCAATGGTAGCGCGTCTGACTCCAGATCAGAAGGTTGCGTGTTCGATTGACGTCGGGTTCAcca
>Met_CAT 35-37
GGGGTGGTGGCGCAGTTGGCTAGCGCGTAGGTCTCATAATCCTGAGGTCGAGAGTTCGAGCCTCTCTCACCCCAcca

input file1: tRNAfasta.fa

>chrA01_622419_622487_+_Pseudo_TTA_28.44
AGATCTGTAAGCTCAAAATGGTAGAGCTCTCGTTATGAGAGTTTGGTGGTTCGAATCCACCCAGATCTG
>chrA01_2528533_2528604_+_His_GTG_63.70
GTGGCTGTAGTTTAGTGGTGAGAATTCCACGTTGTGGCCGTGGAGACCTGGGCTCGAATCCCAGCAGCCACA
>chrA01_2624977_2625048_+_Trp_CCA_69.73
GGATCCGTGGCGCAATGGTAGCGCGTCTGACTCCAGATCAGAAGGTTGCGTGTTCGATTGACGTCGGGTTCA
>chrA01_21878294_21878379_+_Met_CAT_61.56
GGGGTGGTGGCGCAGTTGGCTAGCGCGTAGGTCTCATAGCTATCTGAGTAATCCTGAGGTCGAGAGTTCGAGCCTCTCTCACCCCA

Input file 2: codonpos.txt

chrA01.trna1 (622419-622487)    Length: 69 bp
Type: Sup   Anticodon: TTA at 33-35 (622451-622453) Score: 28.44
Possible pseudogene:  HMM Sc=-4.46  Sec struct Sc=32.90
         *    |    *    |    *    |    *    |    *    |    *    |    *   
Seq: AGATCTGTAaGCTCAAaaTGGTAGAGCTCTCGTTATGAGAGTtTGGTGGTTCGAATCCACCCAGATCTG
Str: >>>>>>>...>>>>.........<<<<.>>>.....<<<.....>>>>>.......<<<<<<<<<<<<.

chrA01.trna2 (2528533-2528604)  Length: 72 bp
Type: His   Anticodon: GTG at 34-36 (2528566-2528568)   Score: 63.70
         *    |    *    |    *    |    *    |    *    |    *    |    *    | 
Seq: GTGGCTGTAGTTTAGTGGTgAGAATTCCACGTTGTGGCCGTGGAGACCTGGGCTCGAATCCCAGCAGCCACA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.

chrA01.trna3 (2624977-2625048)  Length: 72 bp
Type: Trp   Anticodon: CCA at 33-35 (2625009-2625011)   Score: 69.73
         *    |    *    |    *    |    *    |    *    |    *    |    *    | 
Seq: GGATCCGTGGCGCAATGGTAGCGCGTCTGACTCCAGATCAGAAGGtTGCGTGTTCGATTGACGTCGGGTTCA
Str: >>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<.....>>>>.........<<<<<<<<<<<.

chrA01.trna27 (21878294-21878379)   Length: 86 bp
Type: Met   Anticodon: CAT at 35-37 (21878328-21878330) Score: 61.56
Possible intron: 39-50 (21878332-21878343)
         *    |    *    |    *    |    *    |    *    |    *    |    *    |    *    |    *
Seq: GGGGTGGTGGCGCAGTTGGCtAGCGCGTAGGTCTCATAgctatctgagtaATCCTGAGGtCGAGAGTTCGAGCCTCTCTCACCCCA
Str: >>>>>>>..>>>>.........<<<<.>>>>.....................<<<<.....>>>>>.......<<<<<<<<<<<<.

here is the command:

perl script.pl  tRNAfasta.fa codonpos.txt > out.fa

script.pl:

$fa = shift;
open(FA,$fa);

while($line = <FA>){
   chomp $line; 
   if($line =~ />/){
      $header = $line;
   }
   my($tRNA,$chrID,$loc,$strand,$aa,$codon,$len,$pb,$sc,$score) = split(" ",$header);
   $tRNA =~  s/>Mus_musculus_tRNA-//g; 
      $chrID =~ s/[()]//g;
   my($id,$name) = split("-",$chrID);
   $tRNA_id_hash{$id} = $tRNA;
}

$sort = shift;
open(SORT,$sort);

while($name = <SORT>){
   chomp $name;
   $aminoAcidInfo = <SORT>;
   chomp $aminoAcidInfo;
   my($type,$aa,$anti,$anti_nt,$at,$anticodon_loc,$chr) = split(" ",$aminoAcidInfo);
   $HMM = <SORT>;
   if($HMM =~ /intron/){
      $intron = <SORT>;
   }
   $star = <SORT>;
   $seq = <SORT>;
   chomp $seq;
   $seq =~ s/Seq: //g;
   $seq = $seq."CCA";
   $seq = uc($seq);
   $structure = <SORT>;
   $blank = <SORT>;
   my($id,$len) = split(/\t/,$name);
   my($chrID,$loc) = split(" ",$id);
   print ">$tRNA_id_hash{$chrID}\t$anticodon_loc\n$seq\n";
}

Here is the current output which is not at all what I desired :-( I tried to make sense of it but do not get it.

>   33-35
AGATCTGTAAGCTCAAAATGGTAGAGCTCTCGTTATGAGAGTTTGGTGGTTCGAATCCACCCAGATCTG
CCA
>   34-36
STR: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.
CCA
>   |

CCA
>   
CCA
sequence gene software error • 969 views
ADD COMMENTlink modified 3.3 years ago by lakhujanivijay4.4k • written 3.3 years ago by weissert0

What is the use of input file1 here? It seems we can generate desired output from input file 2, codonpos.txt.

ADD REPLYlink written 3.3 years ago by venu6.2k

true. But the script is the only thing on hand and it requires the input of both file 1 and file 2.

ADD REPLYlink written 3.3 years ago by weissert0

A general advice: add comments to your code.

ADD REPLYlink written 3.3 years ago by dschika290
1
gravatar for Prakki Rama
3.3 years ago by
Prakki Rama2.3k
Singapore
Prakki Rama2.3k wrote:

If perl program is not compulsory, you can get desired output by running a series of "SED" commands (i used 8 in this case :-). I used only your "codonpos.txt" file to generate the output.

   egrep "^Type:|^Seq:" codonpos.txt \|
   sed 's/Sup/Pseudo/g' \| 
   sed '/^Type:/s% (.*%%' \|
   sed 's/Type: />/' \| 
   sed 's/   Anticodon: /_/' \|
   sed 's/ at /_/' \|
   sed 's/Seq: //g' \| 
   sed -e '/^>/! s/\(.*\)/\U\1/' \| 
   sed '/^>/! s%$%cca%'

Updated (as per Giovanni M Dall'Olio suggestion)

egrep "^Type:|^Seq:" input.txt | sed -e 's/Sup/Pseudo/g' -e '/^Type:/s% (.*%%' -e 's/Type: />/' -e 's/   Anticodon: /_/' -e 's/ at /_/' -e 's/Seq: //g' -e '/^>/! s/\(.*\)/\U\1/' -e '/^>/! s%$%cca%'

Output

>Pseudo_TTA_33-35
AGATCTGTAAGCTCAAAATGGTAGAGCTCTCGTTATGAGAGTTTGGTGGTTCGAATCCACCCAGATCTGcca
>His_GTG_34-36
GTGGCTGTAGTTTAGTGGTGAGAATTCCACGTTGTGGCCGTGGAGACCTGGGCTCGAATCCCAGCAGCCACAcca
>Trp_CCA_33-35
GGATCCGTGGCGCAATGGTAGCGCGTCTGACTCCAGATCAGAAGGTTGCGTGTTCGATTGACGTCGGGTTCAcca
>Val_CAC_35-37
GTCTGGGTGGTGTAGTCGGTTATCACGCTAGTCTCACACACTAGAGGTCCCCGGTTCGAACCCGGGCTCAGACAcca
ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Prakki Rama2.3k
1

You could invoke sed only once, specifying each expression with -e. This should reduce IO usage.

ADD REPLYlink written 3.3 years ago by Giovanni M Dall'Olio26k

@Giovanni M Dall'Olio: Updated. Thank you!

ADD REPLYlink written 3.3 years ago by Prakki Rama2.3k

perl is not a must :-)

I copied your script and saved it in script.sed Running on Ubuntu, executed the following commmand:

sed script.sed codonpos.txt

This is what happend:

sed: -e expression #1, char 10: unterminated `s' command
ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by weissert0

If you copied and pasted Prakki Rama's code into a file named script.sed, you have to type:

bash script.sed > output.fa

But, you can use Prakki Rama's command directly in the terminal, i.e., cd into the folder with the codonpos.txt and paste the command to your terminal.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by dschika290
0
gravatar for lakhujanivijay
3.3 years ago by
lakhujanivijay4.4k
India
lakhujanivijay4.4k wrote:

if you can explain what the pseudo code/algorithm is or whatever your are trying to achieve in a little more detail, may be I can help you. Waiting for your response!

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by lakhujanivijay4.4k

a) Reformat the header of the fasta sequences to be: >tRNA_name predicted_anti-codon_positions b) Remove predicted introns c) add CCA to the sequence tail

does it help`?

ADD REPLYlink written 3.3 years ago by weissert0
0
gravatar for Anima Mundi
3.3 years ago by
Anima Mundi2.4k
Italy
Anima Mundi2.4k wrote:

An ugly (but working) solution in Python:

def define_field(string, field_number): 
    field = ''
    take = False 
    string = ':' + string 
    string = string.replace('\n', '') 

    for char in string:

        if char == ':':

            if field_number - 1 == 0: 
                take = True
                field_number -= 1
            elif field_number - 1 == -1: 
                break
            else:
                field_number -= 1 
        else:
            if take:
                field += char

    return field


type = ''
anti = ''
at = ''

header = ''
seq = ''
for line in open('codonpos.txt'):
    if 'Type' in line:
        type = define_field(line, 2).replace('Anticodon','').replace('_','')
        anti = define_field(line, 3)[0:4]
        at = define_field(line, 3)[7:14]

        header = '>' + type.replace(' ','') + '_' + anti.replace(' ','') + at
        print header
    elif 'Seq' in line:
        print line.replace('Seq: ','').replace('\n','').upper() + 'cca'
    else:
        pass
ADD COMMENTlink written 3.3 years ago by Anima Mundi2.4k
0
gravatar for lakhujanivijay
3.3 years ago by
lakhujanivijay4.4k
India
lakhujanivijay4.4k wrote:

Here is a perl script for your purpose:

run it like $ perl test.pl

output file called out.fa will be created in same directory where you save this script.

open($fh, "codonpos.txt") or die "Can't open!";
open($fg, ">out.fa") or die "Can't open!";

@file=<$fh>;

foreach $line(@file){

if($line=~/^Type: (\w+)\s+Anticodon:\s(\w+)\sat\s(\d+\-\d+)/){

    if($1 eq 'Sup'){
    $codon='Psuedo';
    }

    else{
    $codon=$1;
    }

print $fg '>'.$codon.'_'.$2.'_'.$3."\n";
}

if($line=~/Seq: (\w+)/){
print $fg $1.'cca'."\n";
}


}

close $fh;
close $fg;

Please let me know if it works for you. In case of any difficulty, please get back to me.

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by lakhujanivijay4.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2704 users visited in the last hour