Question: Printing only the first of multiple matching entries in a Perl program
4 months ago, kriti.awasthi23 wrote:

Hi, I am trying to read a GenBank file, which I have done successfully. Now I am trying to fix a problem in this program: wherever it finds a /gene qualifier, it prints every match. I want the gene to be printed only once per locus. I tried looping, and incrementing a counter and then resetting it to 0, but I have not been able to implement this properly. I am posting the code below. Thanks in advance.

SAMPLE FILE

 LOCUS       NR_046018               1652 bp    RNA     linear   PRI 12-MAY-2017
DEFINITION  Homo sapiens DEAD/H-box helicase 11 like 1 (DDX11L1), non-coding RNA.
ACCESSION   NR_046018 XM_003403543
VERSION     NR_046018.2
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
ORGANISM    Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1652)
AUTHORS     Costa V, Casamassimi A, Roberto R, Gianfrancesco F, Matarazzo MR,
            D'Urso M, D'Esposito M, Rocchi M and Ciccodicola A.
TITLE       DDX11L: a novel transcript family emerging from human subtelomeric regions
JOURNAL     BMC Genomics 10, 250 (2009)
PUBMED      19476624
REMARK      Publication Status: Online-Only
COMMENT     VALIDATED REFSEQ: This record has undergone validation or
            preliminary review. The reference sequence was derived from
            AM992871.1.
            On Jun 5, 2012 this sequence version replaced NR_046018.1.

            ##Evidence-Data-START##
           Transcript exon combination :: AM992871.1, BM920886.1 [ECO:0000332]
           RNAseq introns              :: single sample supports all introns
                                       SAMEA1968968, SAMEA2148874
                                       [ECO:0000348]
           ##Evidence-Data-END##
PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
        1-1652              AM992871.1         1-1652
FEATURES             Location/Qualifiers
 source          1..1652
                 /organism="Homo sapiens"
                 /mol_type="transcribed RNA"
                 /db_xref="taxon:9606"
                 /chromosome="1"
                 /map="1p36.33"
 gene            1..1652
                 /gene="DDX11L1"
                 /note="DEAD/H-box helicase 11 like 1"
                 /pseudo
                 /db_xref="GeneID:100287102"
                 /db_xref="HGNC:HGNC:37102"
 misc_RNA        1..1652
                 /gene="DDX11L1"
                 /product="DEAD/H-box helicase 11 like 1"
                 /pseudo
                 /db_xref="GeneID:100287102"
                 /db_xref="HGNC:HGNC:37102"

CODE:

open (INFILE,"rna.txt");
while ($line= <INFILE>)
{
    chomp($line);
    if ($line =~ /(LOCUS\s*)(\w*)(.*)/)
    {
        print "\n";
        print "Locus: $2\t";
    }
    elsif($line =~ /^\s*\/gene\=\"(.+)\"/ )
    {
        print "Gene: $1\n";
    }
}

After running this script, the output is:

   Locus: NR_046018       Gene: DDX11L1
   Gene: DDX11L1

Hello,

if the file contains only one locus, you can leave the loop with the last statement as soon as you have found the gene line.

fin swimmer

Reply written 4 months ago by finswimmer

Hello, my real file is long; I only posted a short excerpt here. I tried the last statement but could not get the desired result. It would be great if you could explain with a small example.

Reply written 4 months ago by kriti.awasthi23

I'm not familiar with Perl. Try this:

open (INFILE,"rna.txt");

while ($line= <INFILE>)
{
    chomp($line);
    if ($line =~ /(LOCUS\s*)(\w*)(.*)/)
    {
        print  "\n";
        print "Locus: $2\t";
        $gene = 0;    # new locus: no gene printed for it yet
    }
    elsif($line =~ /^\s*\/gene\=\"(.+)\"/ && !$gene)
    {
        print "Gene: $1\n";
        $gene = 1;    # ignore further /gene lines until the next LOCUS
    }
}

fin swimmer

Reply written 4 months ago by finswimmer

Looks good, just remember to declare and initialize variables so it works with strict:

my $gene = 1; # don't print anything before we have seen a LOCUS tag
open (INFILE,"rna.txt");
....
Reply written 4 months ago by Michael Dondrup
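
Putting the two replies above together, a minimal sketch of the flag approach that runs under strict and warnings might look like the following. The sub name first_gene_per_locus is a hypothetical helper, and the second record in the sample text is invented purely to show the flag being reset; neither comes from the original script. The sample is read through an in-memory filehandle so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one [locus, gene] pair per LOCUS record,
# keeping only the first /gene= qualifier seen after each LOCUS line.
sub first_gene_per_locus {
    my ($fh) = @_;
    my @pairs;
    my $seen_gene = 1;    # ignore /gene= lines until a LOCUS line is seen
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^\s*LOCUS\s+(\S+)/) {
            push @pairs, [$1, undef];
            $seen_gene = 0;    # new record: allow one gene again
        }
        elsif (!$seen_gene && $line =~ /^\s*\/gene="([^"]+)"/) {
            $pairs[-1][1] = $1;
            $seen_gene = 1;    # skip further /gene= lines in this record
        }
    }
    return @pairs;
}

# Sample input: the first record is abridged from the question,
# the second record is made up for illustration.
my $sample = <<'END';
LOCUS       NR_046018               1652 bp    RNA     linear   PRI 12-MAY-2017
     gene            1..1652
                     /gene="DDX11L1"
     misc_RNA        1..1652
                     /gene="DDX11L1"
//
LOCUS       XX_000001               100 bp    RNA     linear   PRI 01-JAN-2000
     gene            1..100
                     /gene="EXAMPLE1"
     misc_RNA        1..100
                     /gene="EXAMPLE1"
//
END

open(my $fh, '<', \$sample) or die "Cannot open in-memory sample: $!";
my @pairs = first_gene_per_locus($fh);
close $fh;

for my $pair (@pairs) {
    my ($locus, $gene) = @$pair;
    print "Locus: $locus\tGene: ", (defined $gene ? $gene : '-'), "\n";
}
```

To run this against a real file, the in-memory open could be replaced with open(my $fh, '<', 'rna.txt'). Note that this still takes the first /gene= line after each LOCUS regardless of which feature it belongs to.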

Hi Kriti,

This should work, as mentioned by finswimmer.

Note the last statement:

open (INFILE,"GB.txt");
while ($line= <INFILE>)
{
    chomp($line);
    if ($line =~ /(LOCUS\s*)(\w*)(.*)/)
    {
        print  "\n";
        print "Locus: $2\t";
    }
    elsif($line =~ /^\s*\/gene\=\"(.+)\"/ )
    {
        print "Gene: $1\n";
        last;
    }
}

Output

Locus: NR_046018    Gene: DDX11L1
Reply written 4 months ago by Vijay Lakhujani (modified 4 months ago)

Hello sir, thanks for your reply. I have tried this, but the last statement is not useful here because the GenBank file contains other loci and genes too. I hope I have explained this properly. Please suggest a method that first matches this line:

  gene            1..1652

and then matches this line:

  /gene="DDX11L1"
Reply written 4 months ago by kriti.awasthi23 (modified 4 months ago by Vijay Lakhujani)

Got the point. Shall get back to this. Also, please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Reply written 4 months ago by Vijay Lakhujani
4 months ago, Michael Dondrup (Bergen, Norway) wrote:

While this script might work in your special case, I highly recommend using the BioPerl GenBank parser instead. There are scenarios where the regex approach could fail, e.g. false positives (where the /gene= string appears outside of a gene feature), multiple genes per locus, or the tags of a locus appearing in a different order than expected. The documentation/tutorial at https://bioperl.org/howtos/Features_and_Annotations_HOWTO.html#item12 shows specifically how to extract the values of primary_tags, of which "gene" is one.
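
As a rough sketch of what this recommendation could look like in practice, the following uses Bio::SeqIO to walk the features of each record and print the /gene value of every feature whose primary tag is gene. It assumes BioPerl is installed; the sub name print_genes, the output layout, and taking the file name from the command line are illustrative choices, not something given in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper; requires BioPerl (Bio::SeqIO) to be installed.
sub print_genes {
    my ($file) = @_;
    require Bio::SeqIO;    # loaded at run time, so a missing BioPerl fails here
    my $in = Bio::SeqIO->new(-file => $file, -format => 'genbank');
    while (my $seq = $in->next_seq) {
        for my $feat ($seq->get_SeqFeatures) {
            # Only look at gene features; skip source, misc_RNA, and so on.
            next unless $feat->primary_tag eq 'gene';
            next unless $feat->has_tag('gene');
            my ($name) = $feat->get_tag_values('gene');
            printf "Locus: %s\tGene_location: %d..%d\tGene: %s\n",
                $seq->display_id, $feat->start, $feat->end, $name;
        }
    }
}

# Usage: perl script.pl rna.txt
print_genes($ARGV[0]) if @ARGV;
```

Because the check is on the feature's primary tag, the /gene qualifier repeated under misc_RNA is not printed, and a record with several gene features yields one line per gene.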

ADD COMMENTlink written 4 months ago by Michael Dondrup45k
4 months ago, kriti.awasthi23 wrote:

Hi, this is the answer I came up with.

open (INFILE,"rna.txt");
$gene=0;    # set to 1 while we are inside a gene feature
while ($line= <INFILE>)
{
    chomp($line);
    if ($line =~ /(LOCUS\s*)(\w*)(.*)/)
    {
        print  "\n";
        print "Locus: $2\t";
    }
    elsif ($line =~ /(\s*gene\s*)(\d*)(\.\.)(\d*)/)
    {
        # feature line such as "gene  1..1652": remember its location
        $begin= $2;
        $end= $4;
        print  "Gene_length: $begin..$end\t";
        $gene = 1;
    }
    elsif($gene == 1 && $line =~ /\s+\/gene\=\"(.+)\"/)
    {
        # first /gene qualifier after the gene feature line
        print " Gene $1\t";
        $gene = 0;
    }
}
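
The same idea can be written so that it runs under strict and collects results instead of printing them as it goes. This is a sketch, not a full GenBank parser: it assumes simple start..end locations (no join(...) or complement(...)), and it treats any other feature key line as the end of the current gene feature. The sub name genes_by_locus and the abridged inline sample are illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of hash refs, one per gene feature.
sub genes_by_locus {
    my ($fh) = @_;
    my (@genes, $locus, $in_gene);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^\s*LOCUS\s+(\S+)/) {
            $locus   = $1;
            $in_gene = 0;
        }
        elsif ($line =~ /^\s+gene\s+(\d+)\.\.(\d+)\s*$/) {
            # Start of a gene feature: remember its location.
            push @genes, { locus => $locus, begin => $1, end => $2, name => undef };
            $in_gene = 1;
        }
        elsif ($line =~ /^\s+\w+\s+\S*\d+\.\.\d+/) {
            # Any other feature key (source, misc_RNA, ...) ends the gene feature.
            $in_gene = 0;
        }
        elsif ($in_gene && $line =~ /^\s+\/gene="([^"]+)"/) {
            # First /gene qualifier inside the current gene feature.
            $genes[-1]{name} = $1;
            $in_gene = 0;
        }
    }
    return @genes;
}

# Abridged copy of the sample record from the question.
my $sample = <<'END';
LOCUS       NR_046018               1652 bp    RNA     linear   PRI 12-MAY-2017
FEATURES             Location/Qualifiers
     source          1..1652
                     /organism="Homo sapiens"
     gene            1..1652
                     /gene="DDX11L1"
                     /note="DEAD/H-box helicase 11 like 1"
     misc_RNA        1..1652
                     /gene="DDX11L1"
                     /product="DEAD/H-box helicase 11 like 1"
//
END

open(my $fh, '<', \$sample) or die "Cannot open in-memory sample: $!";
my @genes = genes_by_locus($fh);
close $fh;

for my $g (@genes) {
    my $name = defined $g->{name} ? $g->{name} : '-';
    print "Locus: $g->{locus}\tGene_location: $g->{begin}..$g->{end}\tGene: $name\n";
}
```

Collecting the records first makes it easy to handle several gene features per locus, since each one simply becomes another entry in the returned list.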