Question: Perl Programming - How to join two columns of a text file, in which values of the first column should match in order with the values of the second column
5 weeks ago, genomics_student wrote:

I am a beginner with Perl programming. The problem I am working on right now is how to get the gene length from a text file. The text file contains the gene name (column 10), start site (column 6), and end site (column 7). The length can be derived from the difference between columns 7 and 6. My problem is how to match each gene name (from column 10) with the corresponding difference between column 7 and column 6. Thank you very much!

open (IN, "Alu.txt");
open (OUT, ">Alu_subfamlength3.csv");

while ($a = <IN>){
        @data = split (/\t/, $a);
        $list {$data[10]}++;
        $genelength {$data[7] - $data[6]};
}
        foreach $sub (keys %list){
        $gene = join ($sub, $genelength);
        print "$gene\n";
}
close (IN);
close (OUT);
sequence gene genome • 188 views
modified 5 weeks ago by SMK • written 5 weeks ago by genomics_student

"I am a beginner with Perl programming."

If you are now starting out with programming and can choose which language to learn, Perl is not the best choice. It was amazing for bioinformatics 10 years ago, but nowadays a better choice would be Python or R.

written 5 weeks ago by WouterDeCoster

Good advice from @WouterDeCoster.

Or at least plan to learn Perl and Python or R. I started off learning Perl (early 2000s), then Python and R. I would not recommend Perl today. All data analytics is focused on Python or R. If you want to learn programming (and possibly machine learning), learn Python. If you want to learn about data analysis, statistics, and data visualization, learn R. R is not a good way to learn programming, as the language was designed not for learning how to program but for analyzing data. See: https://en.wikipedia.org/wiki/Python_(programming_language)#Features_and_philosophy.

written 5 weeks ago by Vince

If I understand your question right, why are you not processing the file line by line so you can keep track of gene name and associate it with the gene length for each line?

written 5 weeks ago by genomax

Processing the file line by line is how I obtain the difference of data 7 - data 6. However, my problem is how to attach the corresponding gene name or gene ID, which is in column 10. Looping over or joining the columns in order using Perl code is what I'm trying to figure out right now. Thank you!

written 5 weeks ago by genomics_student
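
The line-by-line idea discussed above can be sketched as follows. This is only an illustration under stated assumptions: the two rows in @lines are made-up stand-ins for lines of Alu.txt, and the names @lines, @rows and $copy are hypothetical. Because the gene name and the length are taken from the same split of the same line, they stay paired without any extra joining step.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory rows standing in for lines read from Alu.txt
my @lines = (
    join( "\t", ('.') x 5, 10,  20,  '.', '.', 'Gene1' ) . "\n",
    join( "\t", ('.') x 5, 100, 120, '.', '.', 'Gene2' ) . "\n",
);

my @rows;    # one "name<TAB>length" string per input line
foreach my $line (@lines) {
    chomp( my $copy = $line );    # work on a copy so @lines is not modified
    my @data = split( /\t/, $copy );

    # Name (index 9) and length (index 6 minus index 5) come from the
    # same @data, so they are matched automatically
    push @rows, join( "\t", $data[9], $data[6] - $data[5] );
}

print "$_\n" for @rows;
```

With real input, the foreach over @lines would be replaced by a while loop reading from the opened file.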

Hello genomics_student!

It appears that your post has been cross-posted to another site: https://stackoverflow.com/questions/57076449

This is typically not recommended as it runs the risk of annoying people in both communities.

written 5 weeks ago by Pierre Lindenbaum

5 weeks ago, SMK wrote:

Hi genomics_student,

Some points:

  • Always try to include use strict; and use warnings;
  • You have to add OUT after print to actually write anything to Alu_subfamlength3.csv
  • If column 10 is the last column, you have to do chomp($a) before you split it
  • In Perl, array indexes start at 0, so the indexes of columns 6, 7, 10 will be 5, 6, 9
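
The chomp and index points can be checked with a small sketch. It assumes a single made-up line with ten tab-separated fields; the names $line, @raw, $start, $end and $length are hypothetical and chosen only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line: ten tab-separated fields ending in a newline
my $line = join( "\t", ('.') x 5, 10, 20, '.', '.', 'Gene1' ) . "\n";

# Without chomp, the last field keeps the trailing newline
my @raw = split( /\t/, $line );

# With chomp, the last field is just the gene name
chomp($line);
my @data = split( /\t/, $line );

# Columns 6, 7 and 10 live at indexes 5, 6 and 9
my ( $start, $end, $gene ) = @data[ 5, 6, 9 ];
my $length = $end - $start;

print "raw last field ends with newline: ",
    ( $raw[9] =~ /\n\z/ ? "yes" : "no" ), "\n";
print "$gene\t$length\n";
```

If the newline were left on the gene name, it would become part of any hash key built from that field and would also break the output layout.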

Let's say your Alu.txt looks something like this:

$ cat Alu.txt
.   .   .   .   .   10  20  .   .   Gene1
.   .   .   .   .   50  90  .   .   Gene1
.   .   .   .   .   100 120 .   .   Gene2
.   .   .   .   .   150 180 .   .   Gene2

You can change your code to:

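#!/usr/bin/perl
use strict;
use warnings;

# Stop with a message if either file cannot be opened
open( IN,  "<", "Alu.txt" )               or die "Cannot open Alu.txt: $!";
open( OUT, ">", "Alu_subfamlength3.csv" ) or die "Cannot open Alu_subfamlength3.csv: $!";

my %genelength;

while ( my $a = <IN> ) {
    chomp($a);
    my @data = split( /\t/, $a );
    # Columns 6, 7 and 10 are indexes 5, 6 and 9; lengths are summed per gene name
    $genelength{ $data[9] } += $data[6] - $data[5];
}
# Sort the keys so the output order is the same on every run
foreach my $gene ( sort keys %genelength ) {
    print OUT $gene, "\t", $genelength{$gene}, "\n";
}

close(IN);
close(OUT);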

Where the output will be:

$ cat Alu_subfamlength3.csv
Gene1   50
Gene2   50
modified 5 weeks ago • written 5 weeks ago by SMK

Thank you very much! I see, '+=' would match data 9 (gene name) with its corresponding difference (of data 5 and data 6).

written 5 weeks ago by genomics_student

I've updated some points and suggestions for you. Have fun with Perl, but do take the suggestions from WouterDeCoster into consideration... ;-)

modified 5 weeks ago • written 5 weeks ago by SMK

$genelength{ $data[9] } += $data[6] - $data[5];

This line is creating a key --> value pair: the gene name is the key, and the value is the running sum of the lengths seen for that gene.

written 5 weeks ago by genomax
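
How the key --> value pair builds up can be seen in a minimal sketch. It assumes hypothetical (name, start, end) triples that mirror the sample Alu.txt rows in the answer; @rows and $row are illustrative names, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical (name, start, end) triples mirroring the sample Alu.txt rows
my @rows = (
    [ 'Gene1', 10,  20 ],
    [ 'Gene1', 50,  90 ],
    [ 'Gene2', 100, 120 ],
    [ 'Gene2', 150, 180 ],
);

my %genelength;
for my $row (@rows) {
    my ( $gene, $start, $end ) = @$row;

    # First sighting of $gene creates the key; later sightings add to its value
    $genelength{$gene} += $end - $start;
}

print "$_\t$genelength{$_}\n" for sort keys %genelength;
```

Each gene ends up with one entry holding its total (50 for both genes here), which is why repeated gene names collapse into a single output line.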

Thank you very much again! :)

written 5 weeks ago by genomics_student

Thank you so much for all the tips! I am very much a beginner in Perl. As a student, I am required to learn it. Reading the basic Perl manuals has not made it easy for me to write and modify code.

written 5 weeks ago by genomics_student

"As a student, I am required to learn it."

Sadly, I can relate to this. A lot of institutions are yet to catch up with where the bioinformatics world is, which is that Perl is more specialized than widespread these days. Python/R are the go-to for quick scripting. Perl is used when there is a specific reason to use it. If possible, provide feedback that your curriculum is using an outdated way of teaching how to script and interact with biological data formats.

written 5 weeks ago by RamRS