Question

Perl Programming - How to join two columns of a text file, in which values of the first column should match in order with the values of the second column

0

6.0 years ago

genomics_student • 0

I am a beginner with Perl programming. The problem I am working on right now is how to get the gene length from a text file. The text file contains the gene name (column 10), start site (column 6), and end site (column 7). The length is the difference between column 7 and column 6. My problem is how to match each gene name (from column 10) with the corresponding difference between column 6 and column 7. Thank you very much!

open (IN, "Alu.txt");
open (OUT, ">Alu_subfamlength3.csv");

while ($a = <IN>){
        @data = split (/\t/, $a);
        $list {$data[10]}++;
        $genelength {$data[7] - $data[6]};
}
        foreach $sub (keys %list){
        $gene = join ($sub, $genelength);
        print "$gene\n";
}
close (IN);
close (OUT);

sequence genome gene • 3.5k views

ADD COMMENT • link updated 6.0 years ago by AK ★ 2.2k • written 6.0 years ago by genomics_student • 0

2

I am a beginner with Perl programming.

If you are now starting out with programming and can choose which language to learn, Perl is not the best choice. It was amazing for bioinformatics 10 years ago, but nowadays a better choice would be Python or R.

ADD REPLY • link 6.0 years ago by WouterDeCoster 48k

1

Good advice from @WouterDeCoster.

Or at least plan to learn Perl and Python or R. I started off learning Perl (early 2000s), then Python and R. I would not recommend Perl today. All data analytics is focused on Python or R. If you want to learn programming (and possibly machine learning), learn Python. If you want to learn about data analysis, statistics, and data visualization, learn R. R is not a good way to learn programming, as the language was designed for analyzing data rather than for learning how to program. See: https://en.wikipedia.org/wiki/Python_(programming_language)#Features_and_philosophy.

ADD REPLY • link 6.0 years ago by Vince ▴ 150

0

If I understand your question right, why are you not processing the file line by line so you can keep track of gene name and associate it with the gene length for each line?
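A minimal sketch of that line-by-line idea, assuming tab-separated input with the start, end and gene name in columns 6, 7 and 10 as described in the question. The helper name name_and_length and the sample rows are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns "name<TAB>length" for one tab-separated line.
# Assumes column 6 = start, column 7 = end, column 10 = gene name (1-based).
sub name_and_length {
    my ($line) = @_;
    chomp $line;
    my @data = split /\t/, $line;
    return join "\t", $data[9], $data[6] - $data[5];
}

# Two sample rows in the layout the question describes.
my @rows = (
    ".\t.\t.\t.\t.\t10\t20\t.\t.\tGene1\n",
    ".\t.\t.\t.\t.\t100\t120\t.\t.\tGene2\n",
);

# The name and the length come from the same line, so they stay paired.
print name_and_length($_), "\n" for @rows;
```

Because both values are read from the same @data array inside one iteration, no separate matching step is needed.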

ADD REPLY • link 6.0 years ago by GenoMax 152k

0

Processing the file line by line is how I obtain the difference of data 7 - data 6. However, my problem is how to attach the corresponding gene name or gene id, which is in column 10. Looping over or joining the columns in order using Perl code is what I'm trying to figure out right now. Thank you!

ADD REPLY • link 6.0 years ago by genomics_student • 0

0

Hello genomics_student!

It appears that your post has been cross-posted to another site: https://stackoverflow.com/questions/57076449

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link 6.0 years ago by Pierre Lindenbaum 166k

score 0 · Answer 1 · 2019-07-17

0

6.0 years ago

AK ★ 2.2k

Hi genomics_student,

Some points:

Always include use strict; and use warnings;
You have to add OUT after print to actually write anything to Alu_subfamlength3.csv
If column 10 is the last column, you have to do chomp($a) before you split it, otherwise the gene name keeps the trailing newline
In Perl, array indexes start at 0, so the indexes of columns 6, 7, 10 will be 5, 6, 9
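The chomp and index points can be checked on a single line. This is only a sketch with a made-up sample line, assuming tab-separated fields:

```perl
use strict;
use warnings;

my $line = ".\t.\t.\t.\t.\t10\t20\t.\t.\tGene1\n";

# Without chomp, the last field keeps the trailing newline.
my @raw = split /\t/, $line;
print "last field ends with a newline\n" if $raw[9] =~ /\n\z/;

# With chomp, the gene name is clean.
chomp $line;
my @data = split /\t/, $line;

# Columns 6, 7 and 10 (1-based) are indexes 5, 6 and 9 (0-based).
my ( $start, $end, $name ) = @data[ 5, 6, 9 ];
print "$name\t", $end - $start, "\n";
```

An unchomped gene name would still work as a hash key, but every key would carry a hidden newline that then shows up in the output.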

Suppose your Alu.txt looks something like this (columns separated by tabs):

$ cat Alu.txt
.   .   .   .   .   10  20  .   .   Gene1
.   .   .   .   .   50  90  .   .   Gene1
.   .   .   .   .   100 120 .   .   Gene2
.   .   .   .   .   150 180 .   .   Gene2

You can change your code to:

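#!/usr/bin/perl
use strict;
use warnings;

# Three-argument open; stop with a message if a file cannot be opened.
open( IN,  '<', "Alu.txt" )               or die "Cannot open Alu.txt: $!";
open( OUT, '>', "Alu_subfamlength3.csv" ) or die "Cannot open Alu_subfamlength3.csv: $!";

my %genelength;    # gene name => summed length

# $line is used instead of $a, which Perl reserves for sort.
while ( my $line = <IN> ) {
    chomp($line);
    my @data = split( /\t/, $line );
    # Columns 6, 7 and 10 are indexes 5, 6 and 9.
    $genelength{ $data[9] } += $data[6] - $data[5];
}
# Sort the names so the output order is reproducible.
foreach my $gene ( sort keys %genelength ) {
    print OUT $gene, "\t", $genelength{$gene}, "\n";
}

close(IN);
close(OUT) or die "Cannot close Alu_subfamlength3.csv: $!";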

Where the output will be:

$ cat Alu_subfamlength3.csv
Gene1   50
Gene2   50
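Note that this sums the lengths of all rows sharing a gene name: Gene1 is (20 - 10) + (90 - 50) = 50. If each row's length is wanted separately instead, one possible variation (an assumption about what might be needed, not part of the answer above) is to keep a list of lengths per name. The %lengths hash and the inline sample rows are illustrative:

```perl
use strict;
use warnings;

# Same sample rows as the Alu.txt example (tab-separated).
my @rows = (
    ".\t.\t.\t.\t.\t10\t20\t.\t.\tGene1",
    ".\t.\t.\t.\t.\t50\t90\t.\t.\tGene1",
    ".\t.\t.\t.\t.\t100\t120\t.\t.\tGene2",
    ".\t.\t.\t.\t.\t150\t180\t.\t.\tGene2",
);

my %lengths;    # gene name => reference to an array of lengths
for my $row (@rows) {
    my @data = split /\t/, $row;
    push @{ $lengths{ $data[9] } }, $data[6] - $data[5];
}

# keys alone returns names in no fixed order, so sort them for stable output.
for my $gene ( sort keys %lengths ) {
    print $gene, "\t", join( ",", @{ $lengths{$gene} } ), "\n";
}
```

With the sample rows this lists 10 and 40 for Gene1 and 20 and 30 for Gene2, rather than a single total of 50 each.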

ADD COMMENT • link 6.0 years ago by AK ★ 2.2k

0

Thank you very much! I see, '+=' would match data 9 (gene name) with its corresponding difference (of data 5 and data 6).

ADD REPLY • link 6.0 years ago by genomics_student • 0

0

I've updated some points and suggestions for you. Have fun with Perl, but do take the suggestions from WouterDeCoster into consideration... ;-)

ADD REPLY • link 6.0 years ago by AK ★ 2.2k

0

$genelength{ $data[9] } += $data[6] - $data[5];

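Is creating a key --> value pair: the gene name in $data[9] is the key, and the value is the running sum of $data[6] - $data[5] for that name.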
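To watch the key --> value pair being built, here is a small sketch that prints the hash entry after each update. The @records list is a hand-written stand-in for parsed lines, not read from Alu.txt:

```perl
use strict;
use warnings;

my %genelength;

# (gene name, start, end) triples standing in for parsed lines.
my @records = ( [ "Gene1", 10, 20 ], [ "Gene1", 50, 90 ], [ "Gene2", 100, 120 ] );

for my $rec (@records) {
    my ( $name, $start, $end ) = @$rec;

    # The first time a name is seen, the key is created;
    # on later lines the stored value grows by the new difference.
    $genelength{$name} += $end - $start;
    print "$name => $genelength{$name}\n";
}
```

The += operator treats a missing entry as 0, so no explicit initialisation of each gene name is needed.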

ADD REPLY • link 6.0 years ago by GenoMax 152k

0

Thank you very much again! :)

ADD REPLY • link 6.0 years ago by genomics_student • 0

0

Thank you so much for all the tips! I am very much a beginner in Perl. As a student, I am required to learn it. Reading the basic Perl manuals has not made it easy for me to write and modify code.

ADD REPLY • link 6.0 years ago by genomics_student • 0

0

As a student, I am required to learn it.

Sadly, I can relate to this. A lot of institutions are yet to catch up with where the bioinformatics world is, which is that Perl is more specialized than widespread these days. Python/R are the go-to for quick scripting. Perl is used when there is a reason to use it. If possible, provide feedback that your curriculum is using an outdated way of teaching how to script and interact with biological data formats.

ADD REPLY • link 6.0 years ago by Ram 45k