extract data from txt files
1
0
Entering edit mode
8.6 years ago
fufuyou ▴ 110

I want to extract gene expression values from a txt file based on gene names.

For example:

A 1500
B 2500
C 3500
D 3400

I want to extract A and C gene expression values using gene name file extract it.

A 1500
C 3500

I have written some perl code, but it is not work well. Could you help me modify it or give me a new code?

Thanks,

#!/usr/bin/perl -w
use strict;
open(IN,"file1");
open(NAME,"file2");
open(OUT,">result");
while(<IN>){
  chomp;
  @data=split / /,$_;
  $hash{$data[0]}=$data[1];
}
while(<NAME>){
    chomp;
    print OUT "$_ $hash{$_}\n";
}
close IN;
close NAME;
close OUT;
RNA-Seq • 3.4k views
ADD COMMENT
4
Entering edit mode

you are reinventing the wheel: the solution is sorting both files and using linux join: http://linux.die.net/man/1/join

ADD REPLY
0
Entering edit mode

Thanks Pierre Lindenbaum and Zaag,

What is wrong of my perl code?

ADD REPLY
1
Entering edit mode

Please note that I vote for Pierre's answer! Nonetheless, I will try to shed some light on your programming experiments.

I am not sure, that

@data=split / /,$_;

works as you expect. I believe / / really tries to match $_ against the regular expression containing a single space. Using some parenthesis for your code, one gets:

@data=split( $_ =~ / /, $_ );

The reg-ex matching returns the result by means of 1 and '' (or something similar) which evaluate to boolean TRUE and FALSE in Perl. Consequently, Perl will evaluate the statement to:

@data=split('', $_);

or

@data=split('1', $_);

depending on the content of $_, i.e., does $_ contain a single space. Finally, the line will be splitted by '1' or '' which is not what you want, right?!

Tip 1:

Try to debug your script using Data::Dumper but be aware that you might get a long list of results:

#!/usr/bin/perl -w
use strict;
use Data::Dumper;

open(IN,"file1");
open(NAME,"file2");
open(OUT,">result");
while(<IN>){
  chomp;
  @data=split / /,$_;
  $hash{$data[0]}=$data[1];
}
print Dumper(\%hash);
while(<NAME>){
    chomp;
    print OUT "$_ $hash{$_}\n";
}
close IN;
close NAME;
close OUT;

Tip 2:

If you really want to do it by yourself, think about loading the file into memory which is smaller (I guess these are the names). Applying your script to very large data sets may eventually crash your computer.

Tip 3:

Really try to go for Pierre's solution ;-)

ADD REPLY
0
Entering edit mode

Hello fufuyou!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=62682

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY
0
Entering edit mode
8.6 years ago
Zaag ▴ 860

In the shell:

for i in `cat file2` ;
do awk '$1eq"$i" {print } ' file1 >> result ;
done ;
ADD COMMENT
0
Entering edit mode

Using $i like this in awk looks like a bad idea. If both file1 and file2 have too many lines, your solution will take a long time.

ADD REPLY
0
Entering edit mode

True. Didn't know about linux's join command so thanks for that link.

ADD REPLY
0
Entering edit mode

I use this shell only get the name. I cannot get the gene expression values.

ADD REPLY

Login before adding your answer.

Traffic: 1882 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6