Converting column data to matrix
1
0
Entering edit mode
6.5 years ago
mittu1602 ▴ 200

I am trying to create a matrix of plant traits and plant species. There are 2,912,746 rows in the data and 3 columns. There are different numbers of traits for each species, and not every species has every trait. The data format is tab delimited.

Current format--

Species   Trait      Value
Species_1 SLA        4
Species_1 Photopath  C3
Species_1 Mycorrhiza AMF
Species_2 SLA        3 
Species_2 Growth     10

Desired format--

           SLA  Photopath         Mycorrhiza          Growth
Species_1   4   C3                  AMF
Species_2   3                        -                  10

I need a perl or python script. Thank you!

genome ngs • 1.4k views
ADD COMMENT
0
Entering edit mode

Python 2.7.12 using pandas:

>>> import pandas as nd
>>> df=nd.read_csv("test.txt", sep="\t")
>>> nd.crosstab(df.Species, df.Trait, df.Value, aggfunc=" ".join)

Trait     Growth Mycorrhiza Photopath SLA
Species                                  
Species_1    NaN        AMF        C3   4
Species_2     10        NaN       NaN   3

>>> df
     Species       Trait Value
0  Species_1         SLA     4
1  Species_1   Photopath    C3
2  Species_1  Mycorrhiza   AMF
3  Species_2         SLA     3
4  Species_2      Growth    10

In GNU-linux using datamash in bash shell :

$ datamash  -H  crosstab 1,2 unique 3  < test.txt  |  sed -e '1d'
        Growth  Mycorrhiza  Photopath   SLA
Species_1       AMF               C3    4
Species_2   10                          3

Insert - wherever data is not available

$  datamash  -H  crosstab 1,2 unique 3 --filler - < test.txt  |  sed -e '1d'
          Growth    Mycorrhiza  Photopath   SLA
Species_1   -        AMF              C3    4
Species_2   10        -                -    3

datamash is available in most of the linux repos and cross check the output.

ADD REPLY
0
Entering edit mode

Thank you I shall try it

ADD REPLY
0
Entering edit mode
6.5 years ago
JC 13k
#!/usr/bin/perl
use strict;
use warnings;
my %species =();
my %traits =();
my %data =();
while (<>) {
    chomp;
    next if (/Species\s+Trait/); # Skip first line
    my ($sp, $tr, $vl) = split (/\s+/, $_);
    $species{$sp}++;
    $traits{$tr}++;
    $data{$sp}{$tr} = $vl;
}
my @traits = sort keys %traits;
my @species = sort keys %species;
print join "\t", "Species", @traits; # header
print "\n";
foreach my $sp (@species) {
    print $sp;
    foreach my $tr (@traits) {
         my $val = "-"; # missing values
         $val = $data{$sp}{$tr} if (defined $data{$sp}{$tr});
         print "\t$val";
    }
    print "\n";
}

Use:

perl convert.pl < input > output

_Hashes are your friends_

ADD COMMENT
0
Entering edit mode

Hi thank you for your reply, the code exactly works the way I want it except it does not go to its respective column. it looks like:

                   SLA  Photopath         Mycorrhiza          Growth
Species_1   4   
                  C3                  
                  AMF

can u please help me fixing this.

ADD REPLY
0
Entering edit mode

Your text table is not using TABs, I changed the code to parse by spaces.

ASUNA:~$ cat in
Species   Trait      Value
Species_1 SLA        4
Species_1 Photopath  C3
Species_1 Mycorrhiza AMF
Species_2 SLA        3
Species_2 Growth     10
ASUNA:~$ perl convert.pl < in
Species Growth  Mycorrhiza      Photopath       SLA
Species_1       -       AMF     C3      4
Species_2       10      -       -       3
ADD REPLY
0
Entering edit mode

Thank you so much it worked perfectly.

ADD REPLY

Login before adding your answer.

Traffic: 2580 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6