Question

Converting column data to matrix

0

Entering edit mode

6.5 years ago

mittu1602 ▴ 200

I am trying to create a matrix of plant traits and plant species. There are 2,912,746 rows in the data and 3 columns. There are different numbers of traits for each species, and not every species has every trait. The data format is tab delimited.

Current format--

Species   Trait      Value
Species_1 SLA        4
Species_1 Photopath  C3
Species_1 Mycorrhiza AMF
Species_2 SLA        3 
Species_2 Growth     10

Desired format--

           SLA  Photopath         Mycorrhiza          Growth
Species_1   4   C3                  AMF
Species_2   3                        -                  10

I need a perl or python script. Thank you!

genome ngs • 1.4k views

ADD COMMENT • link updated 6.5 years ago by JC 13k • written 6.5 years ago by mittu1602 ▴ 200

0

Entering edit mode

Python 2.7.12 using pandas:

>>> import pandas as nd
>>> df=nd.read_csv("test.txt", sep="\t")
>>> nd.crosstab(df.Species, df.Trait, df.Value, aggfunc=" ".join)

Trait     Growth Mycorrhiza Photopath SLA
Species                                  
Species_1    NaN        AMF        C3   4
Species_2     10        NaN       NaN   3

>>> df
     Species       Trait Value
0  Species_1         SLA     4
1  Species_1   Photopath    C3
2  Species_1  Mycorrhiza   AMF
3  Species_2         SLA     3
4  Species_2      Growth    10

In GNU-linux using datamash in bash shell :

$ datamash  -H  crosstab 1,2 unique 3  < test.txt  |  sed -e '1d'
        Growth  Mycorrhiza  Photopath   SLA
Species_1       AMF               C3    4
Species_2   10                          3

Insert - wherever data is not available

$  datamash  -H  crosstab 1,2 unique 3 --filler - < test.txt  |  sed -e '1d'
          Growth    Mycorrhiza  Photopath   SLA
Species_1   -        AMF              C3    4
Species_2   10        -                -    3

datamash is available in most of the linux repos and cross check the output.

ADD REPLY • link 6.5 years ago by cpad0112 21k

0

Entering edit mode

Thank you I shall try it

ADD REPLY • link 6.5 years ago by mittu1602 ▴ 200

score 0 · Answer 1 · 2017-10-31

0

Entering edit mode

6.5 years ago

JC 13k

#!/usr/bin/perl
use strict;
use warnings;
my %species =();
my %traits =();
my %data =();
while (<>) {
    chomp;
    next if (/Species\s+Trait/); # Skip first line
    my ($sp, $tr, $vl) = split (/\s+/, $_);
    $species{$sp}++;
    $traits{$tr}++;
    $data{$sp}{$tr} = $vl;
}
my @traits = sort keys %traits;
my @species = sort keys %species;
print join "\t", "Species", @traits; # header
print "\n";
foreach my $sp (@species) {
    print $sp;
    foreach my $tr (@traits) {
         my $val = "-"; # missing values
         $val = $data{$sp}{$tr} if (defined $data{$sp}{$tr});
         print "\t$val";
    }
    print "\n";
}

Use:

perl convert.pl < input > output

_Hashes are your friends_

ADD COMMENT • link 6.5 years ago by JC 13k

0

Entering edit mode

Hi thank you for your reply, the code exactly works the way I want it except it does not go to its respective column. it looks like:

                   SLA  Photopath         Mycorrhiza          Growth
Species_1   4   
                  C3                  
                  AMF

can u please help me fixing this.

ADD REPLY • link 6.5 years ago by mittu1602 ▴ 200

0

Entering edit mode

Your text table is not using TABs, I changed the code to parse by spaces.

ASUNA:~$ cat in
Species   Trait      Value
Species_1 SLA        4
Species_1 Photopath  C3
Species_1 Mycorrhiza AMF
Species_2 SLA        3
Species_2 Growth     10
ASUNA:~$ perl convert.pl < in
Species Growth  Mycorrhiza      Photopath       SLA
Species_1       -       AMF     C3      4
Species_2       10      -       -       3