Extract specific data from different tables (based on header) and merge it

It's a long post, but I wanted to be as complete as possible.

I would like to extract specific rows from specific columns of different tables, based on the column header. However, my knowledge of both the command line and Python seems too limited to solve it myself.

I have a directory with a growing number of .tsv files (OTU tables). These files all have the same layout:

• Row 1 contains the header; rows 2 to x contain data. The number of columns can differ between files.
• Column 1 always contains the OTU ids.
• The last column always contains the Taxonomy.

Tab-separated versions of these files can be found at the bottom.

For example, file 1 looks like this:

OTU_id 20210801_sampleA 20210801_sampleB 20210801_sampleC Taxonomy
otuA 5 1 0 Hans is cool
otuB 2 0 0 Patricia has blue eyes
otuF 0 6 2 Rutger is actually Bart
otuZ 1 3 0 Felix was lost

File 2 looks like this:

OTU_id 20211111_sampleT 20211111_sampleG 20211111_sampleA Taxonomy
otuA 0 3 3 Hans is cool
otuB 2 1 0 Patricia has blue eyes
otuP 1 6 0 Thomas is great
otuT 2 22 56 Anna and Elsa
otuZ 1 8 4 Felix was lost

Of course, the number of samples and OTUs differs per file. As you can also see, each sample header is prefixed with a date, and within a single file that date is always the same.

What I want to achieve now is the following:

If I have a certain sample name, for example sampleA, I want to extract the results for this sample from all tables (.tsv files) in the directory. Of course sampleA should only be extracted from the files in which it is present.

In the first table it will find sampleA in the second column, and it should extract the rows for which sampleA's value is larger than 0. In the second table, it will find sampleA in the fourth column.

The final table should look like this:

OTU_id 20210801_sampleA 20211111_sampleA Taxonomy
otuA 5 3 Hans is cool
otuB 2 0 Patricia has blue eyes
otuZ 1 4 Felix was lost
otuT 0 56 Anna and Elsa

SampleA has data for otuA, otuB and otuZ in the first file. These rows are extracted (i.e. the OTU_id, the value under 20210801_sampleA and the Taxonomy).

As you will see, otuF was not extracted from the first table, because sampleA has a value of 0 there. otuF isn't present in table 2 either, hence its total absence from the output file.

otuA and otuZ also have values > 0 in table 2, so those values were extracted as well.

otuB has a value of 0 for sampleA in table 2. Normally it would not be extracted, but since it was already extracted from table 1, a 0 is placed in that position.

otuT is only present in table 2 and thus was extracted there (OTU_id, value under 20211111_sampleA and Taxonomy). Because it was not found in table 1, a 0 is placed in that position.

In the results file, the first column should thus also be the OTU_id. Columns 2 to x should contain the data for sampleA, extracted from the files in the directory. The last column should contain the Taxonomy.

I hope you manage to understand everything. It has been bothering me for some time and I lack the skills to figure it out myself. Please let me know if something isn't clear! How it's approached doesn't matter, as long as the final table contains the data for a specific sample across time.

Thanks a lot for your help and sorry to bother you with this!

Cheers Guillaume

Text-formatted table1.tsv:

OTU_id  20210801_sampleA    20210801_sampleB    20210801_sampleC    Taxonomy
otuA    5   1   0   Hans is cool
otuB    2   0   0   Patricia has blue eyes
otuF    0   6   2   Rutger is actually Bart
otuZ    1   3   0   Felix was lost


Text-formatted table2.tsv:

OTU_id  20211111_sampleT    20211111_sampleG    20211111_sampleA    Taxonomy
otuA    0   3   3   Hans is cool
otuB    2   1   0   Patricia has blue eyes
otuP    1   6   0   Thomas is great
otuT    2   22  56  Anna and Elsa
otuZ    1   8   4   Felix was lost


Text-formatted output.tsv:

OTU_id  20210801_sampleA    20211111_sampleA    Taxonomy
otuA    5   3   Hans is cool
otuB    2   0   Patricia has blue eyes
otuZ    1   4   Felix was lost
otuT    0   56  Anna and Elsa
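For reference, the whole extraction described above can be sketched in Python with pandas (a sketch only: the helper name extract_sample is made up, and the two example tables are built in memory here for illustration; in practice each would come from pd.read_csv(path, sep="\t")):

```python
import pandas as pd

def extract_sample(tables, sample):
    """Collect one sample's counts across several OTU tables.

    Keeps only rows where the sample's count is > 0 in that table,
    outer-joins the pieces on OTU_id + Taxonomy, and fills gaps with 0.
    """
    pieces = []
    for df in tables:
        # the sample's column is identified by its "<date>_<sample>" header
        cols = [c for c in df.columns if c.endswith(f"_{sample}")]
        if not cols:
            continue  # this table has no data for the sample
        sub = df[["OTU_id", cols[0], "Taxonomy"]]
        pieces.append(sub[sub[cols[0]] > 0])
    merged = pieces[0]
    for piece in pieces[1:]:
        merged = merged.merge(piece, on=["OTU_id", "Taxonomy"], how="outer")
    value_cols = [c for c in merged.columns if c not in ("OTU_id", "Taxonomy")]
    merged[value_cols] = merged[value_cols].fillna(0).astype(int)
    # put Taxonomy last, as in the example output
    return merged[["OTU_id", *value_cols, "Taxonomy"]]

# The two example tables from the post, built in memory for illustration
table1 = pd.DataFrame({
    "OTU_id": ["otuA", "otuB", "otuF", "otuZ"],
    "20210801_sampleA": [5, 2, 0, 1],
    "20210801_sampleB": [1, 0, 6, 3],
    "20210801_sampleC": [0, 0, 2, 0],
    "Taxonomy": ["Hans is cool", "Patricia has blue eyes",
                 "Rutger is actually Bart", "Felix was lost"],
})
table2 = pd.DataFrame({
    "OTU_id": ["otuA", "otuB", "otuP", "otuT", "otuZ"],
    "20211111_sampleT": [0, 2, 1, 2, 1],
    "20211111_sampleG": [3, 1, 6, 22, 8],
    "20211111_sampleA": [3, 0, 0, 56, 4],
    "Taxonomy": ["Hans is cool", "Patricia has blue eyes",
                 "Thomas is great", "Anna and Elsa", "Felix was lost"],
})
result = extract_sample([table1, table2], "sampleA")
```

Writing the result tab-separated would then be result.to_csv("output.tsv", sep="\t", index=False), which keeps spaces, semicolons and commas in the Taxonomy field intact.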

text-processing python command-line

Which program outputs the data this way? Is this the output of a custom script?


The TSV tables are generated that way by QIIME2, an analysis pipeline (which we run on Linux) for analyzing DNA sequence data. Basically, each table tells you how many times a certain DNA sequence (called an OTU) is found in a certain sample, and the name of the microbe it potentially originates from (Taxonomy).

I just used some made-up names for Taxonomy in the example. In reality this will be, for example, Escherichia coli.

I didn't manage to nicely align the fields here, but they are tab-separated in all the input files, and it would be nice if the output were tab-separated too, because the Taxonomy field can contain spaces, semicolons and commas.


QIIME2 has a merge function for OTU tables. I hope you can try it.


Indeed, but that takes the entire OTU tables and merges them, while we want to extract specific samples from different tables and merge only those.

Currently we create an OTU table with ~100 columns once a week. If a certain sample is analyzed 10 weeks in a row, merging first would create a table with ~1000 columns, from which we would still need to extract the 10 columns belonging to that specific sample. So at the moment we basically re-analyze the raw data of a specific sample from the different runs to build an OTU table for just that sample over time. But that's not handy.


Check if this is what you expect. I named the two files 'a.txt' and 'b.txt' for the code. In R:

> library(readr)
> library(purrr)
> library(dplyr)
> filelist=list.files(pattern = "a|b") ## change the pattern as per your requirements
> df=lapply(filelist, function(x) read_tsv(x,  col_select = c(1,last_col(),ends_with("SampleA")), show_col_types = F))
> df %>%
+     reduce(full_join, by = c("OTU_id","Taxonomy")) %>%
+     filter(rowSums(.[,-c(1,2)], na.rm = T)!=0)
# A tibble: 4 × 4
  OTU_id Taxonomy               20210801_sampleA 20211111_sampleA
  <chr>  <chr>                             <dbl>            <dbl>
1 otuA   Hans is cool                          5                3
2 otuB   Patricia has blue eyes                2                0
3 otuZ   Felix was lost                        1                4
4 otuT   Anna and Elsa                        NA               56


For your case, change the filelist object to filelist=list.files(pattern = "\\.tsv$"). This reads all the files, since there is no way to know beforehand which files lack a sampleA column. That does not affect the outcome, though: files without a sampleA column contribute only OTU_id and Taxonomy, their rows are joined in with NA values, and the rowSums filter then drops any row that has no counts at all.

If you have many files, you can try parallelizing the reads in R (load the parallel library in addition to the libraries above):

library(parallel)
numCores <- detectCores() - 2
df <- mclapply(filelist, function(x) read_tsv(x, col_select = c(1, last_col(), ends_with("sampleA")), show_col_types = F), mc.cores = numCores)
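The same parallel-reading idea can be sketched in Python (a sketch only: read_tables_parallel is a made-up helper, and a thread pool is used because reading files is I/O-bound):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def read_tables_parallel(paths, max_workers=4):
    """Read a list of TSV files concurrently into DataFrames.

    Returns the DataFrames in the same order as the input paths.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: pd.read_csv(p, sep="\t"), paths))
```

Column selection (keeping only OTU_id, the sampleA column and Taxonomy) can then be done per DataFrame after reading, as in the merging sketch above.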


Tested and approved. It works like a charm! I also tested it on the real data and it works perfectly. I will test it on a big dataset tomorrow, which will also let me test the parallel version, but I don't expect issues.

I also figured out how to export it:

> df %>%
+     reduce(full_join, by = c("OTU_id","Taxonomy")) %>%
+     filter(rowSums(.[,-c(1,2)], na.rm = T)!=0) -> output
> write_tsv(output, "test.tsv")


Many thanks again, this will save me a ton of time!