It's a long post, but I wanted to be as complete as possible.
I would like to extract specific rows from specific columns from different tables, based on the column header. However, my knowledge seems to be too limited to solve it myself (both in command-line and python).
I have a directory with a growing number of .tsv files (OTU tables). These files all have the same layout:
- Row 1 contains headers.
- Rows 2 - x contain data. The number of columns can differ between files.
- Column 1 always contains the OTU ids.
- The last column always contains the Taxonomy.
text-formatted files can be found at the bottom
For example, file 1 looks like this:
|otuA||5||1||0||Hans is cool|
|otuB||2||0||0||Patricia has blue eyes|
|otuF||0||6||2||Rutger is actually Bart|
|otuZ||1||3||0||Felix was lost|
File 2 looks like this:
|otuA||0||3||3||Hans is cool|
|otuB||2||1||0||Patricia has blue eyes|
|otuP||1||6||0||Thomas is great|
|otuT||2||22||56||Anna and Elsa|
|otuZ||1||8||4||Felix was lost|
Of course, the number of samples and otus are different per file. As you can also see, the headers of the samples are always preceded by a date. In a certain file, this is always the same date.
What I want to achieve now is the following:
If I have a certain sample name, for example sampleA, I want to extract the results for this sample from all tables (.tsv files) in the directory. Of course sampleA should only be extracted from the files in which it is present.
In the first table it will find sampleA in the second column and it should take over the results for which the value in the below rows is larger than 0. In the second table, it will find sampleA in the 4th column.
The final table should look like this:
|otuA||5||3||Hans is cool|
|otuB||2||0||Patricia has blue eyes|
|otuZ||1||4||Felix was lost|
|otuT||0||56||Anna and Elsa|
SampleA has data for otuA,B,Z in the first file. These results are taken over (so the OTU_id, the values for header 20210801_sampleA and the Taxonomy).
As you will see, otuF was not extracted from the first table, because sampleA had a value 0 there. otuF also isn't present in table 2, so hence the total absence in the output file.
otuA and otuZ also had values > 0 in table 2, so these values were taken over.
otuB had a value of 0 for sampleA in table 2. Normally it would not extract it, but since it had been extracted from table 1, a 0 is added in that position.
otuT is only present in table 2 and thus was extracted there (OTU_id, value for 20211111_sampleA and Taxonomy). Because it was not found in table 1, a 0 is added in that position.
In the results file, the first column should thus also be the OTU_id. columns 2 to x should contain the data for sampleA, extracted from the files in the directory. The last column should contain the Taxonomy.
I hope you manage to understand everything. It has been bothering me for some time and lack the skills to figure it out. Please let me know if something isn't clear! How it's approached doesn't matter. As long as the final table contains the data from a specific sample across time.
Thanks a lot for your help and sorry to bother you with this!
Text-formatted table 1.tsv:
OTU_id 20210801_sampleA 20210801_sampleB 20210801_sampleC Taxonomy otuA 5 1 0 Hans is cool otuB 2 0 0 Patricia has blue eyes otuF 0 6 2 Rutger is actually Bart otuZ 1 3 0 Felix was lost
OTU_id 20211111_sampleT 20211111_sampleG 20211111_sampleA Taxonomy otuA 0 3 3 Hans is cool otuB 2 1 0 Patricia has blue eyes otuP 1 6 0 Thomas is great otuT 2 22 56 Anna and Elsa otuZ 1 8 4 Felix was lost
OTU_id 20210801_sampleA 20211111_sampleA Taxonomy otuA 5 3 Hans is cool otuB 2 0 Patricia has blue eyes otuZ 1 4 Felix was lost otuT 0 56 Anna and Elsa