Imagine I have a basket containing fruits, and I want to create a data frame in R with various pieces of information about these fruits. I would run several rounds of analysis to collect the information: for instance, that I have 2 apples, 1 banana and 3 oranges, and that apples are red, bananas are yellow and oranges are, well... orange. Each round of analysis would be performed by a program that outputs its findings in a simple text format. For instance,
counts.txt would contain:
```
apple	count	2
banana	count	1
orange	count	3
```
A second program would output:

```
apple	colour	red
banana	colour	yellow
orange	colour	orange
```
I would then load the files into an R data frame and reshape it to obtain:
```
name	colour	count
apple	red	2
banana	yellow	1
orange	orange	3
```
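To make the reshape step concrete, here is a minimal sketch of that pivot in plain shell with `awk`, assuming the two files hold tab-separated subject/verb/object triples exactly as in the toy example above (the file names and column values are only those from the example):

```shell
# Recreate the two toy triple files (tab-separated subject/verb/object).
printf 'apple\tcount\t2\nbanana\tcount\t1\norange\tcount\t3\n' > counts.txt
printf 'apple\tcolour\tred\nbanana\tcolour\tyellow\norange\tcolour\torange\n' > colours.txt

# Pivot the long triples into a wide table: one row per subject,
# one column per verb, keeping the order in which each is first seen.
awk -F'\t' '
    !($2 in seen) { seen[$2]; verbs[++nv] = $2 }   # remember column order
    !($1 in rows) { rows[$1]; names[++nn] = $1 }   # remember row order
    { cell[$1, $2] = $3 }                          # store each value
    END {
        printf "name"
        for (v = 1; v <= nv; v++) printf "\t%s", verbs[v]
        print ""
        for (n = 1; n <= nn; n++) {
            printf "%s", names[n]
            for (v = 1; v <= nv; v++) printf "\t%s", cell[names[n], verbs[v]]
            print ""
        }
    }
' colours.txt counts.txt > fruits.tsv
cat fruits.tsv
```

The resulting `fruits.tsv` is a plain header-plus-rows table that `read.delim()` in R (or any CSV/TSV reader in another language) can consume directly, which keeps the format language-neutral.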
Of course, in real life I am doing this with samples for which there are sequence reads in FASTQ files; these have been processed and have produced metadata such as mapping rate, proportion of reads in exons or introns, etc.
I wonder if there is a set of tools or a standard procedure somewhere that does the same but is less ad hoc than my approach. For the data input and output, I have seen "triples" in serialisation formats like Turtle and other RDF syntaxes, or even plain JSON, but they are much more complicated (that is, much harder to produce with the usual Unix command-line tools) than tab-separated triplets of subject, verb, object. For the loading into R, maybe it is trivial enough that it never seemed to deserve a CRAN package.
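As an illustration of how tractable tab-separated triples are with standard Unix tools, here is a sketch that merges the two toy files on the subject column using only `cut`, `sort` and `join` (file names are taken from the example above; no header row is produced, and any intermediate `.tmp` names are arbitrary):

```shell
# Recreate the toy triple files (tab-separated subject/verb/object).
printf 'apple\tcount\t2\nbanana\tcount\t1\norange\tcount\t3\n' > counts.txt
printf 'apple\tcolour\tred\nbanana\tcolour\tyellow\norange\tcolour\torange\n' > colours.txt

TAB="$(printf '\t')"
# Drop the verb column and sort on the subject, as join requires sorted input.
cut -f1,3 colours.txt | sort > colours.tmp
cut -f1,3 counts.txt  | sort > counts.tmp
# Join the two tables on their first field: one wide row per fruit.
join -t "$TAB" colours.tmp counts.tmp > joined.tsv
cat joined.tsv
```

The equivalent in Turtle or JSON would need a real parser at both ends, whereas this stays within POSIX text tools.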
Am I missing something? How do you organise similar work? One of my concerns is that while this workflow is good enough for me to load data into R, I would like an approach that is equally friendly for people programming in other languages.
Edit: I would also be happy with just pointers to other work following the same approach.