I have a data file A.tsv (field separator = \t):
id clade mutation
243 40A titi,xixi,lolo
254 20B titi,toto,jiji,lala
261
267 20B lala,jiji,jojo
and a template file B.tsv (field separator = \t):
40A titi,toto,lala
40F xaxa,jojo,huhu
40C sasa,sisi,lala
Based on their common column (clade), I want to compare the mutations in A.tsv against the template B.tsv. When the clade in A.tsv is 20B:
- If the mutations on that line of A.tsv include all the mutations of 40A in B.tsv, print the clade 40A in a new column named Conclusion (after the last column of A.tsv).
- It's not a problem if the 20B line in A.tsv contains other mutations besides those of 40A in B.tsv.
- If the 20B line in A.tsv doesn't contain all the mutations of 40A in B.tsv, don't print anything in the Conclusion column.
The result (stored in a new file C.tsv) should look like this:
id clade mutation Conclusion
243 40A titi,xixi,lolo
254 20B titi,toto,jiji,lala 40A
261
267 20B lala,jiji,jojo
I started with this:
awk 'BEGIN{ OFS=FS="\t" }
NR==FNR{ clade[$1]=$2; next }
FNR==1{ print $0, "Conclusion"; next }
!($2 in clade){ print; next }
{
XXXXXXXXX
}
' B.tsv A.tsv > C.tsv
but I don't know how to do the rest (the XXXXXXXXX part). Do you have any ideas? Thanks
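For reference, here is a minimal sketch of one way the missing block could look. It is not a definitive solution. It assumes the 20B/40A pairing is fixed, and passes it in through two hypothetical variables, `src` and `tmpl`, which are not part of the attempt above. It also replaces the `!($2 in clade)` guard with a test on `src`, because 20B is not a key in B.tsv, so that guard would print every 20B row before it reached the final block. The two `printf` lines only recreate the sample files shown above.

```shell
# Recreate the sample files from the question (tab-separated).
printf 'id\tclade\tmutation\n243\t40A\ttiti,xixi,lolo\n254\t20B\ttiti,toto,jiji,lala\n261\n267\t20B\tlala,jiji,jojo\n' > A.tsv
printf '40A\ttiti,toto,lala\n40F\txaxa,jojo,huhu\n40C\tsasa,sisi,lala\n' > B.tsv

# src and tmpl are assumed names for the fixed pairing (20B rows vs. 40A template).
awk -v src="20B" -v tmpl="40A" 'BEGIN{ OFS=FS="\t" }
NR==FNR{ clade[$1]=$2; next }               # B.tsv: clade -> mutation list
FNR==1{ print $0, "Conclusion"; next }      # header row of A.tsv
$2!=src || !(tmpl in clade){ print; next }  # rows that are not tested pass through
{
    # Index the mutations of this row for membership tests.
    split("", have)
    n = split($3, m, ",")
    for (i = 1; i <= n; i++) have[m[i]] = 1
    # Count template mutations that this row lacks.
    missing = 0
    k = split(clade[tmpl], need, ",")
    for (i = 1; i <= k; i++) if (!(need[i] in have)) missing++
    # Append the template clade only when nothing is missing.
    if (missing == 0) print $0, tmpl; else print
}
' B.tsv A.tsv > C.tsv
```

Rows that do not match are printed unchanged, so they get no trailing tab for the empty Conclusion field. If a trailing tab is wanted there, the final `else print` could become `else print $0, ""`.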