Hello, I have a task that seems like it should be simple, but I haven't found a solution yet. I have several thousand FASTA files, each containing an alignment of 30 samples. The header of each entry is the sample name, and every file contains the same 30 samples. I would like to concatenate the sequences across all files so that I end up with one FASTA file containing the 30 samples, each with its sequences joined end to end. For example:
Gene1.fasta:
>Sample1
CCCCCCCCC
>Sample2
AAAAAAAAA

Gene2.fasta:
>Sample1
TTTTTTTTTTTTTTT
>Sample2
GGGGGGGGGGGGGGG

Desired output (AllGenes.fasta):
>Sample1
CCCCCCCCCTTTTTTTTTTTTTTT
>Sample2
AAAAAAAAAGGGGGGGGGGGGGGG
So far the only solution I have come up with is this:
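for sample in Sample1 Sample2; do
    # start one temporary file per sample with its header
    echo ">$sample" > "$sample".temp.fasta
    for gene in Gene1 Gene2; do
        # append this sample's sequence from each gene, dropping the header line
        seqkit grep -p "$sample" "$gene".fasta | grep -v ">" >> "$sample".temp.fasta
    done
done
cat *.temp.fasta > AllGenes.fasta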
This works, but it seems terribly inefficient for thousands of genes, since every gene file is read once per sample. Is there a better way?
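
One direction that might avoid the repeated reads is a single awk pass over all files, appending each sequence line to a per-sample buffer keyed by the header. The following is only a minimal sketch, not a tested solution for real data: the function name concat_fasta is made up for illustration, and it assumes the header line is exactly the sample name, that every file contains the same samples, and that all concatenated sequences fit in memory.

```shell
# Hypothetical helper: concatenate per-sample sequences across FASTA files.
# Assumes each header is exactly ">SampleName"; wrapped sequence lines are joined.
concat_fasta() {
  awk '
    /^>/ {
      name = substr($0, 2)                    # sample name without the leading ">"
      if (!(name in seq)) order[++n] = name   # remember first-seen sample order
      next
    }
    { seq[name] = seq[name] $0 }              # append sequence line to its sample
    END {
      for (i = 1; i <= n; i++)
        print ">" order[i] "\n" seq[order[i]]
    }
  ' "$@"
}
```

It could be called as `concat_fasta Gene*.fasta > AllGenes.fasta`. Genes are joined in the order the files are passed, and shell globs sort lexically (Gene10.fasta comes before Gene2.fasta), so the file list may need explicit ordering if gene order matters. The output name should also not match the input glob.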