I have a large multifasta file (about 125,000 sequences) and a smaller multifasta file (about 100 sequences). All sequences in the smaller multifasta file are found in the larger file, but the headers are different. I have thousands of such smaller multifasta files. How can I search the larger file for the sequences found in the smaller one and then exchange the headers? Ideally, I would be able to print out a smaller multifasta file identical to the one I started with, just with the headers found in the larger file. All sequences in both files have been linearized; that is, each sequence is on a single line. Thanks!
The following might work for you.
grep -f other.fa reference.fa -B1 --no-group-separator
grep -f other.fa reference.fa -B1 | grep -v -- "^--$"
Use the second command if --no-group-separator is not available in your version of grep.
Note that this will match substrings, which may be undesirable depending on your use case: a short sequence in other.fa will also pull out any longer sequence in reference.fa that contains it.
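If substring hits are a problem, one possible variant is to ask grep for whole-line, fixed-string matches instead. The following is only a minimal sketch, assuming both files are linearized and that your grep supports -B (GNU and BSD grep do). The sample sequences and the file name patterns.txt are made up for illustration.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)

# Made-up linearized inputs: ref1 contains the query ACGT as a substring.
printf '>ref1\nACGTACGT\n>ref2\nACGT\n>ref3\nTTGCA\n' > "$workdir/reference.fa"
printf '>q1\nACGT\n>q2\nTTGCA\n' > "$workdir/other.fa"

# Keep only the sequence lines of other.fa as search patterns.
grep -v '^>' "$workdir/other.fa" > "$workdir/patterns.txt"

# -F: treat patterns as fixed strings, -x: pattern must match the whole line,
# -B1: also print the header line before each matching sequence line.
result=$(grep -B1 -x -F -f "$workdir/patterns.txt" "$workdir/reference.fa" | grep -v -- '^--$')
printf '%s\n' "$result"
```

With this sample data, ref1 should no longer be reported even though it contains ACGT, because -x requires the whole line to match. Matching is still case-sensitive here.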
The grep -B1 approach by @roblogan6 is very cool; however, as noted, it matches substrings instead of whole sequences. In addition, the sequences in both the big and small files must be in single-line format.
Here's a more robust and precise solution with SeqKit:
$ cat big.fa
>seq1
ACTACGACGTC
TAGCGTA
>seq2
CGACGATCTAC
GTAGCTAGAT
>seq3
ACGTCTGACGT
>seq4 containing seq3
ACGTACGTCTG
ACGTCC
$ cat small.fa
>seq_abc
ACTACGACGTC
TAGCGTA
>seq_123
ACGTCTGACGT
Exact matching by sequence:
$ seqkit grep -s -i -f <(seqkit seq -s -w 0 small.fa) -w 70 big.fa
>seq1
ACTACGACGTCTAGCGTA
>seq3
ACGTCTGACGT
Here's the "long-option" version:
seqkit grep --by-seq --ignore-case --pattern-file <(seqkit seq --seq --line-width 0 small.fa) --line-width 70 big.fa
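Both approaches above print records in the order they appear in the big file. If the output should follow the order of the small file, with only the headers swapped, a sequence-to-header lookup in awk is one option. This is a rough sketch rather than a tested pipeline: it assumes both files are already linearized, compares sequences case-sensitively, and keeps the original header when a sequence is not found in the big file. The sample data mirrors the example above, written on single lines.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)

# Linearized versions of the example files (one sequence per line).
printf '>seq1\nACTACGACGTCTAGCGTA\n>seq2\nCGACGATCTACGTAGCTAGAT\n>seq3\nACGTCTGACGT\n>seq4 containing seq3\nACGTACGTCTGACGTCC\n' > "$workdir/big.fa"
printf '>seq_abc\nACTACGACGTCTAGCGTA\n>seq_123\nACGTCTGACGT\n' > "$workdir/small.fa"

swapped=$(awk '
    # First file (big.fa): map each sequence line to the header before it.
    NR == FNR {
        if (/^>/) { header = $0 } else { big[$0] = header }
        next
    }
    # Second file (small.fa): hold the original header until the sequence is read.
    /^>/ { old = $0; next }
    # Print the big-file header if the exact sequence is known, else the old one.
    { print (($0 in big) ? big[$0] : old); print }
' "$workdir/big.fa" "$workdir/small.fa")
printf '%s\n' "$swapped"
```

Because the lookup is keyed on the whole sequence line, seq4 is not confused with seq3 here. If the same sequence occurs more than once in the big file, the last header seen is used. For thousands of small files, the same awk program could be run in a loop over the file names, at the cost of re-reading the big file each time.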