Question: Two multifasta files with the same sequences, but different headers. How to change headers to match?
0
gravatar for roblogan6
2.3 years ago by
roblogan630
roblogan630 wrote:

I have a large multifasta file (about 125,000 sequences) and a smaller multifasta file (about 100 sequences). All sequences in the smaller multifasta file are found in the larger file, but the headers are different. I have many (thousands) of such smaller multifasta files. How can I search the larger file for the sequences found in the smaller and then exchange the header? I would ideally be able to print out a smaller multifasta file that would be identical to the one I started with, just with the headers found in the larger file. All sequences in both files have been linearized- that is, they are a single line. Thanks!

multifasta fasta perl • 1.1k views
ADD COMMENTlink modified 2.3 years ago by shenwei3564.5k • written 2.3 years ago by roblogan630
2
gravatar for Eric Lim
2.3 years ago by
Eric Lim1.2k
Boston
Eric Lim1.2k wrote:

The following might work for you.

grep -f other.fa reference.fa -B1 --no-group-separator

or

grep -f other.fa reference.fa -B1 | grep -v -- "^--$"

if --no-group-separator is not available in your version of grep.

Note that this will match substrings, which could be unwanted or undesired depending on your use case.

ADD COMMENTlink written 2.3 years ago by Eric Lim1.2k

Thanks! Is there any way to change the grep command to only print out the headers themselves, instead of also including the associated sequences? I am not very familiar with grep. Thanks again.

ADD REPLYlink written 2.3 years ago by roblogan630
1

You could use grep "^>" , which would get you just the headers.

Looks like the original grep solution worked? I am curious since you had large files.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax62k

I couldn't do it locally because grep runs out of memory, but I can run it successfully over the server that I have access to. I need to learn more about grep, awk, sed, etc. They seem quick, powerful, and really simple. Maybe I am deceiving myself about the simple part though! Do you have any resources to suggest for learning more about these type of commands? Thanks again.

ADD REPLYlink written 2.3 years ago by roblogan630
1
gravatar for shenwei356
2.3 years ago by
shenwei3564.5k
China
shenwei3564.5k wrote:

Solution of grep -B1 by @roblogan6 is very cool, however as he/she said, it matched substrings instead of whole sequences. Besides, sequences both in the big and small files must be in single-line format.

Here's is a robust preciser solution with SeqKit:

Big file:

$ cat big.f
>seq1
ACTACGACGTC
TAGCGTA
>seq2
CGACGATCTAC
GTAGCTAGAT
>seq3
ACGTCTGACGT
>seq4 containing seq3
ACGTACGTCTG
ACGTCC

Small file:

$ cat small.fa
>seq_abc
ACTACGACGTC
TAGCGTA
>seq_123
ACGTCTGACGT

Precisely matching by sequences:

$ seqkit grep -s -i -f <(seqkit seq -s -w 0 small.fa) -w 70 big.fa    
>seq1
ACTACGACGTCTAGCGTA
>seq3
ACGTCTGACGT

Here's the "long-option" version:

seqkit grep --by-seq --ignore-case --pattern-file <(seqkit seq --seq --line-width 0 small.fa) --line-width 70 big.fa
ADD COMMENTlink written 2.3 years ago by shenwei3564.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 651 users visited in the last hour