Hi there, first time poster here!
I'm trying to compare two series of multiple sequence alignments in fasta format by header ID. I've dug around the archives, and I haven't found anything quite like what I need to do. File1.fasta contains 65 sequences, and File2.fasta contains 188 sequences. I need to trim File2.fasta so that it only contains the 65 sequences whose headers match those in File1.fasta. The headers match between the two files, but the sequences themselves do not.
For example:
If File1.fasta looks like this:
>Species1|sequenceID1|1
GIYFEANGHGTILNETVGDAIADLLGT---WLNLYTDLPQRQLKVAVKDRTM---IQT-TDAERRC-
>Species2|m.5506|0
GVYAEANGHATFLNSKTGDAICDLFLI---WQATYTPLSYRQKKLTIPDRYA---VVT-TDADRRV-
And File2.fasta looks like this:
>Species1|sequenceID1|1
MSDIAKE-----KLA--ALMLEFPKPA-----------GVILGYGTAGFRSRAD----ILPWIMIRIGILASLRSKVKQ-A-----
>Species1|sequenceID2|0
MSDIAKE-----KLA--ALMLEFPKPA-----------GVILGYGTAGFRSRAD----ILPWIMIRIGILASLRSKVKQ-A-----
>Species1|sequenceID3|0
MTHIVKE-----KLM--TLLLEYPKPE-----------GIIMGYGTAGFRARAD----TLPWIMIRIGLLAALRSKVKR-A-----
>Species2|m.6218|1
--------------------------------------GVEFHYGTSGFRMHAD----RLDTVIFRMGLLSALRSQAIGGK-----
>Species2|m.5506|0
-----------------------------------------------------------------DIHFHYGTAGFRSPAK----
I want the output to be this:
>Species1|sequenceID1|1
MSDIAKE-----KLA--ALMLEFPKPA-----------GVILGYGTAGFRSRAD----ILPWIMIRIGILASLRSKVKQ-A-----
>Species2|m.5506|0
-----------------------------------------------------------------DIHFHYGTAGFRSPAK-----
It looks as though SeqIO in BioPerl or Biopython could do it tidily, but I'm not quite there with my programming skills.
Any help would be deeply appreciated!
Use grep ">" to pull all the header lines from file 1 into a new file, then run grep -A1 -F -f with that new file as the pattern list against file 2. Google these parameters to know what they do and how they are used.
Using grep is a fine solution, but I want to point out that grep -A1 will only return the first line after the matched header line. I'm assuming the OP has some multiline sequences in her FASTA file.
You are right. Mine won't work if it is a multiline fasta file.
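For wrapped (multiline) FASTA files, the same header-matching idea can be sketched in plain Python without any extra libraries. This is only a minimal sketch, not the approach from the answer above: the function names read_fasta and filter_by_headers are made up for illustration, headers are compared as whole strings (grep -F matches substrings, so a header ending in |1 would also hit one ending in |10), and records come out in File2 order.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_headers(wanted_lines, source_lines):
    """Keep records from source whose full header also appears in wanted."""
    wanted = {header for header, _ in read_fasta(wanted_lines)}
    return [(h, s) for h, s in read_fasta(source_lines) if h in wanted]


def write_filtered(wanted_path, source_path, out_path):
    """Hypothetical helper: read two FASTA files, write the matching records."""
    with open(wanted_path) as wanted, open(source_path) as source:
        records = filter_by_headers(wanted, source)
    with open(out_path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n{seq}\n")
```

Assuming the file names from the question, a call would look like write_filtered("File1.fasta", "File2.fasta", "File2.trimmed.fasta"), where the output name is just a placeholder.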
Thanks! I was overcomplicating it and knew there should be a straightforward solution. I have already run the files through a program to remove newlines, so this should be ideal.
What do you mean by "remove newlines"? The fasta sequence should be contained within a single line for "-A 1" to work. That's it.
I think "remove newlines" was just in the context of the sequence portion of the file, i.e. turning multiline sequences into single lines.
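For anyone who has not already linearized their file, the "remove newlines" step could look roughly like the sketch below. It is an assumption about what such a tool does, not the program the OP actually used, and unwrap_fasta is a made-up name: header lines are kept as they are, and all sequence lines belonging to one record are joined into a single line.

```python
def unwrap_fasta(lines):
    """Return FASTA lines with each record's sequence joined onto one line."""
    out, chunks = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Flush the sequence collected for the previous record.
            if chunks:
                out.append("".join(chunks))
                chunks = []
            out.append(line)
        else:
            chunks.append(line)
    if chunks:
        out.append("".join(chunks))
    return out
```

After this step every header is followed by exactly one sequence line, which is the layout that grep -A1 relies on.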