Question

compare and trim fasta files by header ID

0

Entering edit mode

9.2 years ago

joie.cannon • 0

Hi there, first time poster here!

I'm trying to compare two series of multiple sequence alignments in fasta format by header ID. I've dug around the archives, and I haven't found anything quite like what I need to do. File1.fasta contains 65 sequences, and File2.fasta contains 188 sequences. I need to trim File2.fasta so that it only contains the 65 sequences that match the headers in File1.fasta. The headers will match, but the sequences themselves do not.

For example:

If File1.fasta looks like this:

>Species1|sequenceID1|1
GIYFEANGHGTILNETVGDAIADLLGT---WLNLYTDLPQRQLKVAVKDRTM---IQT-TDAERRC-
>Species2|m.5506|0
GVYAEANGHATFLNSKTGDAICDLFLI---WQATYTPLSYRQKKLTIPDRYA---VVT-TDADRRV-

And File2.fasta looks like this:

>Species1|sequenceID1|1
MSDIAKE-----KLA--ALMLEFPKPA-----------GVILGYGTAGFRSRAD----ILPWIMIRIGILASLRSKVKQ-A-----
>Species1|sequenceID2|0
MSDIAKE-----KLA--ALMLEFPKPA-----------GVILGYGTAGFRSRAD----ILPWIMIRIGILASLRSKVKQ-A-----
>Species1|sequenceID3|0
MTHIVKE-----KLM--TLLLEYPKPE-----------GIIMGYGTAGFRARAD----TLPWIMIRIGLLAALRSKVKR-A-----
>Species2|m.6218|1
--------------------------------------GVEFHYGTSGFRMHAD----RLDTVIFRMGLLSALRSQAIGGK-----
>Species2|m.5506|0
-----------------------------------------------------------------DIHFHYGTAGFRSPAK----

I want the output to be this:

>Species1|sequenceID1|1
MSDIAKE-----KLA--ALMLEFPKPA-----------GVILGYGTAGFRSRAD----ILPWIMIRIGILASLRSKVKQ-A-----
>Species2|m.5506|0
-----------------------------------------------------------------DIHFHYGTAGFRSPAK-----

It looks as though SeqIO in BioPerl or Biopython could do it tidily, but I'm not quite there with my programming skills.

Any help would be deeply appreciated!

fasta biopython bioperl • 2.3k views

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by joie.cannon • 0

1

Entering edit mode

Grep grep > all the header lines from file 1 into a new file.
Use grep -A1 -F -f. Google these parameters to know what they do and how they are used.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Using grep is a fine solution, but I want to point out that grep -A1 will only return the first line after the matched header line. I'm assuming the OP has some multiline sequences in her FASTA file.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Matt Shirley 10k

0

Entering edit mode

You are right. Mine won't work if it is a multiline fasta file.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Thanks! I was overcomplicating it and knew there should be a straightforward solution. I have already run the files through a program to remove newlines, so this should be ideal.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by joie.cannon • 0

0

Entering edit mode

What you mean by "remove newlines"? The fasta sequence should be contained within a single line for "-A 1" to work. That's it.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Ashutosh Pandey 12k

0

Entering edit mode

I think "remove newlines" was just in the context to the sequence portion of the file - to make multiline sequences into single lines.

ADD REPLY • link 9.2 years ago by Matt Shirley 10k

Ram · Answer 1 · 2015-03-06

$ pip install --user pyfaidx
$ faidx -i File1.fasta | cut -f1 | xargs faidx File2.fasta > File3.fasta

This command will grab a list of the sequence names in File1, one per line, then pass those back to faidx as regions to grab from File2, and send the results to File3.

pyfaidx also has an object-oriented interface you can use within a Python script for when you outgrow the "faidx" utility.

Ram · Answer 2 · 2015-03-06

Hello, go like this

grep '^>' file1.fasta > file1_headers.txt
Download 'faSomeRecords' from the link http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ and change its mode by chmod +x faSomeRecords
./faSomeRecords file2.fasta file1_headers.txt newFile2.fasta

Now your newFile2.fasta contains 65 sequences with the id from file1.fasta and sequence information of file2.fasta

Ram · Answer 3 · 2015-03-06

This only works IF 1) there are no linebreaks in the seqs and 2) there are no tabs in the headers. The tab in -t " " is a literal one (ctrl+v+tab).

join -1 1 -2 1 -t "       " -o 2.1,2.2 <(cat File1.fasta | tr "\n" "\t" | tr ">" "\n" | sort -k1,1) <(cat File2.fasta | tr "\n" "\t" | tr ">" "\n" | sort -k1,1) | awk 'BEGIN {FS="\t"} {print ">"$1"\n"$2}' | tail -n +3

>Species1|sequenceID1|1
MSDIAKE-----KLA--ALMLEFPKPA-----------GVILGYGTAGFRSRAD----ILPWIMIRIGILASLRSKVKQ-A-----
>Species2|m.5506|0
-----------------------------------------------------------------DIHFHYGTAGFRSPAK----