compare and trim fasta files by header ID
3
0
Entering edit mode
9.2 years ago

Hi there, first time poster here!

I'm trying to compare two series of multiple sequence alignments in fasta format by header ID. I've dug around the archives, and I haven't found anything quite like what I need to do. File1.fasta contains 65 sequences, and File2.fasta contains 188 sequences. I need to trim File2.fasta so that it only contains the 65 sequences that match the headers in File1.fasta. The headers will match, but the sequences themselves do not.

For example:

If File1.fasta looks like this:

>Species1|sequenceID1|1
GIYFEANGHGTILNETVGDAIADLLGT---WLNLYTDLPQRQLKVAVKDRTM---IQT-TDAERRC-
>Species2|m.5506|0
GVYAEANGHATFLNSKTGDAICDLFLI---WQATYTPLSYRQKKLTIPDRYA---VVT-TDADRRV-

And File2.fasta looks like this:

>Species1|sequenceID1|1
MSDIAKE-----KLA--ALMLEFPKPA-----------GVILGYGTAGFRSRAD----ILPWIMIRIGILASLRSKVKQ-A-----
>Species1|sequenceID2|0
MSDIAKE-----KLA--ALMLEFPKPA-----------GVILGYGTAGFRSRAD----ILPWIMIRIGILASLRSKVKQ-A-----
>Species1|sequenceID3|0
MTHIVKE-----KLM--TLLLEYPKPE-----------GIIMGYGTAGFRARAD----TLPWIMIRIGLLAALRSKVKR-A-----
>Species2|m.6218|1
--------------------------------------GVEFHYGTSGFRMHAD----RLDTVIFRMGLLSALRSQAIGGK-----
>Species2|m.5506|0
-----------------------------------------------------------------DIHFHYGTAGFRSPAK----

I want the output to be this:

>Species1|sequenceID1|1
MSDIAKE-----KLA--ALMLEFPKPA-----------GVILGYGTAGFRSRAD----ILPWIMIRIGILASLRSKVKQ-A-----
>Species2|m.5506|0
-----------------------------------------------------------------DIHFHYGTAGFRSPAK-----

It looks as though SeqIO in BioPerl or Biopython could do it tidily, but I'm not quite there with my programming skills.

Any help would be deeply appreciated!

fasta biopython bioperl • 2.3k views
ADD COMMENT
1
Entering edit mode
  1. Grep grep > all the header lines from file 1 into a new file.
  2. Use grep -A1 -F -f. Google these parameters to know what they do and how they are used.
ADD REPLY
0
Entering edit mode

Using grep is a fine solution, but I want to point out that grep -A1 will only return the first line after the matched header line. I'm assuming the OP has some multiline sequences in her FASTA file.

ADD REPLY
0
Entering edit mode

You are right. Mine won't work if it is a multiline fasta file.

ADD REPLY
0
Entering edit mode

Thanks! I was overcomplicating it and knew there should be a straightforward solution. I have already run the files through a program to remove newlines, so this should be ideal.

ADD REPLY
0
Entering edit mode

What you mean by "remove newlines"? The fasta sequence should be contained within a single line for "-A 1" to work. That's it.

ADD REPLY
0
Entering edit mode

I think "remove newlines" was just in the context to the sequence portion of the file - to make multiline sequences into single lines.

ADD REPLY
0
Entering edit mode
9.2 years ago
$ pip install --user pyfaidx
$ faidx -i File1.fasta | cut -f1 | xargs faidx File2.fasta > File3.fasta

This command will grab a list of the sequence names in File1, one per line, then pass those back to faidx as regions to grab from File2, and send the results to File3.

pyfaidx also has an object-oriented interface you can use within a Python script for when you outgrow the "faidx" utility.

ADD COMMENT
0
Entering edit mode
9.2 years ago
venu 7.1k

Hello, go like this

  1. grep '^>' file1.fasta > file1_headers.txt
  2. Download 'faSomeRecords' from the link http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ and change its mode by chmod +x faSomeRecords
  3. ./faSomeRecords file2.fasta file1_headers.txt newFile2.fasta

Now your newFile2.fasta contains 65 sequences with the id from file1.fasta and sequence information of file2.fasta

ADD COMMENT
0
Entering edit mode
9.2 years ago
5heikki 11k

This only works IF 1) there are no linebreaks in the seqs and 2) there are no tabs in the headers. The tab in -t " " is a literal one (ctrl+v+tab).

join -1 1 -2 1 -t "       " -o 2.1,2.2 <(cat File1.fasta | tr "\n" "\t" | tr ">" "\n" | sort -k1,1) <(cat File2.fasta | tr "\n" "\t" | tr ">" "\n" | sort -k1,1) | awk 'BEGIN {FS="\t"} {print ">"$1"\n"$2}' | tail -n +3

>Species1|sequenceID1|1
MSDIAKE-----KLA--ALMLEFPKPA-----------GVILGYGTAGFRSRAD----ILPWIMIRIGILASLRSKVKQ-A-----
>Species2|m.5506|0
-----------------------------------------------------------------DIHFHYGTAGFRSPAK----
ADD COMMENT

Login before adding your answer.

Traffic: 1329 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6