Compare Two Fasta Files By Headers
7
8
Entering edit mode
12.8 years ago
Peixe ▴ 660

Hi everyone; this is my first question on the forum.

How can I compare if two fasta files contain the same sequence headers?

Does any BioPython module exist for doing this?

Thanks in advance, peixe

fasta comparison sequence biopython • 13k views
ADD COMMENT
1
Entering edit mode

So you should edit your question to read "contain the same sequence headers", rather than "the same sequences" which is a different check.

ADD REPLY
0
Entering edit mode

welcome ! Does each fasta file contain more than one sequence ?

ADD REPLY
0
Entering edit mode

Yes! They have exactly 8 sequences each, and I need to check if sequences names match between the two fasta files, and return the differences, if exist. Thank u, @pierre! ;)

ADD REPLY
0
Entering edit mode

if it's a comparison of sequence IDs, then it'd be as easy as 'grepping' all the IDs, and then comparing the two lists in Excel or with other online tools (e.g. http://jura.wi.mit.edu/bioc/tools/compare.php)

ADD REPLY
0
Entering edit mode

Ok, thank you all guys!

ADD REPLY
7
Entering edit mode
12.8 years ago

In BioPython you could do the following:

from Bio import SeqIO

def get_ids(fname):
    reader = SeqIO.parse(fname, 'fasta')
    ids = map(lambda x: x.id, reader)
    return set(ids)

s1 = get_ids('f1.fasta')
s2 = get_ids('f2.fasta')

print 'Common:', s1 & s2
print 'In f1 and not f2:', s1 - s2
print 'In f2 and not f1:', s2 - s1

prints for my testcase:

Common: set(['a', 'c', 'b'])
In f1 and not f2: set(['x'])
In f2 and not f1: set(['y'])
ADD COMMENT
0
Entering edit mode

You could shorten that with a generator expression,

s1 = setrec.id for rec in SeqIO.parse('f1.fasta'))
s2 = setrec.id for rec in SeqIO.parse('f2.fasta'))

I find this much simpler than introducing a function to do it.

ADD REPLY
0
Entering edit mode

This works if you add the 'fasta' format information to the Seq.IO.parse call:

s1 = setrec.id for rec in SeqIO.parse('f1.fasta', 'fasta'))
s2 = setrec.id for rec in SeqIO.parse('f2.fasta', 'fasta'))

However, an advantage of using the function that Istvan mentioned above is that you can create a function to extract part of the fast description (such as the accession number) with his method that you cannot with this shorter form. For example:

from Bio import SeqIO

def parse_accession(record_id):
    data = record_id.split('|')
    return data[3]

def get_ids(fname):
    reader = SeqIO.parse(fname, 'fasta')
    ids = map(lambda x: parse_accessionx.id), reader)
    return set(ids)

s1 = get_ids('f1.fasta')
s2 = get_ids('f2.fasta')

print 'Common:', s1 & s2
print 'In f1 and not f2:', s1 - s2
print 'In f2 and not f1:', s2 - s1
ADD REPLY
12
Entering edit mode
12.8 years ago
Pablo ★ 1.9k

Assuming you have two files (file1.fa and file2.fa) with multiple sequences, you can do something like this:

cat file1.fa | tr "[a-z]" "[A-Z]" | sed "s/^>.*/;/"  | tr -d "\n" | tr ";" "\n" | sort > file1.txt

cat file2.fa | tr "[a-z]" "[A-Z]" | sed "s/^>.*/;/"  | tr -d "\n" | tr ";" "\n" | sort > file2.txt

diff file1.txt file2.txt

A brief explanation:

  - tr "[a-z]" "[A-Z]" : Transform letters to upper case
  - sed "s/^>.*/;/" : Transform fasta annotations to a semicolon
  - tr -d "\n" : delete all newline characters
  - tr ";" "\n" : transform semicolons to newlines

At this point you get a file with one fasta sequence per line. Then just sort and compare.

ADD COMMENT
0
Entering edit mode

If you just want to compare the headers, then the commands are: cat file.fa | grep "^>" | sort > header.txt Do that for both files and then compare the TXT files using "diff"

ADD REPLY
0
Entering edit mode

Ok, nice answer..! It did the job.! Thank u @Pablo. Its always nice to see the explanation..

ADD REPLY
4
Entering edit mode
12.8 years ago

linearize the fasta as a tab delimited file, remove the empty (first) line, sort on the sequence column. Then join on the second column and cut to only keep the name having the same sequence.

cat file1.fasta  |\
awk '/^>/ {printf("\n%s\t",$0); next;} {printf("%s",toupper($0));} END{ printf("\n");}' |\
egrep -v '^$' |\
sort -t '   ' -k2,2 > file1.tsv

cat file2.fasta  |\
awk '/^>/ {printf("\n%s\t",$0); next;} {printf("%s",toupper($0));} END{ printf("\n");}' |\
egrep -v '^$' |\
sort -t '   ' -k2,2 > file2.tsv

join -t '  ' -1 2 -2 2 file1.tsv file2.tsv | cut -d '      ' -f 2,3
ADD COMMENT
2
Entering edit mode
12.8 years ago
Benm ▴ 710

It can be more shorter if the header is case sensitive:

awk '/^>/{if (a[$1]>=1){print $1}a[$1]++}' file1 file2
ADD COMMENT
1
Entering edit mode
12.6 years ago
Marvin ▴ 890

cmp <( grep '^>' file1 ) <( grep '^>' file2 )

This requires bash (not tcsh), and only tells you if the sequence headers are exactly the same in exactly the same order. Substitute "diff" for "cmp" to get a list of differences, but that will eat memory if there are lots of sequences (say, 100s of thousands). If the order shouldn't matter, sort the intermediate streams:

cmp <( grep '^>' file1 | sort ) <( grep '^>' file2 | sort )

(Also note that Fasta is a remarkably cumbersome format to use. You might be served better with BAM/SAM.)

ADD COMMENT
0
Entering edit mode
12.6 years ago
User 8087 ▴ 30

from Bio import SeqIO

def get_ids(fname):
    reader = SeqIO.parse(fname, 'fasta')
    ids = map(lambda x: x.id, reader)
    return set(ids)

s1 = get_ids('f1.fasta')
s2 = get_ids('f2.fasta')

print 'Common:', s1 & s2
print 'In f1 and not f2:', s1 - s2
print 'In f2 and not f1:', s2 - s1

Hi, there, regarding this answer, I am trying to do the same, except to output the results to different output files for each set of results, but I am new to Bioinformatics and I can't seem to get the output format right? Instead of Print I I try SeqIO.write("output_common.tab" 'w') and variants of this but can't seem to get it to output the information to a file instead of the terminal window? Sorry for such a newbie question but I can't seem to get it right.

ADD COMMENT
0
Entering edit mode

I think you must give your sequences a format... 'fasta'; or whatever. Remeber to close the filehandle after this function call.

This is from the manual of BioPython...

    SeqIO.write(sequences, handle, format)

where: sequences - A list (or iterator) of SeqRecord objects, or (if using Biopython 1.54 or later) a single SeqRecord.

handle - File handle object to write to, or filename as string

format - lower case string describing the file format to write.

Next time, ask this on the answer you are refering to. Hope this could help! (:

ADD REPLY
0
Entering edit mode
12.6 years ago
User 8087 ▴ 30

from Bio import SeqIO

def get_ids(fname):
    reader = SeqIO.parse(fname, 'fasta')
    ids = map(lambda x: x.id, reader)
    return set(ids)

s1 = get_ids('f1.fasta')
s2 = get_ids('f2.fasta')

print 'Common:', s1 & s2
print 'In f1 and not f2:', s1 - s2
print 'In f2 and not f1:', s2 - s1

Hi there, I can't seem to get this to work with trying to compare the following 2 FASTA files: f1.fasta f2.fasta Command$: python FASTA_diff_test.py f1.fasta f2.fasta

This is the error message I get back: Traceback (most recent call last): File "FASTA_diff_test.py", line 8, in [?] s1 = get_ids('f1.fasta') File "FASTA_diff_test.py", line 4, in get_ids reader = SeqIO.parse(fname, 'fasta') File "/usr/lib/pymodules/python2.6/Bio/SeqIO/init.py", line 424, in parse raise TypeError("Need a file handle, not a string (i.e. not a filename)") TypeError: Need a file handle, not a string (i.e. not a filename)

I can't figure out where the problem is, does anyone have any ideas?

I'm new to python and programming in general. I got the exact same thing (or at least I thought it was!) to work on Monday but for some reason it won't work now! Best Wishes,

Laura

ADD COMMENT

Login before adding your answer.

Traffic: 2618 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6