Question

compare two fasta files

0

Entering edit mode

6.5 years ago

jinutchirayath • 0

hello everyone, I want to compare 2 fasta files and discard matching sequences and print out the rest . for example file 1 :

>3-23985
UUCCACAGCUUUCUUGAACU
>9-10946
UGAAGCUGCCAGCAUGAUCU
>15-8830
UUAGAUUCACGCACAAACUCG
>17-8260
UUCCACAGCUUUCUUGAACUU
>35-5014
UUCCACAGCUUUCUUGAACC

file 2

>3-23985
UUCCACAGCUUUCUUGAACU
>15-8830
UUAGAUUCACGCACAAACUCG
>35-5014
UUCCACAGCUUUCUUGAACC

Desired output :

>9-10946
UGAAGCUGCCAGCAUGAUCU
>17-8260
UUCCACAGCUUUCUUGAACUU

Please do help me on this regard

Thank you

biopython awk bioawk fasta • 2.0k views

ADD COMMENT • link 6.5 years ago by jinutchirayath • 0

score 2 · Answer 1 · 2017-11-08

2

Entering edit mode

6.5 years ago

Pierre Lindenbaum 161k

Assuming there are only two lines per sequence and . (else use this).

cat   in1.fa  in2.fa | paste - - | sort | uniq -u | tr "\t" "\n"

ADD COMMENT • link 6.5 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

$ cat test1.fa test2.fa | paste --  | sort | uniq -u | tr "\t" "\n"
>17-8260
>9-10946
UGAAGCUGCCAGCAUGAUCU
UUCCACAGCUUUCUUGAACUU

.

$ cat test1.fa test2.fa | paste - - | sort | uniq -u | tr  '\t' '\n'
>17-8260
UUCCACAGCUUUCUUGAACUU
>9-10946
UGAAGCUGCCAGCAUGAUCU

.

 $ uname -a
Linux name 4.8.0-53-generic #56~16.04.1-Ubuntu SMP Tue May 16 01:18:56 UTC 2017 x86_64 GNU/Linux
 $ tr --version
tr (GNU coreutils) 8.28
Copyright (C) 2017 Free Software Foundation, Inc.

ADD REPLY • link 6.5 years ago by cpad0112 21k

0

Entering edit mode

this is not

paste --

but

paste - -

ADD REPLY • link 6.5 years ago by Pierre Lindenbaum 161k

score 1 · Answer 2 · 2017-11-08

From the example, I'm assuming, you have single line fasta (if not linearise) and you would like to extract unique fasta entries from file1 compared to file2

grep -F -x -v -f <(grep '^>' f2.txt) <(grep '^>' f1.txt) | while read -r ID; do grep "$ID" -A 1 f1.txt ; done

You can change f1.txt and f2.txt places if you want the results in otherway around.

score 1 · Answer 3 · 2017-11-08

$ grep ">" -h test2.fa test1.fa | sort | uniq -u | grep  --no-group-separator -A 1 -f - test1.fa

output:

>9-10946
UGAAGCUGCCAGCAUGAUCU
>17-8260
UUCCACAGCUUUCUUGAACUU

input:test1.fa and test2.fa

test1.fa

$ cat test1.fa 
>3-23985
UUCCACAGCUUUCUUGAACU
>9-10946
UGAAGCUGCCAGCAUGAUCU
>15-8830
UUAGAUUCACGCACAAACUCG
>17-8260
UUCCACAGCUUUCUUGAACUU
>35-5014
UUCCACAGCUUUCUUGAACC

test2.fa

$ cat test2.fa 
>3-23985
UUCCACAGCUUUCUUGAACU
>15-8830
UUAGAUUCACGCACAAACUCG
>35-5014
UUCCACAGCUUUCUUGAACC

score 1 · Answer 4 · 2017-11-08

Here's a Python-based way to do this, that does not use sorting:

#!/usr/bin/env python

import sys

if len(sys.argv) != 3:
    raise SystemError("Usage: ./filter_fasta.py target.fa query.fa")

target = sys.argv[1]
query = sys.argv[2]

m = {}
k = None
v = ''

with open(query, "r") as qfh:
    for line in qfh:
        line = line.strip()
        if line.startswith('>'):
            if k:
                m[k] = v
                v = ''
                k = line
            else:
                v += line
    m[k] = v

k = None
v = ''

with open(target, "r") as rfh:
    for line in rfh:
        line = line.strip()
        if line.startswith('>'):
            if k and (k not in m):
                sys.stdout.write("%s\n%s\n" % (k, v))
            k = line
            v = ''
        else:
            v += line
    if k and (k not in m):
        sys.stdout.write("%s\n%s\n" % (k, v))

Output:

$ ./filter_fasta.py f1.fa f2.fa 
>9-10946
UGAAGCUGCCAGCAUGAUCU
>17-8260
UUCCACAGCUUUCUUGAACUU

You could probably use the other approaches if your inputs are small or don't need much preprocessing.

Here are a couple advantages of my approach:

Other approaches require sorting. For large datasets, sorting can get expensive in time. My approach reads through each file once to make hash tables ("dictionaries" in Python-speak), and as O(n+m) < O(nlogn + mlogm) you'll end up spending a lot less time making hash tables than on sorting, if your inputs are very large.
You don't need to linearize the FASTA file inputs. This script takes in multiline FASTA.

score 0 · Answer 5 · 2017-11-09

0

Entering edit mode

6.5 years ago

jinutchirayath • 0

Dear all.. Thank you so much for your prompt reply ... It really helped me in my work...

ADD COMMENT • link 6.5 years ago by jinutchirayath • 0