Remove duplicates from fasta keeping at least one sequence per group based on header
1
1
Entering edit mode
3.8 years ago
vixelaa ▴ 20

I have a multifasta file that looks like this:

( all sequences are >100bp, more than one line, and same lenght )

>Lineage-1-samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage-2-samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG
>Lineage-3-samplenameC
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage-3-samplenameD
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA

I need to remove the duplicates BUT keep at least on sequence per lineage. So in this simple example (Notice samplenameA,C and D are identical) above I would want to remove only samplenameD or samplenameC but not both of them. In the end I want to get the same header information as in the original file.

Example output:

>Lineage-1-samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage-2-samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG
>Lineage-3-samplenameC
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA

I found out a way that works to remove just the duplicates. Thanks to Pierre Lindenbaum.

sed -e '/^>/s/$/@/' -e 's/^>/#/'
file.fasta  |\
tr -d '\n' | tr "#" "\n" | tr "@"
"\t" |\
sort -u -t '  ' -f -k 2,2  |\
sed -e 's/^/>/' -e 's/\t/\n/'

Running this on my example above would result in:

>Lineage-1-samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage-2-samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG

—> so losing the lineage 3 sequence

Now I’m just looking for a quick solution to remove duplicates but keep at least one sequence per lineage based on the fasta header.

I’m new to scripting... any ideas in bash/python/R are welcome.

Thanks!!!

scripting fasta • 1.9k views
ADD COMMENT
1
Entering edit mode

Try dedupe.sh or clumpify.sh from BBMap suite of programs.

ADD REPLY
0
Entering edit mode

In Python you need something like that

from Bio import SeqIO

id_list = []
with open("output.fasta", "w") as out:
    for record in SeqIO.parse(open("fasta.fasta"), "fasta"):
        if record.id.split("_")[0] in id_list:
            continue
        else:
            id_list.append(record.id.split("_")[0])
            SeqIO.write(record, out, "fasta")
ADD REPLY
0
Entering edit mode

Thanks. I tried this but the output was all the sequences including the duplicates.

ADD REPLY
0
Entering edit mode
$ seqkit rmdup --quiet  -s test.fa
>Lineage-1-samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGAC
GAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage-2-samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGAC
GAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG

test.fa is OP fasta and download seqkit here

ADD REPLY
0
Entering edit mode

That works for removing the duplicates indeed. But it is not what I need, like you show it removes both Lineage-3- samples while I need to keep one of them. This is because I need to have information about every lineage. But I’m guessing it will be easier to split my file into X files each containing all sequences of one of the lineages and on those run the command you suggest or the code in my question which does the same thing.

ADD REPLY
2
Entering edit mode
3.7 years ago

with seqkit:

$ seqkit --quiet --id-regexp '(^\w+-[0-9])' rmdup test.fa

>Lineage-1-samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGAC
GAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage-2-samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGAC
GAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG
>Lineage-3-samplenameC
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGAC
GAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA

Regex is as per OP headers. Previous solutions are incorrect.

ADD COMMENT
0
Entering edit mode

Thanks for your help. Really appreciate it. This works on this simple example but only outputs one sequence per lineage. Removing all the rest. So it’s also not the desired output as I lose sequences that are not duplicates within each lineage. Maybe I’m wrong! But I think to achieve what I need, I will need to split my file into one file per lineage and remove the dups in every of those files and then merge them back together.

ADD REPLY
0
Entering edit mode

Anyway what you suggest is actually a correct solution to keep only one sequence per lineage! But this is sadly removing too many sequences for my data.

ADD REPLY

Login before adding your answer.

Traffic: 1985 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6