I have a multifasta file that looks like this:
( all sequences are >100bp, more than one line, and same lenght )
>Lineage-1-samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage-2-samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG
>Lineage-3-samplenameC
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage-3-samplenameD
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
I need to remove the duplicates BUT keep at least on sequence per lineage. So in this simple example (Notice samplenameA,C and D are identical) above I would want to remove only samplenameD or samplenameC but not both of them. In the end I want to get the same header information as in the original file.
Example output:
>Lineage-1-samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage-2-samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG
>Lineage-3-samplenameC
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
I found out a way that works to remove just the duplicates. Thanks to Pierre Lindenbaum.
sed -e '/^>/s/$/@/' -e 's/^>/#/'
file.fasta |\
tr -d '\n' | tr "#" "\n" | tr "@"
"\t" |\
sort -u -t ' ' -f -k 2,2 |\
sed -e 's/^/>/' -e 's/\t/\n/'
Running this on my example above would result in:
>Lineage-1-samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage-2-samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG
—> so losing the lineage 3 sequence
Now I’m just looking for a quick solution to remove duplicates but keep at least one sequence per lineage based on the fasta header.
I’m new to scripting... any ideas in bash/python/R are welcome.
Thanks!!!
Try
dedupe.sh
orclumpify.sh
from BBMap suite of programs.In Python you need something like that
Thanks. I tried this but the output was all the sequences including the duplicates.
test.fa is OP fasta and download seqkit here
That works for removing the duplicates indeed. But it is not what I need, like you show it removes both Lineage-3- samples while I need to keep one of them. This is because I need to have information about every lineage. But I’m guessing it will be easier to split my file into X files each containing all sequences of one of the lineages and on those run the command you suggest or the code in my question which does the same thing.