Question: Remove extra sequences without unique names from fasta
0
4 weeks ago by nmkn
nmkn wrote:

I have hundreds of large fasta files (with almost a hundred sequences in each). In some files, I have duplicate sequence NAMES, but the sequences themselves are NOT duplicates. I have found other similar posts, but they want to remove duplicate sequences: How To Remove The Same Sequences In The Fasta Files?, Remove Duplicates In Fasta (Protein Seq.)

I simply want to keep one sequence for each duplicated name and remove the others that share that name (even though their sequences differ). I'd think this could be a simple bash command, but I can't find a solution. I first tried to count the number of non-unique sequence names, but even that didn't work:

grep ">" fasta.file | uniq -c

  1 >Sample1
  1 >Sample2
  1 >Sample3
  1 >Sample1

Any suggestions for a simple bash script or other? Here's my sample fasta:

>Sample1
tctctccttt
>Sample2
tctctcattt
>Sample3
tctctccttt
>Sample1
tctctccttg

So in this case, I want to keep the first Sample1 and remove the second. I have very limited scripting/bioinformatics experience, so I greatly appreciate the help.

modified 4 weeks ago by Pierre Lindenbaum • written 4 weeks ago by nmkn

If the sequences themselves aren't unique, how do you know which sequence you want to keep for those with duplicated names?

I could quickly throw together a python script that will do what you're asking, but maybe someone has a quicker solution.

written 4 weeks ago by goodez

The short answer is it doesn't matter which one I keep.

written 4 weeks ago by nmkn

By the way, the reason your grep | uniq -c did not work is that uniq only collapses adjacent duplicate lines, so you need to sort before piping to uniq. So:

grep "^>" file.fa | sort | uniq -c
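For example, on the sample fasta from the question (written here to a stand-in file `sample.fa`), the sorted count flags the duplicated name:

```shell
# Sample fasta from the question (stand-in filename).
printf '>Sample1\ntctctccttt\n>Sample2\ntctctcattt\n>Sample3\ntctctccttt\n>Sample1\ntctctccttg\n' > sample.fa

# sort groups identical headers so uniq -c can count them.
grep "^>" sample.fa | sort | uniq -c
# 2 >Sample1
# 1 >Sample2
# 1 >Sample3
```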
written 4 weeks ago by goodez
2
4 weeks ago by
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Linearize + sort with the unique flag + restore fasta:

 cat in.fasta in.fasta |\
 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' |\
 sort -t $'\t' -k1,1 -u |\
 tr "\t" "\n"
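As a sketch, here is the pipeline run on the sample fasta from the question (stand-in filename `in.fasta`; note that which of the two Sample1 sequences survives is up to sort):

```shell
# Sample fasta with a duplicated name.
printf '>Sample1\ntctctccttt\n>Sample2\ntctctcattt\n>Sample3\ntctctccttt\n>Sample1\ntctctccttg\n' > in.fasta

# Linearize to "name<TAB>sequence", keep one line per name, restore fasta.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' in.fasta |
  sort -t $'\t' -k1,1 -u |
  tr "\t" "\n"
# One record each for Sample1, Sample2, Sample3.
```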
modified 4 weeks ago • written 4 weeks ago by Pierre Lindenbaum

Just to clarify the steps:

  1. (cat) I concatenate all hundreds of my fasta files into ONE fasta file?
  2. (awk) this is the linearize step

How do the next two steps work? I tried providing the input file, but it errors:

sort -t $'\t' -k1,1 -u |\ < fasta.file_linearize.out
modified 4 weeks ago • written 4 weeks ago by nmkn
1

It looks like the awk step essentially converts standard fasta format into a format with one entry per line. From there, it's possible to use sort to keep only unique lines based on the fasta ID. The final step converts it back to normal fasta format (replacing the tabs with newlines).
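To make that concrete, here is the linearized intermediate for the sample fasta (a sketch; each record becomes one `name<TAB>sequence` line, with the tab rendered as spaces in the output comments):

```shell
# Sample fasta (stand-in filename).
printf '>Sample1\ntctctccttt\n>Sample2\ntctctcattt\n>Sample3\ntctctccttt\n>Sample1\ntctctccttg\n' > sample.fa

# Linearize only (the first step of Pierre's pipeline).
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' sample.fa
# >Sample1    tctctccttt
# >Sample2    tctctcattt
# >Sample3    tctctccttt
# >Sample1    tctctccttg
```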

written 4 weeks ago by goodez

Thanks for the explanation. I can get the awk step to work using only one fasta file, but when I concatenate two files and use that as input, it doesn't work:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' |\ < sample_file > sample_linear.out

Also, is the above sort line correct to provide the input file?

written 4 weeks ago by nmkn
1

If you are executing the code on a single line as you wrote it, then remove the \ characters, because those are meant to escape the newlines in Pierre's multi-line version. I recommend structuring it just as Pierre did, in a text file.

Try nano rm_dup_fasta.sh, then copy Pierre's code into the file.

chmod 755 rm_dup_fasta.sh will make the file executable.

Then ./rm_dup_fasta.sh to run the bash script.
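Putting those steps together, the script file might look like this (a sketch that assumes the input file is named `in.fasta`; adjust to your filenames):

```shell
# Write Pierre's pipeline into rm_dup_fasta.sh (in.fasta is a stand-in input name).
cat > rm_dup_fasta.sh <<'EOF'
#!/bin/bash
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' in.fasta |
  sort -t $'\t' -k1,1 -u |
  tr "\t" "\n"
EOF
chmod 755 rm_dup_fasta.sh

# A small test input with one duplicated name, then run the script.
printf '>Sample1\ntctctccttt\n>Sample2\ntctctcattt\n>Sample1\ntctctccttg\n' > in.fasta
./rm_dup_fasta.sh
```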

modified 4 weeks ago • written 4 weeks ago by goodez

I concatenate all hundreds of my fasta files into ONE fasta file?

No — `cat in.fasta in.fasta` was just me testing that running my pipeline on the same fasta file twice will output only one record per ID.

How do the next two steps work?

http://man7.org/linux/man-pages/man1/sort.1.html

I tried providing the input file, but it errors:

https://stackoverflow.com/a/3871336/58082

sort -t $'\t' -k1,1 -u |\ < fasta.file_linearize.out

this is not valid bash syntax.
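For the record, the fix is to drop the trailing `|\` and give sort the file directly, either as an argument or via `<` (a sketch; the linearized file content here is a stand-in):

```shell
# A stand-in linearized file: one "name<TAB>sequence" line per record.
printf '>Sample1\ttctctccttt\n>Sample2\ttctctcattt\n>Sample1\ttctctccttg\n' > fasta.file_linearize.out

# Filename as an argument (no trailing pipe), then restore fasta format.
sort -t $'\t' -k1,1 -u fasta.file_linearize.out | tr "\t" "\n"
```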

written 4 weeks ago by Pierre Lindenbaum

+1 for linking that very useful GitHub page. I never thought of linearizing a fasta before. Do you know if that awk line works for all fasta files, whether or not there are newlines within the sequences?
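A quick check suggests it does handle wrapped sequences, since every non-header line is concatenated onto the current record (a sketch with a hypothetical two-line sequence):

```shell
# A record whose sequence wraps across two lines.
printf '>SampleA\ntctct\nccttt\n>SampleB\natgc\n' > wrapped.fa

# The linearize awk joins the wrapped lines into one sequence field.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' wrapped.fa
# >SampleA    tctctccttt
# >SampleB    atgc
```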

written 4 weeks ago by goodez
0
4 weeks ago by
cpad0112 (India) wrote:

With seqkit. Input:

 $ cat test.fa 
>Sample1
tctctccttt
>Sample2
tctctcattt
>Sample3
tctctccttt
>Sample1
tctctccttg

output:

$ seqkit rmdup test.fa --quiet
>Sample1
tctctccttt
>Sample2
tctctcattt
>Sample3
tctctccttt

Download seqkit from here

With a case-insensitive awk line (modified from one of the answers here; upvote the OP). Input fasta:

>Sample1
tctctccttt
>Sample1
atgc
>Sample2
tctctcattt
>Sample3
tctctccttt
>Sample1
tctctccttg
>sample1
ttgc
>sample2
atgc
>sample4
ctga
>Sample2
tct

Output:

$ awk '/^>/ {if (!a[tolower($0)]++) {print;getline;print}}' test.fa
>Sample1
tctctccttt
>Sample2
tctctcattt
>Sample3
tctctccttt
>sample4
ctga

If you want a random sequence each time from a fasta with duplicated IDs (case insensitive):

$ seqkit fx2tab test.fa  | datamash  -sig 1 rand 2 | seqkit tab2fx
modified 4 weeks ago • written 4 weeks ago by cpad0112

This seqkit worked and was easy! If you add it as an answer, I can mark it as "accepted".

modified 4 weeks ago • written 4 weeks ago by nmkn

By default seqkit dedups by name (the -n option). However, this is case sensitive: deduping by name currently doesn't work if you want to match names case-insensitively. Case-insensitive deduping by sequence does work, though.
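If case-insensitive dedup by name is needed without seqkit, here is a flag-based awk sketch (this variant also copes with sequences wrapped over several lines; `dup.fa` is a stand-in filename):

```shell
# Names differing only by case count as duplicates.
printf '>Sample1\ntctctccttt\n>sample1\nttgc\n>Sample2\natgc\n' > dup.fa

# On each header, set "keep" if the lowercased name is new; print lines while keep is set.
awk '/^>/ {keep = !seen[tolower($1)]++} keep' dup.fa
# >Sample1
# tctctccttt
# >Sample2
# atgc
```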

written 4 weeks ago by cpad0112
Powered by Biostar version 2.3.0