Question

Removing Redundant Amino Acid Sequences From Fasta - *But Also Give The Groups Of Redundant Acc Numbers*

0

10.9 years ago

angel.roey • 0

Is there a way to remove redundant amino acid sequences from a FASTA file and also output the accession numbers of the redundant sequences in groups, just like mothur's unique.seqs command (which unfortunately only works on nucleic acid data)? The accession number output should look like this (or similar):

G9SS7BA01AM9A3    G9SS7BA01AM9A3,G9SS7BA01EMTMV,G9SS7BA01CYG40,G9SS7BA01AWI8Z,G9SS7BA01AFVJC,G9SS7BA01BCZCD,G9SS7BA01DBN7B,G9SS7BA01CZ7GO,G9SS7BA01C05FB
G9SS7BA01EAKDX    G9SS7BA01EAKDX,G9SS7BA01B1MNY
G9SS7BA01C2SRQ    G9SS7BA01C2SRQ,G9SS7BA01AK1UJ,G9SS7BA01BLVCZ,G9SS7BA01ARMFA
G9SS7BA01BQ5UG    G9SS7BA01BQ5UG,G9SS7BA01BZ9XF
G9SS7BA01BD4F9    G9SS7BA01BD4F9

Each row is a group of identical sequences, and the first column is the sequence kept in the 'uniques' file.

USEARCH only outputs a file with the unique seqs.

fasta amino-acids • 4.2k views

Updated 7.2 years ago by Eslam Samir ▴ 110 • written 10.9 years ago by angel.roey • 0

score 3 · Answer 1 · 2013-06-06

10.9 years ago

Damian Kao 16k

You can use CD-HIT (http://weizhong-lab.ucsd.edu/cd-hit/) and parse the resulting cluster (.clstr) file into tab-delimited format.

The only problem you might face with CD-HIT is that if your sequence IDs are really long, the cluster output file will truncate them automatically. You might have to rename the sequences in your FASTA file to shorter names first and then map the names back afterwards.
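As a rough illustration of the parsing step, here is a minimal Python sketch that turns a .clstr file into the two-column layout from the question. It assumes the usual .clstr layout (a `>Cluster N` header, then one member per line with the ID between `>` and `...`, and the representative marked with a trailing `*`). The function names `parse_clstr` and `format_names` are made up for this example, and the sequence lengths in the sample input are placeholders. For exact duplicates you would presumably run CD-HIT with `-c 1.0`; as far as I know, that setting can also group a shorter sequence that is fully contained in a longer one, so check the clusters if that matters for your data.

```python
import re

# ID sits between ">" and the "..." that CD-HIT appends on member lines.
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Return {representative_id: [representative_id, other member ids...]}."""
    clusters = []  # one dict per ">Cluster N" block
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            clusters.append({"rep": None, "members": []})
            continue
        match = MEMBER_RE.search(line)
        if match is None or not clusters:
            continue  # skip blank or unexpected lines
        seq_id = match.group(1)
        clusters[-1]["members"].append(seq_id)
        if line.endswith("*"):  # "*" marks the representative sequence
            clusters[-1]["rep"] = seq_id

    groups = {}
    for cluster in clusters:
        if not cluster["members"]:
            continue
        # Fall back to the first member if no "*" line was seen.
        rep = cluster["rep"] or cluster["members"][0]
        groups[rep] = [rep] + [m for m in cluster["members"] if m != rep]
    return groups


def format_names(groups, only_redundant=False):
    """Build 'rep<TAB>member1,member2,...' rows; optionally drop singletons."""
    rows = []
    for rep, members in groups.items():
        if only_redundant and len(members) < 2:
            continue
        rows.append(rep + "\t" + ",".join(members))
    return rows


# Illustrative input only: IDs taken from the question, lengths are placeholders.
example = [
    ">Cluster 0",
    "0\t120aa, >G9SS7BA01EAKDX... *",
    "1\t120aa, >G9SS7BA01B1MNY... at 100.00%",
    ">Cluster 1",
    "0\t98aa, >G9SS7BA01BD4F9... *",
]
for row in format_names(parse_clstr(example)):
    print(row)
```

To read a real file, pass an open file handle instead of the list, e.g. `parse_clstr(open("uniques.fasta.clstr"))` (the file name here is hypothetical). Setting `only_redundant=True` keeps only the groups with more than one member.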

2

You can change the maximum allowed length of the description in the CD-HIT cluster file with the -d option.

Reply • 10.9 years ago by cts ★ 1.7k

0

Nice. I didn't know that. I guess I should really go through the options.

Reply • 10.9 years ago by Damian Kao 16k

0

Thanks! Is there any way to influence what is written to the CLSTR file, e.g. to include only sequences with redundancy or to change the format?

Reply • 10.9 years ago by angel.roey • 0

0

Doesn't matter, USEARCH does all I need.

Reply • 10.9 years ago by angel.roey • 0

score 2 · Answer 2 · 2013-06-06

10.9 years ago

cts ★ 1.7k

From memory, USEARCH should give you the cluster file if you pass it the -uc <FILENAME> option.
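If the -uc file follows the usual tab-separated UC layout (record type in the first column, query label in the ninth, target label in the tenth), a short Python sketch like the one below could turn it into the grouped accession list from the question. This is only a sketch under that assumption: `parse_uc` is a name invented for this example, the sample rows are illustrative, and only the first whitespace-delimited token of each label is kept as the accession.

```python
def parse_uc(lines):
    """Return {centroid_id: [centroid_id, member ids...]} from UC-format lines.

    Assumes 10 tab-separated columns: 'S' rows introduce a centroid (seed),
    'H' rows are hits whose target label (column 10) is the centroid.
    Other record types (e.g. 'C' summary rows) are ignored.
    """
    groups = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # not a complete UC record
        record_type = fields[0]
        # Keep only the accession (first token) of the query label.
        query = fields[8].split()[0]
        if record_type == "S":
            groups.setdefault(query, [query])
        elif record_type == "H":
            target = fields[9].split()[0]
            groups.setdefault(target, [target]).append(query)
    return groups


# Illustrative rows only: IDs from the question, other columns are placeholders.
example = [
    "S\t0\t120\t*\t*\t*\t*\t*\tG9SS7BA01EAKDX\t*",
    "H\t0\t120\t100.0\t*\t0\t0\t*\tG9SS7BA01B1MNY\tG9SS7BA01EAKDX",
    "S\t1\t98\t*\t*\t*\t*\t*\tG9SS7BA01BD4F9\t*",
]
for centroid, members in parse_uc(example).items():
    print(centroid + "\t" + ",".join(members))
```

For a real run, pass the open -uc file instead of the list, e.g. `parse_uc(open("clusters.uc"))` (hypothetical file name).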

0

Thanks! It does, and in a nicer format than CD-HIT.

Reply • 10.9 years ago by angel.roey • 0

score 0 · Answer 3 · 2017-02-20

Here is my free program on GitHub, Sequence database curator (https://github.com/Eslam-Samir-Ragab/Sequence-database-curator).

It is a very fast program and it can deal with:

Nucleotide sequences
Protein sequences

It runs on the following operating systems:

Windows
Mac
Linux

It also supports the following file formats:

Fasta format
Fastq format

Best Regards