Question: Removing Redundant Amino Acid Sequences From Fasta - *But Also Give The Groups Of Redundant Acc Numbers*
angel.roey wrote:

Is there a way to remove redundant amino acid sequences from a FASTA file and also output the redundant accession numbers in groups, like mothur's unique.seqs command (which unfortunately only works on nucleic acid data)? The accession number output should look like this (or similar):

G9SS7BA01AM9A3    G9SS7BA01AM9A3,G9SS7BA01EMTMV,G9SS7BA01CYG40,G9SS7BA01AWI8Z,G9SS7BA01AFVJC,G9SS7BA01BCZCD,G9SS7BA01DBN7B,G9SS7BA01CZ7GO,G9SS7BA01C05FB
G9SS7BA01EAKDX    G9SS7BA01EAKDX,G9SS7BA01B1MNY
G9SS7BA01C2SRQ    G9SS7BA01C2SRQ,G9SS7BA01AK1UJ,G9SS7BA01BLVCZ,G9SS7BA01ARMFA
G9SS7BA01BQ5UG    G9SS7BA01BQ5UG,G9SS7BA01BZ9XF
G9SS7BA01BD4F9    G9SS7BA01BD4F9

Each row is a group of identical sequences, and the first column is the one kept in the 'uniques' file.

USEARCH only outputs a file with the unique seqs.
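
For exact duplicates, the grouping described above can be sketched in plain Python without any external tool. This is only a minimal illustration, not what mothur or USEARCH do internally: the function names (`read_fasta`, `dereplicate`, `names_rows`) are made up for this example, the accession is assumed to be the first whitespace-delimited token of each header, and sequences are compared case-insensitively as whole strings.

```python
def read_fasta(lines):
    """Yield (accession, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Assumption: the accession is the first token of the header line.
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dereplicate(records):
    """Group accessions by identical sequence, keeping first-seen order."""
    groups = {}
    for accession, sequence in records:
        groups.setdefault(sequence.upper(), []).append(accession)
    return groups


def names_rows(groups):
    """One row per group: kept accession, a tab, then all members comma-joined."""
    return [f"{ids[0]}\t{','.join(ids)}" for ids in groups.values()]
```

To use it on a file, open the FASTA, pass the file object to `read_fasta`, and print each row from `names_rows(dereplicate(...))`; the unique sequences themselves are the keys of the dictionary returned by `dereplicate`.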

fasta amino-acids • 2.5k views
modified 20 months ago by Eslam Samir • written 5.4 years ago by angel.roey
Damian Kao wrote:

You can use CD-HIT (http://weizhong-lab.ucsd.edu/cd-hit/) and parse the resulting cluster file into tab delimited format.

The only problem you might face with CD-HIT is that if your sequence IDs are really long, the cluster output file will truncate the names automatically. You might have to rename the sequences in your FASTA file to shorter names first and then map the original names back afterwards.
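
Parsing the cluster file into tab-delimited rows could look roughly like the sketch below. It assumes the usual .clstr layout: a `>Cluster N` line followed by member lines such as `0	120aa, >SEQ_ID... *`, where `*` marks the representative and other members end in `at NN.NN%`. The function names are made up for this example.

```python
import re

# Member line: index, length with "aa" or "nt", ">name...", then "*" or "at ...%".
MEMBER = re.compile(r"^\d+\s+\d+(?:aa|nt),\s+>(.+)\.\.\.\s+(\*|at\s.*)$")


def parse_clstr(lines):
    """Return a list of clusters, each a dict with 'rep' and 'members'."""
    clusters = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            clusters.append({"rep": None, "members": []})
            continue
        match = MEMBER.match(line)
        if match is None or not clusters:
            continue  # skip anything that does not look like a member line
        name, status = match.groups()
        clusters[-1]["members"].append(name)
        if status == "*":
            clusters[-1]["rep"] = name
    return clusters


def clstr_to_rows(clusters):
    """One row per cluster: representative, tab, representative plus other members."""
    rows = []
    for cluster in clusters:
        rep = cluster["rep"]
        if rep is None:
            continue  # malformed cluster without a representative
        others = [m for m in cluster["members"] if m != rep]
        rows.append(f"{rep}\t{','.join([rep] + others)}")
    return rows
```

Keeping only the clusters with more than one member is then a matter of filtering on `len(cluster["members"]) > 1` before building the rows.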

modified 5.4 years ago • written 5.4 years ago by Damian Kao

You can change the maximum allowed length of the description in the CD-HIT cluster file with the -d option (the default is 20 characters; -d 0 keeps the name up to the first whitespace).

written 5.4 years ago by cts

Nice. I didn't know that. I guess I should really go through the options.

written 5.4 years ago by Damian Kao

Thanks! Is there any way to influence what gets written to the CLSTR file, e.g. to include only sequences with redundancy, or to change the format?

written 5.4 years ago by angel.roey

Doesn't matter, USEARCH does all I need.

written 5.4 years ago by angel.roey
cts wrote:

From memory, USEARCH should give you the cluster file if you pass it the -uc <FILENAME> option.
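
The file written by -uc can be turned into the grouped rows from the question with a few lines of Python. The sketch below assumes the standard tab-separated UC layout with ten columns, where column 1 is the record type (S for a cluster seed, H for a hit, C for a cluster summary), column 9 is the query label, and column 10 is the target label for H records. The function names are made up for this example.

```python
def parse_uc(lines):
    """Map each seed label to the list of labels in its cluster (seed first)."""
    groups = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # not a complete UC record
        record_type, query, target = fields[0], fields[8], fields[9]
        if record_type == "S":
            # A seed starts a new cluster and is its own first member.
            groups.setdefault(query, [query])
        elif record_type == "H":
            # A hit joins the cluster of the seed it matched.
            groups.setdefault(target, [target]).append(query)
        # C (summary) and other record types add no new members.
    return groups


def uc_to_rows(groups):
    """One row per cluster: seed, tab, all members comma-joined."""
    return [f"{seed}\t{','.join(members)}" for seed, members in groups.items()]
```

If the FASTA headers carry descriptions after the accession, the labels in the UC file may include them too, so they might need trimming to the first token before writing the rows.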

written 5.4 years ago by cts

Thanks! It does, and in a nicer format than CD-HIT.

written 5.4 years ago by angel.roey
Eslam Samir wrote:

Here is my free program on GitHub, Sequence database curator (https://github.com/Eslam-Samir-Ragab/Sequence-database-curator).

It is a very fast program and it can deal with:

  1. Nucleotide sequences
  2. Protein sequences

It runs on the following operating systems:

  1. Windows
  2. Mac
  3. Linux

It supports the following formats:

  1. Fasta format
  2. Fastq format

Best Regards

written 20 months ago by Eslam Samir