Question

NetNGlyc: "input sequence names are not unique": what does this mean?

0

3.1 years ago

mauricio.1313 • 0

Hi!

I'm working with the NetNGlyc 1.0 server to predict N-glycosylation sites in protein sequences. I checked my sequences for repeated entries and for differences in length, but neither is the case. Does anyone know what this message means:

input sequence names are not unique??

Any comment is welcome!

Thanks!

gene sequence glycosilation site-predictor • 1.3k views

ADD COMMENT • link updated 3.1 years ago by Mensur Dlakic ★ 27k • written 3.1 years ago by mauricio.1313 • 0

score 2 · Accepted Answer · 2021-03-19

2

3.1 years ago

GenoMax 141k

Generally, many programs ignore everything in a FASTA header past the first whitespace. For example:

>Protein1 Version_1
TYEHSTSU
>Protein1 Version_2
HKHKEYSI

would not be considered unique even though the sequences are different. The names are truncated to Protein1 in both cases, making them non-unique. You can replace the spaces with _ so the names become unique.

Not sure if that is what is happening with NetNGlyc, but it is something you can check.
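The underscore replacement could be sketched roughly as follows. The file names demo.fas and demo_fixed.fas are made up for illustration, and this assumes the only problem is spaces in the header lines:

```shell
# Demo input: two headers that differ only after the first space.
printf '>Protein1 Version_1\nTYEHSTSU\n>Protein1 Version_2\nHKHKEYSI\n' > demo.fas

# On header lines only (those starting with ">"), replace every space with "_".
# Sequence lines are left untouched.
sed '/^>/ s/ /_/g' demo.fas > demo_fixed.fas

cat demo_fixed.fas
```

Writing to a new file rather than editing in place keeps the original headers around in case you need to map results back to them.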

ADD COMMENT • link 3.1 years ago by GenoMax 141k

0

This is a great explanation; however, I checked my data and this is not the case.

Anyway, thanks for the comment!

ADD REPLY • link 3.1 years ago by mauricio.1313 • 0

1

Another thing to consider is that some programs go even further and use only a certain number of characters of each name. For example:

>Protein1_Version_1
TYEHSTSU
>Protein1_Version_2
HKHKEYSI

If only the first 8 characters or fewer are considered in the NAME field, then these two names become non-unique.
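A quick way to see this effect is to truncate the names yourself and list any that collide. This is only a sketch: the 8-character cutoff is the hypothetical value from the example above, not a documented NetNGlyc limit, and demo_trunc.fas is a made-up file name.

```shell
# Demo input: names that differ only after the 8th character.
printf '>Protein1_Version_1\nTYEHSTSU\n>Protein1_Version_2\nHKHKEYSI\n' > demo_trunc.fas

# Take header lines, skip the leading ">" (character 1), keep the next
# 8 characters, then print each truncated name that occurs more than once.
grep '^>' demo_trunc.fas | cut -c 2-9 | sort | uniq -d
```

Any name printed here is shared by at least two sequences at that cutoff; no output means the truncated names are still unique.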

ADD REPLY • link 3.1 years ago by GenoMax 141k

0

My data did have this "ID problem". I checked and corrected it; however, the predictor keeps sending the same message: input sequence names are not unique

This is strange!

ADD REPLY • link 3.1 years ago by mauricio.1313 • 0

score 2 · Accepted Answer · 2021-03-19

however, I checked my data and this is not the case.

Sequence parsers usually don't throw this error unless there is a problem. You can test this yourself, but in my experience computers tend to be more careful in checking these things than people.

Here is what you can do, assuming your FASTA file is called sequences.fas:

grep ">" sequences.fas | awk '{print $1}' | wc -l
grep ">" sequences.fas | awk '{print $1}' | sort -u | wc -l

These two commands will print two numbers: the total number of sequence names and the number of distinct names. If the first number is larger than the second, your sequence names are not unique.
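If the counts differ, a small variation on the same pipeline can show which names are repeated. A sketch, using a made-up demo file (demo_dups.fas) in place of sequences.fas:

```shell
# Demo input: "seqA" appears twice once everything after the first space is ignored.
printf '>seqA first copy\nMKT\n>seqB\nMNS\n>seqA second copy\nMKV\n' > demo_dups.fas

# Keep the first whitespace-delimited field of each header, sort the names,
# and print only those that occur more than once.
grep '>' demo_dups.fas | awk '{print $1}' | sort | uniq -d
```

Here `uniq -d` prints one line per duplicated name, so an empty result means every name is unique up to the first whitespace.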

If the numbers were equal in the previous exercise, try pasting this line:

for i in {3..30}; do grep ">" sequences.fas | cut -c 1-$i | wc -l && grep ">" sequences.fas | cut -c 1-$i | sort -u | wc -l && echo "" ; done

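This will print a series of number pairs, separated by empty lines: for each cutoff from 3 to 30 characters (counting the leading ">"), the total number of headers followed by the number of distinct headers after truncation to that length. If at any point the two numbers in a pair are not identical, your sequence names are not unique at that cutoff.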
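Because the one-liner prints a pair for every cutoff, it can be easier to report only the cutoffs where names collide. A more verbose sketch of the same idea, run on a made-up demo file (demo_loop.fas) instead of sequences.fas:

```shell
# Demo input: headers that are identical up to their last character.
printf '>Protein1_Version_1\nTYEHSTSU\n>Protein1_Version_2\nHKHKEYSI\n' > demo_loop.fas

total=$(grep -c '^>' demo_loop.fas)   # number of header lines

for i in $(seq 3 30); do
  # Distinct headers when each is cut to its first $i characters (">" included).
  # tr strips the padding that some wc implementations add.
  unique=$(grep '^>' demo_loop.fas | cut -c 1-"$i" | sort -u | wc -l | tr -d ' ')
  if [ "$unique" -lt "$total" ]; then
    echo "collision at $i characters: $unique unique of $total"
  fi
done
```

For the demo file this reports collisions at every cutoff from 3 through 18 and nothing from 19 onward, since the two headers first differ at character 19.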