NetNGlyc: what does "input sequence names are not unique" mean?
3.1 years ago

Hi!

I'm working with the NetNGlyc 1.0 server to predict N-glycosylation sites in protein sequences. I checked my sequences for repeated entries and for differences in length, but neither is the case. Does anyone know what this message means:

input sequence names are not unique??

Any comment is welcome!

Thanks!

gene sequence glycosilation site-predictor • 1.3k views
3.1 years ago
GenoMax 141k

Many programs ignore everything in a FASTA header after the first whitespace. For example:

>Protein1 Version_1
TYEHSTSU
>Protein1 Version_2
HKHKEYSI

These two records would not be considered unique even though the sequences are different. Both names are truncated to Protein1, which makes them non-unique. You can replace the spaces with _ so that the names become unique.
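A minimal sketch of the underscore replacement described above, assuming a POSIX shell with awk. The function name underscore_headers is hypothetical, and whether NetNGlyc actually truncates names at whitespace is not confirmed here.

```shell
# Hypothetical helper: in FASTA header lines, drop trailing blanks and
# replace each run of spaces/tabs with a single "_".
# Sequence lines are passed through unchanged.
underscore_headers() {
    awk '/^>/ { sub(/[ \t]+$/, ""); gsub(/[ \t]+/, "_") } { print }' "$@"
}

# Demo on the two example records from this post.
printf '>Protein1 Version_1\nTYEHSTSU\n>Protein1 Version_2\nHKHKEYSI\n' | underscore_headers
```

On a real file this would be run as underscore_headers sequences.fas > fixed.fas, where both file names are placeholders.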

I am not sure that is what is happening with NetNGlyc, but it is something you can check.


This is a great explanation. However, I checked my data and this is not the case.

Anyway, thanks for the comment!


Another thing to consider is that some programs go further and use only a certain number of characters of each name. For example:

>Protein1_Version_1
TYEHSTSU
>Protein1_Version_2
HKHKEYSI

If only the first 8 characters (or fewer) are used in the NAME field, these two names become non-unique.
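To see which names collide at a given length, something like the following could be used. It is a sketch only: truncated_duplicates is a hypothetical helper, and the 8-character limit is the illustrative value from this reply, not a documented NetNGlyc setting.

```shell
# Hypothetical helper: list sequence names that collide once each name is
# cut to its first N characters (the leading ">" is not counted).
# Usage: truncated_duplicates N [file...]
truncated_duplicates() {
    n="$1"; shift
    awk -v n="$n" '/^>/ { print substr($1, 2, n) }' "$@" | sort | uniq -d
}

# Demo on the two example records from this reply; prints: Protein1
printf '>Protein1_Version_1\nTYEHSTSU\n>Protein1_Version_2\nHKHKEYSI\n' | truncated_duplicates 8
```

No output means no two names share the same first N characters.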


My data did have this "ID problem". I checked and corrected it, but the predictor keeps sending the same message: input sequence names are not unique

This is strange!

3.1 years ago
Mensur Dlakic ★ 27k

however, I checked my data and this is not the case.

Sequence parsers usually don't throw this error unless there is a problem. You can test this yourself, but in my experience computers tend to be more careful in checking these things than people.

Here is what you can do, assuming your FASTA file is called sequences.fas:

grep ">" sequences.fas | awk '{print $1}' | wc -l
grep ">" sequences.fas | awk '{print $1}' | sort -u | wc -l

These two commands will print two numbers: the total number of sequence names and the number of distinct names. If the first number is larger than the second, your sequence names are not unique.

If the numbers were equal in the previous check, try pasting this line:

for i in {3..30}; do grep ">" sequences.fas | cut -c 1-$i | wc -l && grep ">" sequences.fas | cut -c 1-$i | sort -u | wc -l && echo "" ; done

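This will print a series of number pairs, separated by empty lines. Each pair corresponds to the headers cut to their first i characters (the leading > included), for i from 3 to 30. If at any point the two numbers in a pair are not identical, your sequence names are not unique.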
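The counts above show that duplicates exist but not which names are affected. A small sketch that prints the repeated names themselves, assuming names end at the first whitespace (duplicate_names is a hypothetical helper, not part of any tool mentioned here):

```shell
# Hypothetical helper: print each sequence name (header text up to the first
# whitespace, without ">") that occurs more than once, preceded by its count.
duplicate_names() {
    awk '/^>/ { count[substr($1, 2)]++ }
         END  { for (name in count) if (count[name] > 1) print count[name], name }' "$@" |
        sort -k2
}

# Demo with three records, two of which share a name; prints: 2 Protein1
printf '>Protein1 Version_1\nTYEHSTSU\n>Protein1 Version_2\nHKHKEYSI\n>Protein2\nAAAA\n' | duplicate_names
```

On the file from this answer it would be called as duplicate_names sequences.fas; empty output means every name is unique under that assumption.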
