Sequence Database Without Splice Variants
10.8 years ago
Pappu ★ 2.1k

I am looking for a protein sequence database that contains neither fragments nor shorter versions of the same protein (splice variants with >95% identity). I also want the FASTA file to include the NCBI taxid of each species. Any suggestions on how to build this from TrEMBL or nr would be appreciated. Thanks.

10.8 years ago
Hamish ★ 3.2k

That sounds like you want something like the UniProt Reference Clusters (UniRef) databases. See http://www.uniprot.org/help/uniref.

The UniRef databases are derived using CD-HIT to merge splice variant (isoform) and fragment sequences into clusters at three levels of sequence identity:

  • UniRef100: 100% identity
  • UniRef90: 90% identity
  • UniRef50: 50% identity

For downloads of all the UniProt databases, including the UniRef databases, see http://www.uniprot.org/downloads


I actually downloaded UniRef90. It still contains entries that are annotated as fragments in UniProt.


When no full-length sequence shares the threshold level of identity used for the clustering, you will get clusters made up of fragments. Since these fragments are distinct from the available full-length sequences they are informative, and depending on your requirements you will likely want to keep them. Otherwise, since they will always have a description containing the "(fragment)" keyword, you can filter them out of the downloaded data set, either by processing the downloaded data or by using a query on UniProt.org to get only the non-fragment clusters. For example:

http://www.uniprot.org/uniref/?query=NOT+name%3A%22%28fragment%29%22+AND+identity%3A0.9
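If you would rather process the downloaded file, here is a minimal Python sketch of that filtering step. It assumes the UniRef FASTA header layout looks like ">UniRef90_P12345 Protein name n=3 Tax=Homo sapiens TaxID=9606 RepID=...", so check a few headers in your own download first. The function names are made up for illustration, and the match on "(fragment)" is case-insensitive because the capitalisation in the file may differ.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Assumed header field: "TaxID=<digits>" (verify against your download).
TAXID_RE = re.compile(r"\bTaxID=(\d+)")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def is_fragment(header: str) -> bool:
    """True if the description carries the "(fragment)" keyword."""
    return "(fragment)" in header.lower()


def taxid_of(header: str) -> Optional[str]:
    """Return the NCBI taxid from the header, or None if it is absent."""
    match = TAXID_RE.search(header)
    return match.group(1) if match else None


def filter_non_fragments(
    lines: Iterable[str],
) -> Iterator[Tuple[str, str, Optional[str]]]:
    """Yield (header, sequence, taxid) for entries not marked as fragments."""
    for header, seq in read_fasta(lines):
        if not is_fragment(header):
            yield header, seq, taxid_of(header)
```

To run it on a compressed download, you could open the file with gzip.open("uniref90.fasta.gz", "rt") (the filename here is only an example) and pass the file object to filter_non_fragments. Since the taxid is already in each header, this also covers the taxid requirement from the question.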
