SQL configuration for sequence data
4.9 years ago
dllopezr ▴ 120

Hi everyone

I'm creating a database that will have a table for protein and DNA sequences. I'm looking for advice on how to configure this table in MySQL: which data type to use (VARCHAR, TEXT, LONGTEXT), and which character encoding and collation.

Can you help me with that?

Thank you so much!

sql dna protein • 1.9k views
4.9 years ago

The difference between these character data types lies in the maximum size of the string a column can hold. VARCHAR(x) limits strings to x characters, whereas TEXT has a maximum size of 65,535 bytes and LONGTEXT 4 GB (there is also MEDIUMTEXT at 16 MB). For all these types the stored string itself takes the same space, apart from a length prefix of a few bytes: the string 'mystring' is always stored as 8 characters, but the actual disk space required in bytes depends on the encoding. The collation is the set of rules governing string comparisons (because the order of characters depends on the alphabet and language), so it matters for searches and index creation.
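To make the characters-versus-bytes distinction concrete, here is a small sketch that can be run in any MySQL session. It uses only the built-in functions CHAR_LENGTH (characters) and LENGTH (bytes); the literal 'ACGT' is just an illustrative value.

```sql
-- CHAR_LENGTH counts characters, LENGTH counts bytes.
-- The same 4-letter string is converted to several encodings
-- to see how many bytes each one needs.
SELECT
  CHAR_LENGTH(CONVERT('ACGT' USING ascii))   AS n_chars,
  LENGTH(CONVERT('ACGT' USING ascii))        AS bytes_ascii,
  LENGTH(CONVERT('ACGT' USING latin1))       AS bytes_latin1,
  LENGTH(CONVERT('ACGT' USING utf8mb4))      AS bytes_utf8mb4,
  LENGTH(CONVERT('ACGT' USING ucs2))         AS bytes_ucs2;
```

Plain ASCII letters take one byte each in ascii, latin1 and utf8mb4, and two bytes each in ucs2, so for ordinary sequence alphabets the first three encodings cost the same on disk.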

You should choose the data type and encoding that can represent all your sequences (e.g. are they going to contain funny characters?) and the collation that gives you a relevant ordering and indexing for your purpose.

For sequences without funny characters, ASCII encoding should be fine; otherwise, for most West European languages, use Latin1. If Unicode support is needed, use UTF-8 (in MySQL that means utf8mb4, since the plain utf8 charset only covers characters of up to 3 bytes). Encodings usually have an associated default collation, which should be used to avoid surprises.
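As a rough sketch of what such a table could look like, assuming the sequences contain only plain ASCII letters. The table and column names (seq_record, acc, seq_type, seq) and the column sizes are made up for illustration and are not from the original question.

```sql
-- Hypothetical layout; adjust names and sizes to your data.
CREATE TABLE seq_record (
  acc      VARCHAR(20)            NOT NULL,  -- accession / identifier
  seq_type ENUM('dna', 'protein') NOT NULL,  -- which kind of sequence
  seq      MEDIUMTEXT             NOT NULL,  -- up to 16 MB; use LONGTEXT if that is too small
  PRIMARY KEY (acc)
) ENGINE=InnoDB
  DEFAULT CHARSET=ascii;  -- default collation is ascii_general_ci (case-insensitive)

-- Check which charset and collation the table actually got.
SHOW CREATE TABLE seq_record;
```

With the default collation, 'acgt' and 'ACGT' compare as equal. If that is not what you want, a binary collation such as ascii_bin is an option, at the cost of departing from the default.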

For more info, check the MySQL documentation on character sets and collations.

4.9 years ago

From the UCSC MySQL server:

mysql> use uniProt
mysql> show create table varProtein;
+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Table      | Create Table                                                                                                                                   |
+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| varProtein | CREATE TABLE `varProtein` (
  `acc` char(12) NOT NULL,
  `val` longblob NOT NULL,
  PRIMARY KEY (`acc`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.16 sec)

mysql> select * from varProtein limit 10;
+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| acc      | val                                                                                                                                                                                                                                                                               |
+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| P48347-2 | MENEREKQVYLAKLSEQTERYDEMVEAMKKVAQLDVELTVEERNLVSVGYKNVIGARRASWRILSSIEQKEESKGNDENVKRLKNYRKRVEDELAKVCNDILSVIDKHLIPSSNAVESTVFFYKMKGDYYRYLAEFSSGAERKEAADQSLEAYKAAVAAAENGLAPTHPVRLGLALNFSVFYYEILNSPESACQLAKQAFDDAIAELDSLNEESYKDSTLIMQLLRDNLTLWTSDLNEEGDERTKGADEPQDEV                    |
| Q9S9Z8-2 | MENERAKQVYLAKLNEQAERYDEMVEAMKKVAALDVELTIEERNLLSVGYKNVIGARRASWRILSSIEQKEESKGNEQNAKRIKDYRTKVEEELSKICYDILAVIDKHLVPFATSGESTVFYYKMKGDYFRYLAEFKSGADREEAADLSLKAYEAATSSASTELSTTHPIRLGLALNFSVFYYEILNSPERACHLAKRAFDEAIAELDSLNEDSYKDSTLIMQLLRDNLTLWTSDLEEGGK                                 |
| Q9S9Z8-3 | MENERAKQVYLAKLNEQAERYDEMVEAMKKVAALDVELTIEERNLLSVGYKNVIGARRASWRILSSIEQKEESKGNEQNAKRIKDYRTKVEEELSKICYDILAVIDKHLVPFATSGESTVFYYKMKGDYFRYLAEFKSGADREEAADLSLKAYEAATSSASTELSTTHPIRLGLALNFSVFYYEILNSPERACHLAKRAFDEAIAELDSLNEDSYKDSTLIMQLLRDNLTLWTSDLEEGGEQSKGHNQQDEVNKI                   |
| P48349-2 | MAATLGRDQYVYMAKLAEQAERYEEMVQFMEQLVTGATPAEELTVEERNLLSVAYKNVIGSLRAAWRIVSSIEQKEESRKNDEHVSLVKDYRSKVESELSSVCSGILKLLDSHLIPSAGASESKVFYLKMKGDYHRYMAEFKSGDERKTAAEDTMLAYKAAQDIAAADMAPTHPIRLGLALNFSVFYYEILNSSDKACNMAKQAFEEAIAELDTLGEESYKDSTLIMQLLRDNLTLWTSDMQTNQMHHIRDIKEHVKTEITAKPCVLSYYYSM |
| P48348-2 | MATTLSRDQYVYMAKLAEQAERYEEMVQFMEQLVSGATPAGELTVEERNLLSVAYKNVIGSLRAAWRIVSSIEQKEESRKNEEHVSLVKDYRSKVETELSSICSGILRLLDSHLIPSATASESKVFYLKMKGDYHRYLAEFKSGDERKTAAEDTMIAYKAAQDVAVADLAPTHPIRLGLALNFSVFYYEILNSSEKACSMAKQAFEEAIAELDTLGEESYKDSTLIMQLLRDNLTLWTSDMQMDEA                            |
| P68250-2 | MDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLQLLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN                              |
| P31946-2 | MDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN                              |
| Q4R572-2 | MDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN                              |
(...)
+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows in set (0.18 sec)

Hi Pierre, do you have any recommendation on the use of longblob? I'm thinking of using VARCHAR(MAX) for my sequences instead of TEXT because I read that it improves memory use and retrieval, but longblob is new to me. For example, in some situations I would like to ask SQL to select the DNA sequences that are exactly the same.


Hi Pierre, do you have any recommendation on the use of longblob?

It depends on your needs. See https://dev.mysql.com/doc/refman/8.0/en/storage-requirements.html


BLOB data types are normally used to store binary data (i.e. byte strings, e.g. images). They are associated with the binary encoding and collation, i.e. comparisons and searches are based on the raw numerical values of the bytes. There's nothing to gain by storing sequences as BLOB in MySQL. Relational database engines are optimized for fast retrieval based on indexes, so the first step is to properly index the columns that are going to be used for searching. Unless there are requirements you're not mentioning, it looks like you're trying to over-optimize this.
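For the "select the sequences that are exactly the same" case, one possible approach (a sketch, not the only way) is to index a fixed-length hash of the sequence, since a TEXT column can only be indexed on a prefix. Everything named here (dna_seq, seq_sha1, idx_seq_sha1, the accessions and the toy sequences) is hypothetical.

```sql
-- Hypothetical table: the hash column is short and fully indexable.
CREATE TABLE dna_seq (
  acc      VARCHAR(20) NOT NULL,
  seq      MEDIUMTEXT  NOT NULL,
  seq_sha1 CHAR(40)    NOT NULL,   -- filled with SHA1(seq) at insert time
  PRIMARY KEY (acc),
  KEY idx_seq_sha1 (seq_sha1)
) ENGINE=InnoDB DEFAULT CHARSET=ascii;

-- Toy data: s1 and s3 carry the same sequence.
INSERT INTO dna_seq (acc, seq, seq_sha1) VALUES
  ('s1', 'ACGTACGT', SHA1('ACGTACGT')),
  ('s2', 'ACGTTTTT', SHA1('ACGTTTTT')),
  ('s3', 'ACGTACGT', SHA1('ACGTACGT'));

-- Groups of accessions that share an identical sequence.
SELECT seq_sha1,
       COUNT(*)                       AS n_copies,
       GROUP_CONCAT(acc ORDER BY acc) AS accessions
FROM dna_seq
GROUP BY seq_sha1
HAVING COUNT(*) > 1;

-- Look up one sequence: the index narrows the search,
-- the second condition rules out a hash collision.
SELECT acc
FROM dna_seq
WHERE seq_sha1 = SHA1('ACGTACGT')
  AND seq = 'ACGTACGT';
```

SHA1 is case-sensitive, so this assumes sequences are stored in a consistent case (e.g. all upper case) before hashing.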
