Question: SQL configuration for sequence data
0
gravatar for dllopezr
11 months ago by
dllopezr60
dllopezr60 wrote:

Hi everyone

I'm creating a database that will have a table for protein and dna sequence. I'm looking for advices for the config of this table in mysql in terms of the data type use (VARCHAR, TEXT, LONGTEXT) as well as the codification and collation for this table.

Can you help me with that?

Thank you so much!

dna sql protein • 282 views
ADD COMMENTlink modified 11 months ago by Pierre Lindenbaum128k • written 11 months ago by dllopezr60
3
gravatar for Jean-Karim Heriche
11 months ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche22k wrote:

The difference between the mentioned character data types lies in the maximum size of a string one can store in the table. VARCHAR(x) limits the strings to x characters whereas TEXT has a max size of 65535 bytes and the max for LONGTEXT is 4 GB (there's also MEDIUMTEXT at 16MB). Note that for all these types, the storage space occupied by a string is the same, e.g. the string 'mystring' is always stored as 8 characters but the actual disk space required in bytes depends on the encoding. The collation is the set of rules governing string comparisons (because the order of characters is alphabet/language dependent). The collation matters for searches and index creation.

You should choose the data type and encoding that can represent all your sequences (e.g. are they going to contain funny characters?) and the collation that gives you a relevant ordering and indexing for your purpose.

For sequences without funny characters, ASCII encoding should be fine, otherwise for most West European languages, Latin1. If unicode support is needed, use UTF8. Usually encodings have an associated default collation which should be used to avoid surprises.

For more info, check the MySQL documentation on character sets and collations.

ADD COMMENTlink modified 11 months ago • written 11 months ago by Jean-Karim Heriche22k
1
gravatar for Pierre Lindenbaum
11 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum128k wrote:

from mysql ucsc:

mysql> use uniProt
mysql> show create table varProtein;
+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Table      | Create Table                                                                                                                                   |
+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| varProtein | CREATE TABLE `varProtein` (
  `acc` char(12) NOT NULL,
  `val` longblob NOT NULL,
  PRIMARY KEY (`acc`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.16 sec)

mysql> select * from varProtein limit 10;
+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| acc      | val                                                                                                                                                                                                                                                                               |
+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| P48347-2 | MENEREKQVYLAKLSEQTERYDEMVEAMKKVAQLDVELTVEERNLVSVGYKNVIGARRASWRILSSIEQKEESKGNDENVKRLKNYRKRVEDELAKVCNDILSVIDKHLIPSSNAVESTVFFYKMKGDYYRYLAEFSSGAERKEAADQSLEAYKAAVAAAENGLAPTHPVRLGLALNFSVFYYEILNSPESACQLAKQAFDDAIAELDSLNEESYKDSTLIMQLLRDNLTLWTSDLNEEGDERTKGADEPQDEV                    |
| Q9S9Z8-2 | MENERAKQVYLAKLNEQAERYDEMVEAMKKVAALDVELTIEERNLLSVGYKNVIGARRASWRILSSIEQKEESKGNEQNAKRIKDYRTKVEEELSKICYDILAVIDKHLVPFATSGESTVFYYKMKGDYFRYLAEFKSGADREEAADLSLKAYEAATSSASTELSTTHPIRLGLALNFSVFYYEILNSPERACHLAKRAFDEAIAELDSLNEDSYKDSTLIMQLLRDNLTLWTSDLEEGGK                                 |
| Q9S9Z8-3 | MENERAKQVYLAKLNEQAERYDEMVEAMKKVAALDVELTIEERNLLSVGYKNVIGARRASWRILSSIEQKEESKGNEQNAKRIKDYRTKVEEELSKICYDILAVIDKHLVPFATSGESTVFYYKMKGDYFRYLAEFKSGADREEAADLSLKAYEAATSSASTELSTTHPIRLGLALNFSVFYYEILNSPERACHLAKRAFDEAIAELDSLNEDSYKDSTLIMQLLRDNLTLWTSDLEEGGEQSKGHNQQDEVNKI                   |
| P48349-2 | MAATLGRDQYVYMAKLAEQAERYEEMVQFMEQLVTGATPAEELTVEERNLLSVAYKNVIGSLRAAWRIVSSIEQKEESRKNDEHVSLVKDYRSKVESELSSVCSGILKLLDSHLIPSAGASESKVFYLKMKGDYHRYMAEFKSGDERKTAAEDTMLAYKAAQDIAAADMAPTHPIRLGLALNFSVFYYEILNSSDKACNMAKQAFEEAIAELDTLGEESYKDSTLIMQLLRDNLTLWTSDMQTNQMHHIRDIKEHVKTEITAKPCVLSYYYSM |
| P48348-2 | MATTLSRDQYVYMAKLAEQAERYEEMVQFMEQLVSGATPAGELTVEERNLLSVAYKNVIGSLRAAWRIVSSIEQKEESRKNEEHVSLVKDYRSKVETELSSICSGILRLLDSHLIPSATASESKVFYLKMKGDYHRYLAEFKSGDERKTAAEDTMIAYKAAQDVAVADLAPTHPIRLGLALNFSVFYYEILNSSEKACSMAKQAFEEAIAELDTLGEESYKDSTLIMQLLRDNLTLWTSDMQMDEA                            |
| P68250-2 | MDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLQLLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN                              |
| P31946-2 | MDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN                              |
| Q4R572-2 | MDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN                              |
(...)
+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows in set (0.18 sec)
ADD COMMENTlink written 11 months ago by Pierre Lindenbaum128k

Hi Pierre, do you have any recomendation for the use of longblob? I'm thinking in use VARCHAR(MAX) for my sequence instead of text becuause i read this have improvements in memory and retrieve, but longblob is new to me. For example, In some situations, I would like to tell SQL that select the 'DNA' sequences that are exactly the same.

ADD REPLYlink modified 11 months ago • written 11 months ago by dllopezr60
1

Hi Pierre, do you have any recomendation for the use of longblob?

depends of your needs. https://dev.mysql.com/doc/refman/8.0/en/storage-requirements.html

ADD REPLYlink written 11 months ago by Pierre Lindenbaum128k
1

BLOB data types are normally used to store binary data (i.e. byte strings, e.g. images). They are associated with the binary encoding and collation, i.e. comparisons/searches are based on the raw numerical values of the bytes. There's nothing to gain by storing sequences as BLOB in MySQL. Relational databases engines are optimized for fast retrieval based on the use of indexes. So the first step is to index properly the columns that are going to be used for searching. Unless there are requirements you're not mentioning, it looks like you're trying to over-optimize this.

ADD REPLYlink written 11 months ago by Jean-Karim Heriche22k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1124 users visited in the last hour