Question: SQL configuration for sequence data
0
gravatar for dllopezr
26 days ago by
dllopezr40
dllopezr40 wrote:

Hi everyone

I'm creating a database that will have a table for protein and dna sequence. I'm looking for advices for the config of this table in mysql in terms of the data type use (VARCHAR, TEXT, LONGTEXT) as well as the codification and collation for this table.

Can you help me with that?

Thank you so much!

dna sql protein • 141 views
ADD COMMENTlink modified 26 days ago by Pierre Lindenbaum121k • written 26 days ago by dllopezr40
3
gravatar for Jean-Karim Heriche
26 days ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche20k wrote:

The difference between the mentioned character data types lies in the maximum size of a string one can store in the table. VARCHAR(x) limits the strings to x characters whereas TEXT has a max size of 65535 bytes and the max for LONGTEXT is 4 GB (there's also MEDIUMTEXT at 16MB). Note that for all these types, the storage space occupied by a string is the same, e.g. the string 'mystring' is always stored as 8 characters but the actual disk space required in bytes depends on the encoding. The collation is the set of rules governing string comparisons (because the order of characters is alphabet/language dependent). The collation matters for searches and index creation.

You should choose the data type and encoding that can represent all your sequences (e.g. are they going to contain funny characters?) and the collation that gives you a relevant ordering and indexing for your purpose.

For sequences without funny characters, ASCII encoding should be fine, otherwise for most West European languages, Latin1. If unicode support is needed, use UTF8. Usually encodings have an associated default collation which should be used to avoid surprises.

For more info, check the MySQL documentation on character sets and collations.

ADD COMMENTlink modified 26 days ago • written 26 days ago by Jean-Karim Heriche20k
1
gravatar for Pierre Lindenbaum
26 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

from mysql ucsc:

mysql> use uniProt
mysql> show create table varProtein;
+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Table      | Create Table                                                                                                                                   |
+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| varProtein | CREATE TABLE `varProtein` (
  `acc` char(12) NOT NULL,
  `val` longblob NOT NULL,
  PRIMARY KEY (`acc`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.16 sec)

mysql> select * from varProtein limit 10;
+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| acc      | val                                                                                                                                                                                                                                                                               |
+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| P48347-2 | MENEREKQVYLAKLSEQTERYDEMVEAMKKVAQLDVELTVEERNLVSVGYKNVIGARRASWRILSSIEQKEESKGNDENVKRLKNYRKRVEDELAKVCNDILSVIDKHLIPSSNAVESTVFFYKMKGDYYRYLAEFSSGAERKEAADQSLEAYKAAVAAAENGLAPTHPVRLGLALNFSVFYYEILNSPESACQLAKQAFDDAIAELDSLNEESYKDSTLIMQLLRDNLTLWTSDLNEEGDERTKGADEPQDEV                    |
| Q9S9Z8-2 | MENERAKQVYLAKLNEQAERYDEMVEAMKKVAALDVELTIEERNLLSVGYKNVIGARRASWRILSSIEQKEESKGNEQNAKRIKDYRTKVEEELSKICYDILAVIDKHLVPFATSGESTVFYYKMKGDYFRYLAEFKSGADREEAADLSLKAYEAATSSASTELSTTHPIRLGLALNFSVFYYEILNSPERACHLAKRAFDEAIAELDSLNEDSYKDSTLIMQLLRDNLTLWTSDLEEGGK                                 |
| Q9S9Z8-3 | MENERAKQVYLAKLNEQAERYDEMVEAMKKVAALDVELTIEERNLLSVGYKNVIGARRASWRILSSIEQKEESKGNEQNAKRIKDYRTKVEEELSKICYDILAVIDKHLVPFATSGESTVFYYKMKGDYFRYLAEFKSGADREEAADLSLKAYEAATSSASTELSTTHPIRLGLALNFSVFYYEILNSPERACHLAKRAFDEAIAELDSLNEDSYKDSTLIMQLLRDNLTLWTSDLEEGGEQSKGHNQQDEVNKI                   |
| P48349-2 | MAATLGRDQYVYMAKLAEQAERYEEMVQFMEQLVTGATPAEELTVEERNLLSVAYKNVIGSLRAAWRIVSSIEQKEESRKNDEHVSLVKDYRSKVESELSSVCSGILKLLDSHLIPSAGASESKVFYLKMKGDYHRYMAEFKSGDERKTAAEDTMLAYKAAQDIAAADMAPTHPIRLGLALNFSVFYYEILNSSDKACNMAKQAFEEAIAELDTLGEESYKDSTLIMQLLRDNLTLWTSDMQTNQMHHIRDIKEHVKTEITAKPCVLSYYYSM |
| P48348-2 | MATTLSRDQYVYMAKLAEQAERYEEMVQFMEQLVSGATPAGELTVEERNLLSVAYKNVIGSLRAAWRIVSSIEQKEESRKNEEHVSLVKDYRSKVETELSSICSGILRLLDSHLIPSATASESKVFYLKMKGDYHRYLAEFKSGDERKTAAEDTMIAYKAAQDVAVADLAPTHPIRLGLALNFSVFYYEILNSSEKACSMAKQAFEEAIAELDTLGEESYKDSTLIMQLLRDNLTLWTSDMQMDEA                            |
| P68250-2 | MDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLQLLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN                              |
| P31946-2 | MDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN                              |
| Q4R572-2 | MDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN                              |
(...)
+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows in set (0.18 sec)
ADD COMMENTlink written 26 days ago by Pierre Lindenbaum121k

Hi Pierre, do you have any recomendation for the use of longblob? I'm thinking in use VARCHAR(MAX) for my sequence instead of text becuause i read this have improvements in memory and retrieve, but longblob is new to me. For example, In some situations, I would like to tell SQL that select the 'DNA' sequences that are exactly the same.

ADD REPLYlink modified 26 days ago • written 26 days ago by dllopezr40
1

Hi Pierre, do you have any recomendation for the use of longblob?

depends of your needs. https://dev.mysql.com/doc/refman/8.0/en/storage-requirements.html

ADD REPLYlink written 25 days ago by Pierre Lindenbaum121k
1

BLOB data types are normally used to store binary data (i.e. byte strings, e.g. images). They are associated with the binary encoding and collation, i.e. comparisons/searches are based on the raw numerical values of the bytes. There's nothing to gain by storing sequences as BLOB in MySQL. Relational databases engines are optimized for fast retrieval based on the use of indexes. So the first step is to index properly the columns that are going to be used for searching. Unless there are requirements you're not mentioning, it looks like you're trying to over-optimize this.

ADD REPLYlink written 25 days ago by Jean-Karim Heriche20k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 928 users visited in the last hour