I Work A Lot With Table-Like Structures, Should I Learn SQL?
9.2 years ago
enricoferrero ▴ 880

My typical workflow involves comma- or tab-separated flat text files that I process with Perl and/or R to create data frame/table-like structures, which I then usually export back to text files.

Would learning SQL be useful in such a scenario? If yes, what flavour? What advantages do SQL tables have that, for example, R data frames do not?

Thanks!

r perl mysql
9.2 years ago
Raygozak ★ 1.4k

I would say it would help you a lot. While this can be done in R with data frames, once you load the data into a database it is always there for you, and in my opinion the SQL language is easier. Also, instead of dealing with thousands of files and their filenames, you import everything into databases and tables. True, there can also be thousands of tables and naming can still be a problem, but you have everything in one place. Most if not all operations in a database are optimized for speed and data integrity, so you get that for free even if you're not using huge tables; and if you are, that's a great benefit.

You can export any query to a text file if you need to, and you can recreate that file a million times without having to remember which disk you put it on. I prefer MySQL, which is just a personal choice, but you could look at benchmarks if you want.

Those are my two cents.

9.2 years ago
zx8754 11k

It very much depends on the type of files you work with. Are they always the same shape and form, i.e. are the input and output files to and from your R/Perl scripts always the same? If yes, it may make sense to keep them in a database (DB). In bioinformatics, data formats evolve fast; you don't want to be in a situation where, just as you finally have a stable DB, something new comes along that requires the DB to be rebuilt from scratch.

From personal experience, almost all genetic analysis is done outside SQL. So I prefer to keep data in flat text files, then call up R/bash scripts to manipulate it. On rare occasions I use a MySQL connection to the UCSC tables.

If you are already familiar with programming and set theory, SQL should be easy to learn. If you are familiar with R, there is the sqldf package, which treats your data frames as tables so you can run SQL queries on them.
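As a minimal illustration of the same idea (running SQL over in-memory tabular data, as sqldf does with R data frames), here is a sketch using Python's stdlib sqlite3 module; the table and data are invented for the example:

```python
import sqlite3

# Made-up tabular data standing in for a data frame
rows = [("BRCA2", 13), ("TP53", 17), ("BRCA1", 17)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE genes (symbol TEXT, chrom INTEGER)")
con.executemany("INSERT INTO genes VALUES (?, ?)", rows)

# A SQL query instead of data-frame subsetting
result = con.execute(
    "SELECT symbol FROM genes WHERE chrom = 17 ORDER BY symbol"
).fetchall()
print(result)  # [('BRCA1',), ('TP53',)]
con.close()
```

In R with sqldf the equivalent would be a single `sqldf("SELECT ...")` call on the data frame itself.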

9.2 years ago

Relational databases are great if you need to combine information from different tables (or data frames, in R lingo), either when you want to retrieve information from them (SELECT queries) or when you need to write results into them (INSERTs).

Another point is how many inserts and deletes you run on your tables. If the tables change frequently in both directions, databases handle this well. And if other parties need to query your data and you want them all to query a single version of it, having it in a database is also useful.

I would say MySQL is the best-supported open-source relational database, though others may be catching up quickly. The RMySQL package is especially useful if you want to interact with your data from R.

9.2 years ago

Yes, if you work with tab-separated files, knowing some basic SQL and using SQLite3 will boost your productivity. SQLite does not require any server maintenance and is the equivalent of using indexed flat files. If you don't have experience with databases and only want to cut your teeth on SQL, I strongly recommend starting with SQLite3, unless someone else is going to maintain a MySQL server for you.

I am in your shoes every day, and I usually do everything with Unix pipes, Perl and R. But for big chip manifest data, or for annotating features by joining information from different data files that you need to query very often, putting everything in a database saves you a lot of coding. It also saves you time, because the SQL database acts as an indexed flat file, so searches on indexed columns are very quick even with datasets of millions of rows. You only need to create the table schema (take the headers, give each column a type, and set some indexes to speed up the searches), do the standard import from tab or CSV files (ultra fast, thousands of rows per second), and then run all the 'select a.x, a.y, a.w, b.z from a, b where a.x=b.x and a.y=b.y' queries that you need. I usually call these SQL commands directly from the shell and redirect them to files, or pipe the results to other programs. You can also call them directly from R or Perl with system calls, or through the DBI interface of your programming language.
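The workflow described above (create a schema with indexes, import the tab-separated data, then run the join) can be sketched with Python's stdlib sqlite3 module; the table names, columns and values here are made up for illustration:

```python
import csv
import io
import sqlite3

# Invented tab-separated inputs standing in for real flat files
probes_tsv = "probe\tgene\np1\tBRCA1\np2\tTP53\n"
expr_tsv = "probe\tvalue\np1\t2.5\np2\t1.0\n"

con = sqlite3.connect(":memory:")

# 1) Table schema: column types plus indexes to speed up the join
con.execute("CREATE TABLE probes (probe TEXT, gene TEXT)")
con.execute("CREATE TABLE expr (probe TEXT, value REAL)")
con.execute("CREATE INDEX idx_probes ON probes (probe)")
con.execute("CREATE INDEX idx_expr ON expr (probe)")

# 2) Standard bulk import from the tab-separated data
for table, text in (("probes", probes_tsv), ("expr", expr_tsv)):
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header row
    con.executemany(f"INSERT INTO {table} VALUES (?, ?)", reader)

# 3) The 'select ... from a, b where a.x=b.x' style join
joined = con.execute(
    "SELECT p.gene, e.value FROM probes p, expr e "
    "WHERE p.probe = e.probe ORDER BY p.gene"
).fetchall()
print(joined)  # [('BRCA1', 2.5), ('TP53', 1.0)]
```

From the shell, the same three steps are `sqlite3 mydb.db` plus `.import`, with the SELECT redirected to a file or piped onward.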

My scenario for using SQLite is when I need to read a file many times, for example gene coordinates or other data downloaded from BioMart, together with some sets of features that I need to intersect. I use vcftools and bedtools when the data is in BED, GFF or VCF, but when the data is in other formats, or when it is a manifest file that I want to intersect, it is easier to import it into SQLite and run a few selects. It would take you less than one day of reading/googling about importing data and joining tables to do intersects with SQLite. YMMV.
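As a rough sketch of what such an intersect looks like as a single SQL query (via Python's stdlib sqlite3; all table names and coordinates are invented for the example):

```python
import sqlite3

# Invented interval sets: (name, chromosome, start, stop)
genes = [("geneA", "chr1", 100, 200), ("geneB", "chr2", 300, 400)]
peaks = [("peak1", "chr1", 150, 160), ("peak2", "chr2", 500, 600)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE genes (name TEXT, chrom TEXT, start INT, stop INT)")
con.execute("CREATE TABLE peaks (name TEXT, chrom TEXT, start INT, stop INT)")
con.executemany("INSERT INTO genes VALUES (?, ?, ?, ?)", genes)
con.executemany("INSERT INTO peaks VALUES (?, ?, ?, ?)", peaks)

# Two intervals overlap exactly when each starts before the other ends
hits = con.execute(
    "SELECT g.name, p.name FROM genes g, peaks p "
    "WHERE g.chrom = p.chrom AND g.start <= p.stop AND p.start <= g.stop"
).fetchall()
print(hits)  # [('geneA', 'peak1')]
```

Indexing the chrom/start columns keeps this fast even on tables with millions of rows.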

9.2 years ago
Josh Herr 5.7k

I am far from a MySQL master, but I found it easy to quickly pick up the commands to set up, start, stop, and modify existing databases with MySQL. Many programs use MySQL as a back-end database; I do a fair amount of synteny work, and SyMAP uses MySQL.

I would recommend Ben Forta's book, MySQL Crash Course, which quickly got me up to speed with MySQL basics.

9.2 years ago
Richard Smith ▴ 400

If you want to ask questions of the data in the file, then loading it into a database table could be a big help. Once you're familiar with SQL queries, it becomes very simple to, e.g., search for values, compute sums or averages, or find all rows with this but not that. I don't use R, though, so I can't say how it compares.
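As a small illustration (with made-up data, via Python's stdlib sqlite3) of how compact such questions become once the data is in a table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measurements (sample TEXT, value REAL)")
con.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("s1", 2.0), ("s2", 4.0), ("s3", 9.0)],  # invented values
)

# Sums and averages are one-liners once the data is in a table
avg, total = con.execute(
    "SELECT AVG(value), SUM(value) FROM measurements"
).fetchone()
print(avg, total)  # 5.0 15.0
```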

If you are concerned more with queries on a table than with long-term storage, you could start with SQLite, which is really simple to use: a database is stored in a single file, and it comes pre-installed on Macs and many Linux distributions. If you want to combine large amounts of data from many different sources, you might be better off with MySQL or Postgres.

5.0 years ago

Learning SQL is absolutely essential these days.

For someone starting out to build database/SQL skills, I find the following of great use:

(1) Rashid Khan's course -- https://goo.gl/h1cmb4 -- "SQL Server 101: Microsoft SQL Server for Absolute Beginners" -- a deeply discounted video course, for the price of a lunch, with less than 4 hours of extremely concise and clear video lectures :)

(2) Khan Academy -- https://www.khanacademy.org/ -- free online courses covering a wide range of disciplines, including an Intro to SQL course


Why should SQL be essential these days? For many biological data types, I really don't think SQL is the best way to store data anyway. So I honestly disagree that SQL is an absolutely essential skill in bioinformatics. The last time I touched SQL was about 2.5 years ago.

9.2 years ago
always_learning ★ 1.1k

MySQL is very good when you have a stable structure, but in bioinformatics we can't say much about stability! In any case, learning it is a good experience either way! :)

9.2 years ago
Nims ▴ 90

I think the question can be divided into two parts: 1) Is SQL advantageous? 2) Do you need to learn the ins and outs of SQL? My answers:

  1. Yes, it is advantageous. It is based on a relational way of storing data rather than flat-file storage, with data analytics and consistent storage as added features. SQL queries can save a lot of time when doing analytical work.
  2. It depends. These days most SQL systems have GUI-based editors and IDEs, and most of the work can be accomplished from there. But if you really want to dig deep, you can learn it -- it's easy.
