Question

How to create a DNA database that can be analyzed with R?

0

4.5 years ago

Gautier • 0

Hi, I'm not a computer scientist and I have only basic knowledge of bioinformatics, but I would like to create a private database with all my DNA sequences obtained through next-generation sequencing. The idea is to be able to process all my sequences through RStudio. The database should be able to hold all of my sequences, which means a library of 400 samples. Each sample is constituted of 200 000 rows and 350 columns.

So how can I create such a database that is easy to manage and that I can query and analyse with R? Thank you in advance for your help.

sequencing DNA Database R NGS • 1.9k views

updated 4.5 years ago by Pierre Lindenbaum 161k • written 4.5 years ago by Gautier • 0

1

Any reason you want to go through the trouble to create a new data/file format and not use existing bioinformatics software and formats?

4.5 years ago by WouterDeCoster 47k

0

What kind of software would you recommend? I was wondering what would be the easiest way to have all my data structured in a single database that could be processed with R.

4.5 years ago by Gautier • 0

0

What do you want to achieve?

4.5 years ago by WouterDeCoster 47k

0

So, I'll have to:

Find similarities between the sequences of my samples.
Calculate frequencies, indices, and enrichment.
Plot the results.
Run statistical analyses.

All of this will be based on the sequences of my samples.

4.5 years ago by Gautier • 0

0

Store the data in a database, e.g. SQLite; you can then import chunks of the data using the sqldf package.
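
A minimal sketch of the chunked-import idea, assuming a hypothetical table `reads` with `sample_id` and `sequence` columns. It uses the DBI and RSQLite packages rather than sqldf, and the toy data and temporary file only stand in for real samples:

```r
library(DBI)

# Hypothetical toy data: one row per sequence per sample.
toy <- data.frame(
  sample_id = rep(c("S1", "S2"), each = 5),
  sequence  = c("ACGT", "TTGA", "GGCA", "ACGT", "CCAT",
                "ACGT", "GGCA", "TTTT", "CGCG", "ACGT"),
  stringsAsFactors = FALSE
)

db_path <- tempfile(fileext = ".sqlite")  # stand-in for a permanent file path
con <- dbConnect(RSQLite::SQLite(), db_path)
dbWriteTable(con, "reads", toy)

# Pull the rows of one sample in chunks instead of loading everything at once.
res <- dbSendQuery(con, "SELECT * FROM reads WHERE sample_id = ?",
                   params = list("S1"))
n_rows <- 0
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 2)  # 2 rows per chunk here; far larger on real data
  n_rows <- n_rows + nrow(chunk)  # replace with the real per-chunk analysis
}
dbClearResult(res)
dbDisconnect(con)
```

Each `chunk` is an ordinary data frame, so the usual R statistics and plotting functions apply to it directly.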

But I'd rather look for existing solutions (including non R solutions).

What do the rows and columns represent, and what kind of data is it? If the files are in a standard format, there may be no need for a database: you can use fast reading and writing to access the data directly from the files, see data.table::fread and fwrite.
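
A minimal sketch of the file-based route, assuming one CSV file per sample in a single directory. The directory, file names, and the `sequence` and `n_reads` columns are hypothetical:

```r
library(data.table)

# Hypothetical layout: one delimited file per sample, named after the sample.
data_dir <- file.path(tempdir(), "samples_demo")
dir.create(data_dir, showWarnings = FALSE)
fwrite(data.table(sequence = c("ACGT", "TTGA"), n_reads = c(10L, 3L)),
       file.path(data_dir, "S1.csv"))
fwrite(data.table(sequence = c("ACGT", "GGCA"), n_reads = c(7L, 5L)),
       file.path(data_dir, "S2.csv"))

files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
names(files) <- sub("\\.csv$", "", basename(files))

# Read every file and stack them; idcol records which sample each row came from.
all_samples <- rbindlist(lapply(files, fread), idcol = "sample_id")

# Sequences that occur in more than one sample.
shared <- all_samples[, .(n_samples = uniqueN(sample_id)), by = sequence][n_samples > 1]
```

With 350 columns per file, passing `select =` to `fread` to load only the columns an analysis needs may keep memory use manageable.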

4.5 years ago by zx8754 11k

0

I heard that SQL was not the best base for a dynamic database (I will have to upload around 10 new samples per week). Moreover, I'm not familiar with SQL.

I could keep each file without any database, but with 400 samples I believe that 400 files will not suit my analysis well. I also need to look for similarities between files without knowing in advance which files to compare.

I need to perform statistical analysis and I'm familiar with R. That's why I would like the files (or database) to be easily accessible from R.

4.5 years ago by Gautier • 0

2

I heard that SQL was not the best base for a dynamic database

4.5 years ago by Pierre Lindenbaum 161k

0

Isn't that the case? As I said, I'm not against using SQL, but I would like to start with the right tool so as not to lose time.

4.5 years ago by Gautier • 0

0

Based on your description ('a library of 400 samples. Each sample is constituted of 200 000 rows and 350 columns'), I don't see anything special in your datasets. Thus, I assume any relational database will work, such as MySQL. Plain CSV files will probably work too.

4.5 years ago by shoujun.gu ▴ 350

score 1 · Answer 1 · 2019-10-08

1

4.5 years ago

Pierre Lindenbaum 161k

Use sqlite3 or any other SQL database and store your data using this SQL engine.
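
A minimal sketch of this approach from R, assuming the DBI and RSQLite packages. The table name `reads`, its columns, and the helper `add_sample` are hypothetical; the same helper could be called for each new sample as it arrives:

```r
library(DBI)

db_path <- tempfile(fileext = ".sqlite")  # replace with a permanent file path
con <- dbConnect(RSQLite::SQLite(), db_path)

# Hypothetical helper: append one sample's data frame to the shared table.
# With append = TRUE the table is created on the first call.
add_sample <- function(con, sample_id, df) {
  df$sample_id <- sample_id
  dbWriteTable(con, "reads", df, append = TRUE)
}

add_sample(con, "S1", data.frame(sequence = c("ACGT", "TTGA"), n_reads = c(10, 3)))
add_sample(con, "S2", data.frame(sequence = c("ACGT", "GGCA"), n_reads = c(7, 5)))

# An index keeps per-sequence lookups fast as the table grows.
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_reads_sequence ON reads (sequence)")

# Sequences found in at least two samples, without naming the samples in advance.
shared <- dbGetQuery(con, "
  SELECT sequence,
         COUNT(DISTINCT sample_id) AS n_samples,
         SUM(n_reads) AS total_reads
  FROM reads
  GROUP BY sequence
  HAVING COUNT(DISTINCT sample_id) >= 2
")
dbDisconnect(con)
```

The query result is a regular data frame, so only the SQL string needs to be written in SQL; everything downstream stays in R.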