Question: Process Efficiency Of Inputting Data Into A Database (For Snp Database)
0
gravatar for hudarembang
5.5 years ago by
hudarembang0 wrote:

In my research on bioinformatics relate to the process of moving data into the database.

Constraint occurs because the data to be inserted into the database is very large (approximately 1-2 million data), thus affecting the speed performance. Is there an algorithm that is able to assist the process of moving data faster. or is there any other solution?
-in this study i use mysql with php programming language.

This case is, I have 2 files:
1. output.vcf (2 MB)
with the contents of the data like this:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    /home/abrari/bioinf/jobs/39/sorting/output.bam
10    389659    .    T    C    4.13    .    DP=2;RPB=9.668049e-01;AF1=0.4999;AC1=1;DP4=0,1,0,1;MQ=42;FQ=4.93;PV4=1,1,1,1    GT:PL:GQ    0/1:32,0,35:31
10    1221790    .    A    T    17.8    .    DP=2;VDB=6.560000e-02;AF1=1;AC1=2;DP4=0,0,0,2;MQ=42;FQ=-33    GT:PL:GQ    1/1:49,6,0:10
....
....

up to 20.000 lines


2. align.bam (7GB)
with the contents of the data like this:

ACTGTACGTACGGGATT.....

to enter data from case files "output.vcf" This database has been solved, but the problem that occurs is the data per row is entered into the database has the inclusion of a long process. it actually does not include the case of entering the data from the file align.bam.
simple example is like this:
from output.vcf:

9    46607305    .    A    G    222    .    DP=90;VDB=0.000000e+00;AF1=1;AC1=2;DP4=0,0,90,0;MQ=42;FQ=-282    GT:PL:GQ    1/1:255,255,0:99


from align.bam:
.....AAAAATCCCCC.....
bold letters "T" that was a position to 46607305 (refer to file output.vcf).
and some important information will be entered into the database is processed:

- 9 //is chrom
- A //ref
- G //alt
- 46607305 //is position
- AAAAA //is 5 the position of the data before the file align.bam
- CCCCC //is 5 the position of the data after the file align.bam


and is processed to row 20,000 lines.
I believe this process is very long to get into the database. due to data entry of case files output.vcf alone is long, not to mention taking the data from the file align.bam is very large file.
is there any solution that could solve this case?

database snp • 1.5k views
ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by hudarembang0
0
gravatar for Pierre Lindenbaum
5.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

are you using mysql ? You don't show use 'how' (insert statements ? bulk upload) you're data have been inserted. How you data tables have been created ( type of int, varchar, engine etc...)

Is there an algorithm that is able to assist the process of moving data faster. or is there any other solution?

yes, use mysql load data http://dev.mysql.com/doc/refman/5.1/en/load-data.html

align.bam (7GB) with the contents of the data like this:

ACTGTACGTACGGGATT.....

it doesn't look like a BAM

Are you inserting a whole BAM into a database ?(!!!!). Why would you do that ?

ADD COMMENTlink written 5.5 years ago by Pierre Lindenbaum122k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1841 users visited in the last hour