Process Efficiency Of Inputting Data Into A Database (For Snp Database)
2
0
Entering edit mode
7.3 years ago

In my research on bioinformatics relate to the process of moving data into the database.

Constraint occurs because the data to be inserted into the database is very large (approximately 1-2 million data), thus affecting the speed performance. Is there an algorithm that is able to assist the process of moving data faster. or is there any other solution?
-in this study i use mysql with php programming language.

This case is, I have 2 files:
1. output.vcf (2 MB)
with the contents of the data like this:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    /home/abrari/bioinf/jobs/39/sorting/output.bam
10    389659    .    T    C    4.13    .    DP=2;RPB=9.668049e-01;AF1=0.4999;AC1=1;DP4=0,1,0,1;MQ=42;FQ=4.93;PV4=1,1,1,1    GT:PL:GQ    0/1:32,0,35:31
10    1221790    .    A    T    17.8    .    DP=2;VDB=6.560000e-02;AF1=1;AC1=2;DP4=0,0,0,2;MQ=42;FQ=-33    GT:PL:GQ    1/1:49,6,0:10
....
....

up to 20.000 lines


2. align.bam (7GB)
with the contents of the data like this:

ACTGTACGTACGGGATT.....


to enter data from case files "output.vcf" This database has been solved, but the problem that occurs is the data per row is entered into the database has the inclusion of a long process. it actually does not include the case of entering the data from the file align.bam.
simple example is like this:
from output.vcf:

9    46607305    .    A    G    222    .    DP=90;VDB=0.000000e+00;AF1=1;AC1=2;DP4=0,0,90,0;MQ=42;FQ=-282    GT:PL:GQ    1/1:255,255,0:99


from align.bam:
.....AAAAATCCCCC.....
bold letters "T" that was a position to 46607305 (refer to file output.vcf).
and some important information will be entered into the database is processed:

- 9 //is chrom
- A //ref
- G //alt
- 46607305 //is position
- AAAAA //is 5 the position of the data before the file align.bam
- CCCCC //is 5 the position of the data after the file align.bam


and is processed to row 20,000 lines.
I believe this process is very long to get into the database. due to data entry of case files output.vcf alone is long, not to mention taking the data from the file align.bam is very large file.
is there any solution that could solve this case?

database snp • 1.7k views
0
Entering edit mode
7.3 years ago

are you using mysql ? You don't show use 'how' (insert statements ? bulk upload) you're data have been inserted. How you data tables have been created ( type of int, varchar, engine etc...)

Is there an algorithm that is able to assist the process of moving data faster. or is there any other solution?

yes, use mysql load data http://dev.mysql.com/doc/refman/5.1/en/load-data.html

align.bam (7GB) with the contents of the data like this:

ACTGTACGTACGGGATT.....

it doesn't look like a BAM

Are you inserting a whole BAM into a database ?(!!!!). Why would you do that ?