5.6 years ago
Nicolas Rosewick
We have a bunch of WGS samples and would like to import them with GenomicsDBImport before joint genotyping. For this project we are interested in coding sequences. Is it better:
1. To use -L with the GENCODE coding sequence annotation and set --merge-input-intervals to TRUE?
2. To split the analysis and run one instance of GenomicsDBImport per chromosome (e.g. -L chr1)? My idea would be to use a job array on my local Slurm cluster (one job per chromosome). But what about the merging? Should I use the same --genomicsdb-workspace-path for all jobs then?
GATK version: 4.1
Thanks
I asked a similar question on the GATK help/discussion forum. From the answers I gathered, it looks like using discontinuous intervals is not recommended. In fact, they suggested that the smallest interval should be one whole chromosome. This avoids problems at the edges of intervals, because GATK does local assembly around each variant site. As for merging, I would merge the results at the level of the final joint-called VCFs.
Hello, can you give more details about WGS intervals? Do I need to run the GenomicsDBImport command separately for each chromosome? If yes, do I need to use a different workspace (--genomicsdb-workspace-path) for each one?
People run these steps per chromosome mainly because there are not enough computational resources to process the entire genome in one shot. If you do run them separately, I think you need to run separate commands and use a different workspace path for each.
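As a rough sketch of the "separate command, separate workspace per chromosome" idea above, the following Python builds one GenomicsDBImport command line per chromosome. The sample map file name (cohort.sample_map), the workspace root (genomicsdb) and the chr-prefixed contig names are assumptions for illustration only; the commands are printed, not executed.

```python
from pathlib import Path


def import_commands(chromosomes, sample_map, workspace_root):
    """Build one GenomicsDBImport command (as an argument list) per chromosome."""
    commands = []
    for chrom in chromosomes:
        # Each chromosome gets its own workspace directory.
        workspace = Path(workspace_root) / chrom
        commands.append([
            "gatk", "GenomicsDBImport",
            "--sample-name-map", sample_map,
            "--genomicsdb-workspace-path", str(workspace),
            "-L", chrom,
        ])
    return commands


# Assumed contig names; they must match your reference genome.
chromosomes = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
# Hypothetical file and directory names.
cmds = import_commands(chromosomes, "cohort.sample_map", "genomicsdb")

for cmd in cmds[:2]:
    print(" ".join(cmd))
```

On a Slurm job array, each array task could pick one entry of this list by its task index, so that no two jobs write to the same workspace.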
Will this also be the case for exome data? Ideally I'd like to run all chr's at once too.
A second question, what would the syntax be for the X and Y chr's - Is it chrX, chrY or X, Y?
If you do have to do them all separately, can they all be gathered up and easily studied together when joint-called using GenotypeGVCFs?
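It depends on the reference genome build you used. The names you pass to -L must match the contig names in that reference (for example, chrX and chrY in UCSC-style references such as hg38, X and Y in references such as b37).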
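One way to check which naming your reference uses is to read the first column of its FASTA index (.fai), which holds the contig names. The snippet below is a minimal sketch; the two index lines are made-up illustration data (the offsets in particular are invented), and the commented path ref.fasta.fai is hypothetical.

```python
import io


def contig_names(fai_lines):
    """Return the contig names (first tab-separated column) of a .fai index."""
    return [line.split("\t")[0] for line in fai_lines if line.strip()]


# Made-up index content for illustration: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH.
example_fai = io.StringIO(
    "chr1\t248956422\t6\t60\t61\n"
    "chrX\t156040895\t253105375\t60\t61\n"
)
names = contig_names(example_fai)
print(names)

# With a real reference you would do something like (hypothetical path):
# with open("ref.fasta.fai") as fh:
#     names = contig_names(fh)
```

If the list shows chr-prefixed names, use -L chrX; if it shows bare names, use -L X.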
Did you ever get a final answer to this?
I also have a similar question. I sliced the genomic BED file into 50 kb windows with 1 kb padding, which gave ~700 BED files; each BED file contains 90 windows. I want to run GenomicsDBImport separately for each of these interval BED files, creating a database for ~1500 WGS GVCF files and storing it in my database directory with the --genomicsdb-workspace-path option. For example, I take the chr1-0_chr1-4411000.bed file and create a database for it with --genomicsdb-workspace-path /GenomicsDBImport/my_databases/chr1-0_chr1-4411000, which creates the chr1-0_chr1-4411000 directory. At the end, I will have ~700 directories. Then I will run GenotypeGVCFs separately on each of these ~700 databases and merge all the outputs afterwards. Do you think it is possible to do it this way? Or do you have any suggestions?
Note: I ran a single interval, for example the whole of chr22, and it took a very long time to finish. I created the smaller BED files so that I can run them in parallel and reduce computation time.
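For reference, here is a small sketch of how such a windowing scheme could be generated. The exact layout is an assumption on my part: 50 kb windows that start every 49 kb (so neighbours overlap by 1 kb) happen to reproduce the file name chr1-0_chr1-4411000 for the first group of 90 windows, but the poster's actual procedure may differ. The chromosome length used here is a toy value, and the function names are hypothetical.

```python
def make_windows(chrom, chrom_length, size=50_000, step=49_000):
    """Tile a chromosome with windows of `size` bp that start every `step` bp."""
    windows = []
    start = 0
    while start < chrom_length:
        end = min(start + size, chrom_length)
        windows.append((chrom, start, end))  # BED-style: 0-based start, exclusive end
        if end == chrom_length:
            break
        start += step
    return windows


def chunk(windows, per_file=90):
    """Group consecutive windows, `per_file` windows per BED file."""
    return [windows[i:i + per_file] for i in range(0, len(windows), per_file)]


def chunk_name(group):
    """Name a group after its first start and last end, e.g. chr1-0_chr1-4411000."""
    first, last = group[0], group[-1]
    return f"{first[0]}-{first[1]}_{last[0]}-{last[2]}"


# Toy chromosome length, for illustration only.
groups = chunk(make_windows("chr1", 10_000_000))
for group in groups:
    print(chunk_name(group), len(group))
```

Note that with overlapping windows, variants in the shared 1 kb may be emitted by two neighbouring jobs, so the final merge step would probably need to handle duplicate records.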
This is the head of one of my interval files, chr1-0_chr1-4411000.bed (90 lines in total):
GenomicsDBImport commands: