5.0 years ago
Nicolas Rosewick
We have a bunch of WGS samples and would like to import them with GenomicsDBImport before joint genotyping. For this project we are interested in coding sequences. Is it better:
1. To use -L with the GENCODE coding sequence annotation and set --merge-input-intervals to true?
2. To split the analysis and run one instance of GenomicsDBImport per chromosome (e.g. -L chr1)? My idea would be to use a job array on my local Slurm cluster (one job per chromosome). But what about the merging? Should I then use the same --genomicsdb-workspace-path for all jobs?
GATK version: 4.1
Thanks
I asked a similar question on the GATK help/discussion forum. From the answers I gathered, using discontinuous intervals is not recommended. They actually suggested that the smallest interval should be one whole chromosome. This avoids problems at the edges of intervals, because GATK performs local assembly around each variant site. As for merging, I would merge the results at the level of the final joint-called VCFs.
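To make the "merge at the final VCF level" idea concrete, here is a minimal sketch that only builds and prints the command lines; it does not run GATK. It assumes one GenomicsDB workspace per chromosome under a hypothetical workspaces/ directory, a reference called ref.fasta, and hypothetical output names. MergeVcfs is used for the final step as one possible option; this was not part of the advice quoted above.

```python
# Sketch: print per-chromosome GenotypeGVCFs commands plus one MergeVcfs
# command that combines the per-chromosome VCFs into a single file.
# All paths (ref.fasta, workspaces/, vcfs/, cohort.vcf.gz) are hypothetical.

def genotype_cmd(chrom, ref="ref.fasta", ws_dir="workspaces", out_dir="vcfs"):
    """Joint-genotype one chromosome from its own GenomicsDB workspace."""
    return [
        "gatk", "GenotypeGVCFs",
        "-R", ref,
        "-V", f"gendb://{ws_dir}/{chrom}",
        "-L", chrom,
        "-O", f"{out_dir}/{chrom}.vcf.gz",
    ]

def merge_cmd(chroms, out_dir="vcfs", merged="cohort.vcf.gz"):
    """Combine the per-chromosome VCFs, listed in the order given."""
    cmd = ["gatk", "MergeVcfs"]
    for chrom in chroms:
        cmd += ["-I", f"{out_dir}/{chrom}.vcf.gz"]
    return cmd + ["-O", merged]

# Contig names assume a "chr"-prefixed reference.
chroms = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]

for chrom in chroms:
    print(" ".join(genotype_cmd(chrom)))
print(" ".join(merge_cmd(chroms)))
```

Each GenotypeGVCFs command is independent, so the per-chromosome jobs can run in parallel and only the last step needs all of them to have finished.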
Hello, can you give more details about the WGS intervals? Do I need to run the GenomicsDBImport command separately for each chromosome? If yes, do I need to use a different workspace (--genomicsdb-workspace-path) for each one?
Running these steps per chromosome is mostly done because there are not enough computational resources to run the entire genome in one shot. If you do run them separately, I think you need to run separate commands and use a different workspace path for each.
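As a rough illustration of one workspace per chromosome combined with a Slurm job array, the sketch below maps the array index to a chromosome and prints the matching GenomicsDBImport command. It assumes the array is submitted with indices 1 to 24, a sample map file called cohort.sample_map, and a workspaces/ directory; those names are hypothetical, and the script prints the command rather than running it.

```python
import os

# Contig names assume a "chr"-prefixed reference.
CHROMS = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]

def chrom_for_task(task_id):
    """Map a 1-based Slurm array index (assumed --array=1-24) to a chromosome."""
    return CHROMS[task_id - 1]

def import_cmd(chrom, sample_map="cohort.sample_map", ws_dir="workspaces"):
    """GenomicsDBImport for one chromosome, writing to its own workspace."""
    return [
        "gatk", "GenomicsDBImport",
        "--sample-name-map", sample_map,                   # hypothetical file
        "--genomicsdb-workspace-path", f"{ws_dir}/{chrom}",  # one per chromosome
        "-L", chrom,
    ]

# Slurm sets SLURM_ARRAY_TASK_ID inside array jobs; default to 1 elsewhere.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
print(" ".join(import_cmd(chrom_for_task(task_id))))
```

Because every job writes to a different workspace directory, the jobs never touch each other's output.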
Will this also be the case for exome data? Ideally I'd like to run all chromosomes at once too.
A second question: what would the syntax be for the X and Y chromosomes? Is it chrX, chrY or X, Y?
If you do have to do them all separately, can they all be gathered up and easily studied together when joint-called using GenotypeGVCFs?
It depends on which reference genome build you used. The name you pass should match the chromosome (contig) name in that reference genome.
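One way to check the naming is to look at the first column of the reference's FASTA index (.fai), which holds the contig names. The sketch below uses toy index lines with made-up lengths and offsets; the helper names are hypothetical.

```python
def contig_names(fai_lines):
    """Return contig names: the first tab-separated field of each .fai line."""
    return [line.split("\t")[0] for line in fai_lines if line.strip()]

def sex_chrom_names(names):
    """Pick out whichever X/Y naming style the reference uses."""
    wanted = {"X", "Y", "chrX", "chrY"}
    return [name for name in names if name in wanted]

# Toy .fai content: lengths and offsets are invented for illustration only.
toy_fai = [
    "chr1\t1000\t6\t60\t61",
    "chrX\t800\t1030\t60\t61",
    "chrY\t500\t1850\t60\t61",
]

print(sex_chrom_names(contig_names(toy_fai)))
```

With a real reference you would pass the open index file instead of the toy list, for example `contig_names(open("ref.fasta.fai"))`, where ref.fasta.fai is a placeholder path.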
Did you ever get a final answer to this?