Question

Tool: Converting Nebula Genomics Data to 23andMe Format

12 days ago

Guillermo • 0

Hi. I've created a bash script as a guide to convert genetic data from Nebula Genomics to 23andMe format.

This script outlines the necessary steps, each of which should be executed and reviewed before proceeding to the next.

Check it out here:

This is version 0.1, and it has not been tested. It's a starting point for the community to refine.

The primary focus is on achieving the highest quality conversion possible, and any performance improvements without compromising quality are welcome.

Your feedback and suggestions are greatly appreciated!

23andMe Nebula • 239 views

updated 11 days ago by Michael 54k • written 12 days ago by Guillermo • 0

score 1 · Answer 1 · 2024-05-07

Hi,

Thank you for your contribution, here is your free code review:

This scenario is close to ideal for a Snakemake workflow, and I think it is worth checking whether to write one.
The script should have better separation of concerns (analysis vs. installing software and dependencies).
I am very skeptical about scripts that install software via sudo and apt; note that not everyone is running Debian.
Leave the decision of how to install software to the users.
All the software you are installing is available via conda. I recommend providing the dependencies as a conda env export in a YAML file, or simply integrating that into the workflow.
Your workflow is a basic variant-calling pipeline, and there are many of these already; only the last step is specific:

 plink --file plink --recode 23 --out 23andme # this is the specific code
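To illustrate the conda point, here is a minimal sketch of shipping an environment file with the script instead of calling sudo and apt. The environment name `nebula-to-23andme`, the helper name `write_env_file`, and the package list are assumptions (the tools visible in the quoted steps, under their Bioconda package names); the real list should match whatever the script installs, ideally produced with `conda env export`.

```shell
#!/usr/bin/env bash
# Write a conda environment file instead of installing via sudo/apt.
# Assumption: environment name and package list are illustrative only.
write_env_file() {
    local out="${1:-environment.yml}"
    cat > "$out" <<'EOF'
name: nebula-to-23andme
channels:
  - conda-forge
  - bioconda
dependencies:
  - samtools
  - bcftools
  - gatk4
  - plink
EOF
}

# Usage (not run here):
#   write_env_file environment.yml
#   conda env create -f environment.yml
#   conda activate nebula-to-23andme
```

Users who prefer another installation route can then ignore the file without having to edit the analysis script.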

Provide filenames as parameters on the command line.
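A minimal sketch of what this could look like, assuming two positional FASTQ arguments and an optional reference path. The function name `parse_args`, the script name `convert.sh`, and the default `genome.fa` are hypothetical; the variable names follow the ones in the quoted script.

```shell
#!/usr/bin/env bash
# Take input files from the command line instead of hard-coding them.
# Assumption: argument order and the genome.fa default are illustrative.
parse_args() {
    if [ "$#" -lt 2 ]; then
        echo "usage: convert.sh <reads_1.fq.gz> <reads_2.fq.gz> [reference.fa]" >&2
        return 1
    fi
    nebula_fastq_1="$1"
    nebula_fastq_2="$2"
    reference="${3:-genome.fa}"
}

# Usage at the top of the script:
#   parse_args "$@" || exit 1
```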

 # Step 3: Decompress FASTQ
 gunzip -c $nebula_fastq_1 > nebula_fastq_1.fq
 gunzip -c $nebula_fastq_2 > nebula_fastq_2.fq

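This is neither recommended nor required; remove this step. Common aligners (e.g. bwa mem) read gzip-compressed FASTQ directly, so the decompressed copies only cost disk space and time.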
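As a small illustration that compressed FASTQ can be processed as a stream without writing a decompressed copy, here is a sketch that counts reads straight from the `.gz` file. The helper name `count_reads` is hypothetical and it assumes the usual four lines per FASTQ record.

```shell
#!/usr/bin/env bash
# Count reads in a gzip-compressed FASTQ without a decompressed copy on disk.
# Assumption: standard 4-lines-per-record FASTQ layout.
count_reads() {
    local lines
    lines=$(gzip -cd "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Usage:
#   count_reads "$nebula_fastq_1"
```

The same applies to alignment: if the script aligns with bwa mem, the `.fq.gz` files can be passed to it as they are.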

 # + Step 6: Generate standard VCF and gVCF
 # - Standard VCF
 samtools mpileup -uf genome.fa sorted.bam | bcftools call -mv -Ov -o variants.vcf
 # - gVCF
 gatk HaplotypeCaller -R genome.fa -I sorted.bam -O output.g.vcf.gz -ERC GVCF

It is not clear why you are running two variant callers but only using the output of the first. I'd stick with the GATK best-practices workflows or use DeepVariant. The GATK workflows include duplicate marking and base-quality score recalibration (at least for human data), as well as variant filtration steps.
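For orientation, a rough single-sample sketch of the GATK steps named above (duplicate marking, base-quality recalibration, calling). It is not the official best-practices workflow: the file names are placeholders, `known_sites.vcf.gz` stands in for a known-variants resource such as dbSNP, variant filtration is left out, and the `run` and `gatk_steps` helpers with their `DRY_RUN` switch are hypothetical additions so the commands can be reviewed before execution.

```shell
#!/usr/bin/env bash
# Sketch of GATK preprocessing and calling for one sample.
# Assumptions: placeholder file names; DRY_RUN=1 only prints the commands.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

gatk_steps() {
    local ref="$1" bam="$2" known="$3"
    # Mark PCR/optical duplicates
    run gatk MarkDuplicates -I "$bam" -O dedup.bam -M dup_metrics.txt
    # Base-quality score recalibration against known variant sites
    run gatk BaseRecalibrator -R "$ref" -I dedup.bam --known-sites "$known" -O recal.table
    run gatk ApplyBQSR -R "$ref" -I dedup.bam --bqsr-recal-file recal.table -O recal.bam
    # Call variants on the recalibrated alignments
    run gatk HaplotypeCaller -R "$ref" -I recal.bam -O variants.vcf.gz
}

# Usage (prints the four commands without running them):
#   DRY_RUN=1 gatk_steps genome.fa sorted.bam known_sites.vcf.gz
```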