Question

Mapping canine whole-genome sequencing reads to the CanFam3.1 reference genome with BWA-MEM

4.0 years ago

rtyb91 ▴ 30

Dear Community,

I am trying to detect variants in a specific breed of dog, where we hypothesize that the variants may be involved in the pathogenesis of a certain disease. With that aim, we performed whole-genome sequencing of 3 unrelated dogs with the disease, and we plan to compare these samples against dog variant data available on public servers (590 European and 172 Chinese dogs) to see whether we can pick up variants specific to our target breed.

I am running Ubuntu in a terminal on Windows 10 (16 GB RAM, quad-core i7), because my MacBook (16 GB RAM, dual-core only) may not be able to handle the workload from this data.

So, back to my question: I have tried reading the bwa manual, but I could not work out how to load the reference genome or my samples into BWA-MEM on Ubuntu. I currently have the reference genome as 5 separate files in gbff format, and 3 pairs of whole-genome sequence files (6 files in total, since the reads are paired), all in .gz format. My plan is to align the reads against the reference, then call and filter variants in GATK, using the 590 + 172 dog variant sets to detect the useful SNPs in my sample data.

I am a clinician by training, and bioinformatics is something I have only picked up since December 2019, with no background in programming. Is there a step-by-step manual, book, link, or other reference that would help? I have tried Google, but it did not help much in answering my questions.

Lastly, I know I am not supposed to ask more than one question, but I am trying my luck here. Given the limited space on my Windows 10 machine (1 TB) and the total size of the files I need to get my SNPs (1 GB for CanFam3, 250 GB of WGS data, 800 GB of European variant data, and 150 GB of Chinese variant data), I am seriously considering investing in an external SSD of 2 TB or more, with Ubuntu booted from the SSD, so that I can dedicate that external drive to my WGS studies alone. Would that be a wise decision?
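
For reference, the input sizes quoted above can be totalled with a few lines of Python. This is only a sketch: it assumes 1 TB = 1000 GB and counts the input files alone, not the alignment or variant files that the analysis itself would write.

```python
# Input sizes in GB, as quoted in the post above.
sizes_gb = {
    "CanFam3 reference": 1,
    "WGS reads": 250,
    "European variant data": 800,
    "Chinese variant data": 150,
}

total_gb = sum(sizes_gb.values())

# Assumption: 1 TB = 1000 GB (internal disk) and 2 TB = 2000 GB (external SSD).
remaining_gb = {disk_gb: disk_gb - total_gb for disk_gb in (1000, 2000)}

for disk_gb, left_gb in remaining_gb.items():
    print(f"{disk_gb} GB disk: {left_gb} GB left after {total_gb} GB of input files")
```

A negative remainder means the inputs alone do not fit on that disk.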

Thank you very much, and sorry for the inconvenience caused.

snp ngs whole genome sequencing • 1.3k views

score 1 · Answer 1 · 2020-05-03

4.0 years ago

igor 13k

I currently have the reference genome in 5 different files in gbff format

The reference should be in FASTA format.
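
As a rough sketch of how a FASTA reference and one pair of read files are handed to BWA-MEM, the Python snippet below only builds and prints the two command lines (`bwa index`, run once, then `bwa mem`). The file names, sample name, and thread count are placeholders rather than values from this thread, and the read-group string, including `PL:ILLUMINA`, is an assumption added because GATK expects read groups in its input.

```python
import shlex


def bwa_index_cmd(reference_fasta):
    """Command that builds the BWA index next to the reference FASTA (run once)."""
    return ["bwa", "index", reference_fasta]


def bwa_mem_cmd(reference_fasta, reads_1, reads_2, sample, threads=4):
    """Command that aligns one pair of (optionally gzipped) FASTQ files.

    bwa mem writes SAM to standard output, so redirect it to a file
    or pipe it into another tool when actually running it.
    """
    # Read group: a literal backslash-t is passed; bwa converts it to a tab.
    # PL:ILLUMINA is an assumption about the sequencing platform.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        "bwa", "mem",
        "-t", str(threads),
        "-R", read_group,
        reference_fasta, reads_1, reads_2,
    ]


# Hypothetical file names, for illustration only.
index_cmd = bwa_index_cmd("canFam3.fa")
align_cmd = bwa_mem_cmd("canFam3.fa", "dog1_R1.fastq.gz", "dog1_R2.fastq.gz", "dog1")

print(shlex.join(index_cmd))
print(shlex.join(align_cmd) + " > dog1.sam")
```

The printed lines can be pasted into the Ubuntu terminal once the file names are replaced with the real ones; the same `bwa mem` command is repeated for each of the three read pairs.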

Is there a step by step manual, book, link or any reference which would help

The GATK Best Practices are the best guide, and they document each step individually: https://gatk.broadinstitute.org/hc/en-us/categories/360002302312 Also check their event pages, which provide supplementary presentations, such as: https://broad.io/GATK2002

Finally, I would advise against running a WGS analysis on a single desktop machine. The process could easily take over a week per sample, and since you are doing this for the first time, you will likely need to repeat some steps.

Thanks for the tips Igor!

Now I need to find a way to convert the files to FASTA format instead.

As for running WGS on a single desktop machine: since you advise against it, may I ask for some suggestions on how I should approach this instead?

Thank you!

Reply • 4.0 years ago by rtyb91 ▴ 30

You should not need to convert files. All genomes should have a FASTA available already.
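
If it is unclear whether a downloaded file is already FASTA, the first line gives it away: FASTA records start with `>`, while GenBank flat files (gbff) start with `LOCUS`. A minimal sketch of that check in Python follows; the function name and the tiny example files are made up for illustration.

```python
import gzip
import os
import tempfile


def sniff_sequence_format(path):
    """Guess 'fasta', 'genbank', or 'unknown' from the first non-empty line."""
    # Open gzipped files transparently, based on the file extension.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("LOCUS"):
                return "genbank"
            return "unknown"
    return "unknown"


# Tiny made-up files, written only to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "example.fa.gz")
    genbank_path = os.path.join(tmp, "example.gbff")
    with gzip.open(fasta_path, "wt") as out:
        out.write(">chr1 example\nACGTACGT\n")
    with open(genbank_path, "w") as out:
        out.write("LOCUS       EXAMPLE   8 bp    DNA\n")
    fasta_guess = sniff_sequence_format(fasta_path)
    genbank_guess = sniff_sequence_format(genbank_path)

print(fasta_guess, genbank_guess)
```

Only the first non-empty line is inspected, so this is a quick sanity check rather than a full format validation.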

You can try running the analysis in the cloud. Many people use AWS. GATK has some tutorials on running it on Google Cloud.

Reply • 4.0 years ago by igor 13k

Hmm, I wonder where the FASTA files for CanFam3.1 went. I will dig them up soon!

I see, so you would recommend running the samples in the cloud instead! Would that be efficient from a dual-boot Windows machine with a quad-core CPU and 16 GB RAM?

Reply • 4.0 years ago by rtyb91 ▴ 30