Question: Variant calling on single sample Macaque WGS data - advice please
17 months ago by YaGalbi1.4k, Biocomputing, MRC Harwell Institute, Oxford, UK

Hello all,

I will soon receive WGS data for 2 Rhesus Macaques, on which I want to call variants.

I intended to follow GATK Best Practices. Creating the BAM files is fine, but I'm unsure how to proceed to call somatic mutations using GATK4 Mutect2. Mutect2 requires either a tumor/normal pair (I have 1 sample per macaque) or can be run in tumor-only mode if one has (or can create) a panel of normals (PoN). PoNs are supposed to be created from at least 40 samples; I have 2. Also, there is no known rhesus macaque variant .vcf in the public domain. I'm not sure if I should just use HaplotypeCaller or stick to Mutect2 and create a PoN from the 2 samples I have.

Any advice/directions would be very welcome.

EDIT: further details on experimental design added below

EDIT 2: Everyone was helpful in turning on the light here. My mistake was not understanding the difference between somatic and germline variant calling.

Regards, Kenneth

modified 17 months ago • written 17 months ago by YaGalbi1.4k

"Also, there is no known rhesus macaque variant .vcf in the public domain."

Ensembl has a VCF available. Also look for a corresponding gvf here.

modified 17 months ago • written 17 months ago by genomax74k

Good lord where have my research skills gone? Thank you.

Any ideas on how to proceed with variant calling? The VCF/GVF files are useful for base recalibration, but I'm not sure they can help me with the main problem here.

written 17 months ago by YaGalbi1.4k

Not much to add from my side. Someone should be along with specific insight into your question. Take a look at Strelka2 to see if that could be useful.

modified 17 months ago • written 17 months ago by genomax74k

Archived at NCBI in 2017:

ftp://ftp.ncbi.nih.gov/snp/organisms/archive/macaque_9541/VCF/
ftp://ftp.ncbi.nih.gov/snp/organisms/archive/macaque_9544/VCF/

The numbers are NCBI Taxonomy IDs: 9544 is Macaca mulatta (rhesus macaque) and 9541 is Macaca fascicularis (crab-eating macaque).
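A small sketch of how the two archive paths relate to the taxonomy IDs. The helper name `archive_vcf_url` is hypothetical, and the URL pattern is simply the one visible in the two links above; it is not guaranteed to hold for other organisms.

```python
# NCBI Taxonomy IDs that appear in the archive directory names above.
TAXID_TO_SPECIES = {
    9544: "Macaca mulatta",       # rhesus macaque
    9541: "Macaca fascicularis",  # crab-eating macaque
}

# Root of the dbSNP archive, copied from the links above.
ARCHIVE_ROOT = "ftp://ftp.ncbi.nih.gov/snp/organisms/archive"


def archive_vcf_url(taxid: int) -> str:
    """Return the archived VCF directory for one of the two macaque taxids."""
    if taxid not in TAXID_TO_SPECIES:
        raise KeyError(f"taxid {taxid} is not one of the macaque IDs listed here")
    # Pattern assumed from the two links above: macaque_<taxid>/VCF/
    return f"{ARCHIVE_ROOT}/macaque_{taxid}/VCF/"


if __name__ == "__main__":
    for taxid, species in TAXID_TO_SPECIES.items():
        print(f"{taxid}\t{species}\t{archive_vcf_url(taxid)}")
```

For rhesus data, the relevant directory is the one for taxid 9544.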

modified 17 months ago • written 17 months ago by cpad011212k

Hi, could you please point out where it shows that 9544 is mulatta sp. and 9541 is fascicularis sp.? I couldn't find detailed information about these organisms. Thank you very much!

written 9 months ago by hxlei61390

Could you elaborate on your experimental setup? You write you want to do somatic variant calling, does that mean you are sequencing Macaques with cancer? What is the aim of your analysis?

If you have only a tumor sample and not a normal sample you will not be able to distinguish germline from acquired variants.

written 17 months ago by WouterDeCoster42k

Sure, but bear in mind this is my first time doing this kind of analysis, so my terms may be off, especially my use of the word "somatic" vs. "germline". I get the difference in biology, but may be confusing them in this context.

DNA taken from blood extracted from 2 healthy rhesus macaques has been sent for whole genome sequencing (Illumina HiSeq 4000). There is no cancer or disease state at all. I have simply been asked to genotype the macaques to confirm the species (Indian or Chinese) and possibly to use the data as a deeply sequenced data set for imputation of more macaques that may be sequenced later at lower coverage.

My understanding of the steps is that for each sample separately I should:

1) Pre-process the FASTQ files to produce a BAM file (mapped, deduplicated, base recalibrated)

2) Call SNPs and indels - this is what I refer to as somatic variant calling (please correct me). HaplotypeCaller seems to be the tool that I should use as it takes a single sample, but according to GATK Best Practices it has been superseded by Mutect2, which I don't seem to have the input for (healthy, disease, and PoN). I realise there are other tools out there, but as this is my first time doing this, for my own learning, I'd rather use a well trusted/documented/published community standard tool. It doesn't really need to be the latest cream of the crop in bioinformatics innovation. I'm also aware of Kevin Blighe's suggestion, and if this gets too much I may just try that instead, but I'd rather stay with GATK Best Practices if it is reasonable to do so.

3) Genotype the sample based on the SNP + Indel sites

4) I know there are further analysis steps, but I will ask a separate question when I get to that stage.
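A minimal sketch of step 1 for one sample, written as a Python function that only builds the command lines and does not run them. The file names are hypothetical, and it assumes `bwa`, `samtools` and GATK4 are installed, the reference is already indexed for both, and the known-sites VCF (for example the Ensembl one mentioned above) is indexed. The flags are the ones I understand these tools to accept; check them against the installed versions before running anything.

```python
import shlex
from typing import List


def preprocessing_commands(sample: str, ref: str, fastq1: str, fastq2: str,
                           known_sites: str) -> List[List[str]]:
    """Build (not run) the FASTQ -> recalibrated BAM commands for one sample."""
    # bwa expects the literal two characters "\t" in the read-group string,
    # hence the raw f-string.
    read_group = rf"@RG\tID:{sample}\tSM:{sample}\tPL:ILLUMINA"
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    metrics = f"{sample}.dup_metrics.txt"
    table = f"{sample}.recal.table"
    recal_bam = f"{sample}.recal.bam"
    return [
        # Map paired-end reads and tag them with a read group.
        ["bwa", "mem", "-R", read_group, "-o", sam, ref, fastq1, fastq2],
        # Coordinate-sort the alignments.
        ["samtools", "sort", "-o", sorted_bam, sam],
        # Mark duplicate reads.
        ["gatk", "MarkDuplicates", "-I", sorted_bam, "-O", dedup_bam, "-M", metrics],
        ["samtools", "index", dedup_bam],
        # Base quality score recalibration against known variant sites.
        ["gatk", "BaseRecalibrator", "-R", ref, "-I", dedup_bam,
         "--known-sites", known_sites, "-O", table],
        ["gatk", "ApplyBQSR", "-R", ref, "-I", dedup_bam,
         "--bqsr-recal-file", table, "-O", recal_bam],
    ]


if __name__ == "__main__":
    # All paths below are placeholders.
    for cmd in preprocessing_commands("macaque1", "ref.fa", "m1_R1.fastq.gz",
                                      "m1_R2.fastq.gz", "known_sites.vcf.gz"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the commands first makes it easy to review them before passing each list to `subprocess.run`.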

Thank you guys, I really appreciate an experienced head on this.

Kenneth

modified 17 months ago • written 17 months ago by YaGalbi1.4k

If I understood correctly, then somatic variant calling (calling mutations acquired by processes such as cancer) is not what you need, and you are best off using germline approaches, for example GATK HaplotypeCaller. Just calling inherited SNPs and indels is germline variant calling.
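As a rough sketch of what the germline route could look like for the two macaques: HaplotypeCaller per sample in GVCF mode, then CombineGVCFs and GenotypeGVCFs to genotype both together. The function only builds GATK4 command lines; the BAM and output names are hypothetical, and the flags should be checked against the installed GATK version.

```python
from typing import Dict, List


def germline_calling_commands(ref: str, bams: Dict[str, str],
                              cohort: str = "cohort") -> List[List[str]]:
    """Build (not run) per-sample calling and joint genotyping commands."""
    commands: List[List[str]] = []
    gvcfs: List[str] = []
    for sample, bam in bams.items():
        gvcf = f"{sample}.g.vcf.gz"
        gvcfs.append(gvcf)
        # Per-sample germline calling, emitting a GVCF.
        commands.append(["gatk", "HaplotypeCaller", "-R", ref, "-I", bam,
                         "-O", gvcf, "-ERC", "GVCF"])

    # Merge the per-sample GVCFs into one file.
    combined = f"{cohort}.g.vcf.gz"
    combine = ["gatk", "CombineGVCFs", "-R", ref]
    for gvcf in gvcfs:
        combine += ["-V", gvcf]
    combine += ["-O", combined]
    commands.append(combine)

    # Joint genotyping across both samples.
    commands.append(["gatk", "GenotypeGVCFs", "-R", ref, "-V", combined,
                     "-O", f"{cohort}.vcf.gz"])
    return commands


if __name__ == "__main__":
    # Placeholder inputs: one recalibrated BAM per macaque.
    bams = {"macaque1": "macaque1.recal.bam", "macaque2": "macaque2.recal.bam"}
    for cmd in germline_calling_commands("ref.fa", bams):
        print(" ".join(cmd))
```

No tumor/normal pair or panel of normals is involved anywhere in this route.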

written 17 months ago by WouterDeCoster42k

This is starting to make a lot more sense now. Thank you.

written 17 months ago by YaGalbi1.4k

Germline short variant discovery (SNPs + Indels), best practices workflow: https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145

modified 17 months ago • written 17 months ago by cpad011212k

Yes, I see now that based on my misunderstanding of the difference between somatic + germline in this context, I had made the mistake of selecting "somatic snvs + indels" instead. Thank you.

written 17 months ago by YaGalbi1.4k