Question

How to call denovo and rare variants from a trios cohort?

8

Entering edit mode

4.6 years ago

DareDevil ★ 4.4k

Hi all, I have two objectives for my analysis:

To find out denovo and rare variants from autism trios analysis Trios: (Father, Mother, Proband)
To detect frequently occuring mutation across autism patient.

I have used GATK best practice to identify the germline variants across 15 trios (Total 45 samples [15X3]) and I have generated gVCF file for all the 45 samples.

Can some one guide me how to call de novo variants from this trios cohort?

gatk gvcf trio denovo variants • 3.9k views

ADD COMMENT • link updated 4.6 years ago by Jeremy Leipzig 23k • written 4.6 years ago by DareDevil ★ 4.4k

0

Entering edit mode

4.6 years ago

Jeremy Leipzig 23k

You'll need a VCF with the trio(s), not 45 gVCFs, so GenotypeGVCFs those together (or at least into 15 trio vcfs)

Annotate with snpEff

java -Xmx4G -jar snpEff.jar -i vcf -o vcf GRCh37.69 trio.vcf > trio_snpeff.vcf

and then to get variant frequencies, annotate with gnomAD using VCFanno vcfanno: issues to annotate VCF with gnomAD r3.0

Gemini is a nice tool to query VCFs

gemini load -v trio_snpeff.vcf -p trio.ped trio.db
gemini de_novo --columns "chrom,start,end,etc..." trio.db

Bear in mind in this approach you'll detect a ton of de novo mutations, but only 50 or so in a genome or just 1 or 2 in an exome are real. So you will likely have to use a lot of filters to get rid of artefacts. In most papers they still Sanger the candidates.

ADD COMMENT • link 4.6 years ago by Jeremy Leipzig 23k

0

Entering edit mode

@Jeremy Leipzig

Are there much difference between gVCF and VCF.

In GATK forum they appear to be similar for me.

Are you sure the analysis can not be carried out with gVCF ?

Thanks

ADD REPLY • link 4.6 years ago by DareDevil ★ 4.4k

0

Entering edit mode

a gVCF can only handle one individual, so you couldn't use them call trios with tools that expect a trio VCF (correction: yes there are multi-sample gVCFs that briefly arise from CombineGVCFs but they are never found in the wild)

ADD REPLY • link 4.6 years ago by Jeremy Leipzig 23k

1

Entering edit mode

Are there any tools to convert gvcf to vcf file. I found another question here

They use following command

gzip -dc sample.genome.vcf.gz | extract_variants | bgzip -c > sample.vcf.gz

is it fine to do like this?

@Jeremy Leipzig

@ Pierre Lindenbaum

ADD REPLY • link 4.6 years ago by DareDevil ★ 4.4k

0

Entering edit mode

you said

I have used GATK best practice to identify the germline variants across 15 trios

and then

Are there any tools to convert gvcf to vcf file

?????!!!!

ADD REPLY • link 4.6 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hi @Pierre Lindenbaum, I used following command to generate the vcf files. Is it correct to call the vcf files this way

Variant calling using Haplotype caller algorithm

gatk HaplotypeCaller --input sample1_sorted_int.deduped.bqsr.bam --reference $reference --emit-ref-confidence GVCF --dbsnp $DBSNP --output sample1_genome.vcf.gz -L $interval

Calling only variants from a Genome VCF file

gatk GenotypeGVCFs -R $reference --variant sample1_genome.vcf.gz -O sample1_genotypeGvcf.vcf -L $interval

ADD REPLY • link 4.6 years ago by DareDevil ★ 4.4k

0

Entering edit mode

Is it correct to call the vcf files this way

no you first need to call CombineGVCFs for all your samples and then GenotypeGVCFs

ADD REPLY • link 4.6 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

@Pierre Lindenbaum

Can you clarify my doubt ? there are gvcf files from mother, father and proband. (15*3=45 files)

How I need to combine the gvcf file:

Combine all 45 gvcf files
Combine 15 proband's gvcf files
Combine gvcf of single trios, similarly for all 15 trios separately ?

ADD REPLY • link 4.6 years ago by DareDevil ★ 4.4k

0

Entering edit mode

combine ALL files.

ADD REPLY • link 4.6 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

@Pierre Lindenbaum, I appreciate your help

Hi, I tried combinegvcf followed by GenotypeGVCFs

gatk CombineGVCFs \
    -R $reference \
    -L $interval \
    --variant 001C1.vcf.gz \
    --variant 001c2.vcf.gz \
    --variant 001P1.vcf.gz \
        .....
        .....
        -O cohort.g.vcf.gz

gatk --java-options "-Xmx16g" GenotypeGVCFs  --reference $reference --variant cohort.g.vcf.gz --output output.vcf.gz --dbsnp $DBSNP --intervals $interval

It gave me an output like this,

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  001C1   001P1   001P2   001c2   003C    003P1   003P2   004C    004P1   004P2   007C    007P1   007P2   009C    009P1   009P2   010C    010P1   010P2   013C    013P1   013P2   022C    022P1   022P2   023C    023P1   023P2   030C    030P1   030P2   034C    034P1   034P2   036C    036P1   036P2   042C    042P1   042P2   044C    044P1   044P2
chr1    13273   .   G   C   1660.27 .   AC=11;AF=0.196;AN=56;BaseQRankSum=0.464;DP=295;ExcessHet=0.0920;FS=0.000;InbreedingCoeff=0.3988;MLEAC=17;MLEAF=0.304;MQ=42.17;MQRankSum=-1.139e+00;QD=13.61;ReadPosRankSum=0.848;SOR=0.680  GT:AD:DP:GQ:PL  0/0:9,0:9:0:0,0,170 0/1:8,7:15:99:135,0,153 1/1:0,23:23:69:601,69,0 0/1:12,6:18:94:94,0,270 0/0:14,0:14:42:0,42,376 1/1:0,11:11:33:277,33,0 0/0:20,0:20:60:0,60,537 0/0:4,0:4:12:0,12,115   0/1:6,5:11:97:97,0,115  0/0:2,0:2:6:0,6,51  0/0:8,0:8:24:0,24,199   ./.:0,0:0:.:0,0,0   0/0:15,0:15:45:0,45,363 0/0:11,0:11:33:0,33,292 0/0:14,0:14:42:0,42,349 ./.:0,0:0:.:0,0,0   0/0:9,0:9:27:0,27,227   0/0:9,0:9:27:0,27,249   0/0:6,0:6:18:0,18,150   0/0:6,0:6:18:0,18,155   0/0:7,0:7:21:0,21,174   0/1:5,9:14:82:178,0,82  ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   0/0:7,0:7:21:0,21,203   ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   0/0:3,0:3:9:0,9,81  ./.:0,0:0:.:0,0,0   0/0:2,0:2:6:0,6,66  0/0:14,0:14:42:0,42,382 ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   0/0:8,0:8:24:0,24,190   ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   0/0:5,0:5:15:0,15,125   ./.:0,0:0:.:0,0,0   1/1:0,7:7:21:180,21,0   0/1:15,8:23:99:129,0,357    ./.:0,0:0:.:0,0,0

How do I call denovo variants from each trio set here?

ADD REPLY • link 4.5 years ago by DareDevil ★ 4.4k

0

Entering edit mode

see my answer. I told you to use -A PossibleDeNovo

ADD REPLY • link 4.5 years ago by Pierre Lindenbaum 166k

1

Entering edit mode

May I know whether the pedigree file which I created is correct?

FamID   IndID   PatID   MatID   Sex     Phenotype
1       in1     Pat1    Mat1    1       2
1       Pat1    0       0       1       0
1       Mat1    0       0       2       0
2       in2     Pat2    Mat2    1       2
2       Pat2    0       0       1       0
2       Mat2    0       0       2       0

where Pat is fatherID;

Mat is MotherID;

in is childID

ADD REPLY • link 4.5 years ago by DareDevil ★ 4.4k

2

Entering edit mode

Ya, this format is correct, but you need to add # in the first line as:

#FamID   IndID   PatID   MatID   Sex     Phenotype
1       in1     Pat1    Mat1    1       2
1       Pat1    0       0       1       0
1       Mat1    0       0       2       0
2       in2     Pat2    Mat2    1       2
2       Pat2    0       0       1       0
2       Mat2    0       0       2       0

ADD REPLY • link 4.5 years ago by svp ▴ 690

1

Entering edit mode

why don't you just try to use it ?

ADD REPLY • link 4.5 years ago by Pierre Lindenbaum 166k

score 3 · Accepted Answer · 2020-12-04

3

Entering edit mode

4.6 years ago

Pierre Lindenbaum 166k

run GATK VariantAnnotator with '-A PossibleDeNovo --pedigree the.ped '

ADD COMMENT • link 4.6 years ago by Pierre Lindenbaum 166k