Troubleshooting small-variant benchmarking: hap.py and RTG vcfeval
8 weeks ago
emmanouil.a

Hi!

I tried to follow what other posts describe, but I ran into a problem that I do not fully understand.

1) I downloaded the FASTQ files from Garvan (https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/) together with the target BED file. Since the BED file is, as I understand it, in hg19, I had to lift it over to hg38 (my_regions).
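The lift-over step can be done with UCSC liftOver; a sketch, where the chain file and the input/output names are placeholders:

```shell
# Lift the hg19 target BED over to hg38 with UCSC liftOver.
# hg19ToHg38.over.chain.gz comes from the UCSC downloads page;
# intervals that cannot be mapped land in unmapped_regions.bed.
liftOver \
    nexterarapidcapture_expandedexome_targetedregions.bed \
    hg19ToHg38.over.chain.gz \
    my_regions_hg38.bed \
    unmapped_regions.bed
```

It is worth checking how many intervals end up in the unmapped file, since a large number of unmapped targets would shrink the evaluation regions.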

2) I downloaded the truth VCF (truth.vcf) and the high-confidence regions (confidence.bed) from https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/latest/GRCh38/

3) I processed the FASTQ files with the GATK pipeline, up to HaplotypeCaller. Garvan used two libraries for the same sample. I processed them separately and did not merge them before variant calling.
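If the two libraries are meant to be analysed as one sample, they can be merged at the BAM level before calling (HaplotypeCaller also accepts repeated -I arguments); a sketch with placeholder file names:

```shell
# Merge the two per-library BAMs after BQSR, then call variants once.
samtools merge NA12878.merged.bam lib1.recal.bam lib2.recal.bam
samtools index NA12878.merged.bam

# Equivalently, pass both BAMs to HaplotypeCaller with repeated -I.
gatk HaplotypeCaller \
    -R Hg38.fa \
    -I NA12878.merged.bam \
    -O query.vcf.gz
```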

4) To compare my query data (the output of HaplotypeCaller) with the GIAB truth set, I ran these two commands (the second generates an .html report):

hap.py ${truth_vcf} ${query_vcf} -f ${confidence_bed} -o /hap.py_results/${outname} -r ${Hg38}

python rep.py -o /hap.py_results/"${outname}.html" "${outname}"_hap.py:/hap.py_results/"${outname}.roc.all.csv.gz"

5) I also re-ran my results with a similar command that I found and tested in a GA4GH tutorial:

hap.py ${truth_vcf} ${query_vcf} -f ${confidence_bed} -o /hap.py_results/"${outname}" --engine=vcfeval --engine-vcfeval-path=rtg --no-decompose --no-leftshift

6) I also ran:

rtg vcfeval -b ${truth_vcf} -c ${query_vcf} -o /hap.py_results/rtgHC/ -t ${ref_sdf} -e ${confidence_bed} --region=${my_regions}

This fails with:

Error: After applying regions there were no sequence names in common between the reference and the supplied variant sets. Check the regions supplied by --region or --bed-regions are correct.

my_regions:

chr1 826206 827522
chr1 827683 827775
...

confidence regions:

chr1 821881 822046
chr1 823126 823188
chr1 823426 823479
chr1 823580 826321
chr1 826481 827827
chr1 827928 839291
...

RESULTS: with commands 5 and 6 the problem is that I get a recall of 1.65%, while the precision is 97.07% and the F-score is 0.033.

What is the problem with the recall and, in turn, with the F-score? The intervals? How can I fix the problem with the BED files? How was the analysis done, considering that the intervals in the two BED files do not overlap?

Many thanks for your time!

Can we see a snippet of the query.vcf?

chr1 14677 . G A 471.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.113;DP=53;ExcessHet=3.0103;FS=3.209;MBQ=30,30;MFRL=164,149;MLEAC=1;MLEAF=0.500;MMQ=24,24;MPOS=26;MQ=28.34;MQRankSum=0.903;QD=9.83;ReadPosRankSum=-0.156;SOR=0.246 GT:AD:DP:GQ:PL 0/1:26,22:48:99:479,0,581

Just to be clear, your liftover regions are only used with GATK, not hap.py or rtg, correct?

In GATK (BQSR and HaplotypeCaller) I used the lifted-over version of the original "nexterarapidcapture_expandedexome_targetedregions.bed.gz". I used the same (lifted-over) file in point 6 as well, as --region=${my_regions}.

In points 4 and 5 (the -f option) I used "HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_noCENorHET7.bed"
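A quick way to check whether the target regions and the high-confidence regions actually share intervals is to intersect them. In practice `bedtools intersect -a my_regions.bed -b confidence.bed` does this; below is a self-contained awk sketch using the snippets quoted above (two target intervals against the two nearby confidence intervals), so no external tool is assumed:

```shell
# Recreate the quoted BED snippets.
cat > my_regions.bed <<'EOF'
chr1	826206	827522
chr1	827683	827775
EOF
cat > confidence.bed <<'EOF'
chr1	823580	826321
chr1	826481	827827
EOF

# For each target interval, print its overlap with each confidence
# interval (same idea as bedtools intersect on sorted BED input).
awk 'NR==FNR { chr[NR]=$1; s[NR]=$2+0; e[NR]=$3+0; n=NR; next }
     { for (i = 1; i <= n; i++)
         if ($1 == chr[i]) {
           lo = ($2+0 > s[i]) ? $2+0 : s[i]
           hi = ($3+0 < e[i]) ? $3+0 : e[i]
           if (lo < hi) print $1, lo, hi
         } }' confidence.bed my_regions.bed
# prints:
# chr1 826206 826321
# chr1 826481 827522
# chr1 827683 827775
```

So most of these target bases do fall inside the high-confidence regions; the intervals overlap without being identical, which suggests the poor recall came from how the regions were passed to the tools rather than from genuinely disjoint BED files.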


You need to run vcfeval with --bed-regions rather than --region; the latter is for inline <sequence_name>:<start>-<end>-style arguments, not BED files.
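With that change, the command from point 6 becomes (using valid shell variable names in place of the ${truth.vcf}-style placeholders, which bash would reject):

```shell
# Restrict the comparison to the lifted-over target BED with
# --bed-regions; -e (--evaluation-regions) still takes the GIAB
# high-confidence BED.
rtg vcfeval \
    -b ${truth_vcf} \
    -c ${query_vcf} \
    -t ${ref_sdf} \
    -e ${confidence_bed} \
    --bed-regions ${my_regions} \
    -o /hap.py_results/rtgHC/
```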


You are great, thank you! Now at least the command in point 6 works:

Threshold   Precision   Sensitivity   F-measure
-----------------------------------------------
3.000       0.9437      0.8938        0.9181
None        0.9432      0.8940        0.9180
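The F-measure vcfeval reports is the harmonic mean of precision and sensitivity, which can be verified by hand for the Threshold=3.000 row:

```shell
# F = 2*P*R / (P + R), with P and R taken from the table above.
awk 'BEGIN { p = 0.9437; r = 0.8938; printf "%.4f\n", 2*p*r/(p+r) }'
# prints 0.9181
```

This also explains the earlier numbers: with precision 0.9707 but recall only 0.0165, the harmonic mean collapses to about 0.033.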

Now I have to fix points 4 and 5!

Are there any general guidelines for improving the pipeline? Looking at the data, it is not that good at the moment ...

In general, is it better to use WGS rather than WES data for this kind of analysis, still with NA12878?

Many thanks


A lot of benchmarking focuses on HG002 (NA24385). I think exomes, or even smaller subsets of genes, are still more popular than WGS for benchmarks. The company I work for, Truwl, is developing dashboards to make it much easier to manage these types of comparisons. I'll follow up here when a demo is ready.


Many thanks, I will also try with HG002!
