Question

Calling CFTR Poly TG/Poly T Genotype

1

Entering edit mode

5.2 years ago

Robert Sicko ▴ 630

Background: The CFTR gene has a Poly TG/Poly T region in intron 8 that has varying clinical consequences. I'm using targeted NGS for this region (150bp, paried end) and I'm trying to call the PolyTG/PolyT genotypes from fastq files. My first attempt was bbmap run with default parameters and a reference including (TG)9-12(T)5,7,9:

bbmap.sh ref=$ref in1=$R1 in2=$R2 scafstats=$out

39/40 samples could be correctly called based on the bbmap stats (header shown below). Heterozygous calls were made from ref_name's with the highest unambiguous reads. Homozygotes could be figured based on the ratio of the top two.

  #sample   ref_name    per_unambiguousReads    unambiguousMB   per_ambiguousReads  ambiguousMB unambiguousReads    ambiguousReads  assignedReads   assignedBases

I expanded the reference to catch rare genotypes (e.g. T6 and TG13). The expanded reference has (TG)8-13(T)2-10. Running bbmap with this reference results in many incorrect calls.

I'm wondering if there are parameters I should be tweaking in bbmap or if there is a better solution to calling the Poly TG/Poly T genotypes that I'm not thinking of.

Thanks, Bob

NGS alignment • 2.5k views

ADD COMMENT • link updated 5.2 years ago by Charles Warden 8.2k • written 5.2 years ago by Robert Sicko ▴ 630

1

Entering edit mode

5.2 years ago

Charles Warden 8.2k

I hope you have a lot of good positive control variants / samples for troubleshooting.

For example, I would have the following concerns:

1) If you are doing targeted sequencing, you may have more of a need to have specialized variant calling. For example, GATK can do some weird things in high coverage regions (like >1000x), and I think Figure in the Warden et al. 2014 paper kind of illustrates my point (although perhaps more so for samtools than GATK):

enter image description here

Not sure how much a difference looking for unique sequences in reads would make, but I'm assuming that requires no nearby SNPs.

Also, in the figure above, please notice the scale of the y-axis is different (particularly for the left plot with the most focused sequencing).

2) I am a cystic fibrosis carrier, and I had reports from some companies that falsely identified me as not being a carrier (even though you could verify the variant in the raw data):

https://github.com/cwarden45/DTC_Scripts/tree/master/Helix_Mayo_GeneGuide

https://github.com/cwarden45/DTC_Scripts/tree/master/Genos_Exome (please scroll to the bottom for alignment screenshot: also true for GET-Evidence report for Veritas WGS data)

So, carefully considering all steps in the analysis is important. I like the idea of specializing in assessment for a particular disease / gene (for clinical reporting / interpretation), but I don't there is way to quickly be confident your strategy will be without problems (you need to spend a substantial amount of time critically assessing your ideas, and still expect a need for troubleshooting / support after providing something in the clinic).

I realize this is probably for research purposes, and I think you have a good point about needing some way to assess each sample (preferably in a way that can be done without coding). In the research setting, I think you should definitely expect some need to modify strategies. In the clinical (or DTC) setting, you need to be required to have access to your raw data, but I can see how this could cause issues with physicians (at least in settings where they have to be able to make decisions with a limited amount of available time).

ADD COMMENT • link 5.2 years ago by Charles Warden 8.2k

0

Entering edit mode

Charles, thank you for the thoughtful response. I completely agree with your statement on the need to throughly validate variant calling, especially in a targeted panel, especially when reporting clinically. We are validating the assay for clinical use (screening). The script I posted above is only for TG(m)T(n) calling. The snp/indel, structural variant and CNV calling is all done using Archer Analysis, which under the hood uses Freebayes and in-house algorithms specific to their VariantPlex assay (anchored multiplex PCR w/ molecular barcodes). We are (IMO) very throughly validating the assay with multiple samples containing variants of all types (snp/indel/SV/CNV) - 200+ samples. In addition to these samples, we plan to generate in-silico positives to test more of the left/right alignment issues with indels. The script above was tested on these 200+ samples and correctly identified the (TG)m(T)n status for all samples. I will also add, any reportable (TG)m(T)n allele will be confirmed using an independent extract and technology.

Thanks for the details of your carrier information. It is technical issues such as this that keep me up at night! Even for pipelines that are heavily tested, there are always fringe cases that you did not think of. It is splitting hairs, but I would argue that the Mayo report is not incorrect in stating you are not a carrier [for the variants they look at]. The GET-evidence report im puzzled by; it correctly right-aligned it "CFTR-L88Shift", so I'm not sure why the correct annotations were not pulled.

ADD REPLY • link 5.2 years ago by Robert Sicko ▴ 630

0

Entering edit mode

The Mayo Report actually uses Exome Sequencing (not just the SNPs that they describe), so I was kind of surprised they don't consider that for their report (and suggested they report more variants with varying levels of confidence, if that is what they were worried about - even though the variant already had 3 stars in ClinVar).

However, unless you can open the file in Excel, I think visualizing the gVCF is more difficult to check / re-analyze than a .bam file (which I thought showed very clear evidence for my variant). So, that is one of the reasons I currently wouldn't recommend other people use that for testing.

Maybe there is some other explanation, but I thought the indel formatting was a factor (and I'm sure that must contribute to some formatting issues, although knowing the variant is in a homopolymer run from freebayes could also be good information to know).

Thank you for the detailed response!

ADD REPLY • link 5.2 years ago by Charles Warden 8.2k

0

Entering edit mode

I also see that NP_000483.3:p.Leu88fs is an alias in dbSNP, so perhaps that does match the "Insufficient Evidence" tab for GET-Evidence (although I probably should have mentioned the ClinVar report, even thought I don't see any CFTR variants there, for whatever reason).

Thank you for causing me to look into that! I updated the GitHub page with the IGV screenshot (it is a note for within the Genos Exome section, mentioning the Veritas WGS results on the PGP; however, I eventually hope to have a blog post that may make this more clear).

ADD REPLY • link 5.2 years ago by Charles Warden 8.2k

0

Entering edit mode

Charles, I looked into GET-evidence further. I see their about page does at least states it is not clinical quality:

"GET-Evidence is not a “clinical quality” tool for diagnosis or treatment of any disease or medical condition."

They do allow anyone with a login to edit their annotations, so this variant and others that I spot checked against CFTR2 "known disease causing" could be corrected.

The Genos VCF looks malformed. I'm not sure if that contributed to the miss or not (maybe a silent fail on the annotation tool?):

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  unknown Sample1

However, the variant is in the VCF:

zcat K33YDXX.vcf.gz | grep "117149"
chr7    117149181   .   CTTTTTA CTTTA   187.17  PASS    AB=0.291667;ABP=12.0581;AC=2;ADP=25;AF=0.5;AN=4;AO=7;CIGAR=1M2D4M;DP=24;DPB=22.4286;DPRA=0;EPP=5.80219;EPPR=3.55317;GTI=0;HET=1;HOM=0;LEN=2;MEANALT=2;MQM=60;MQMR=60;NC=0;NS=1;NUMALT=1;ODDS=43.0979;PAIRED=1;PAIREDR=1;PAO=0.5;PQA=20;PQR=20;PRO=0.5;QA=279;QR=650;RO=16;RPP=5.80219;RPPR=3.55317;RUN=1;SAF=5;SAP=5.80219;SAR=2;SF=0,1;SRF=8;SRP=3.0103;SRR=8;TYPE=del;WT=0    GT:PVAL:AD:RDR:ADF:RO:FREQ:GL:ADR:RDF:GQ:QR:DP:RD:SDP:RBQ:AO:QA:ABQ 0/1:.:.:.:.:16:.:-10,0,-10:.:.:.:650:24:.:.:.:7:279

This variant is in the Helix gVCF (note Helix pipeline used hg38). I can see Mayo limiting their analysis to particular variants that they are very confident in a stable classification. I agree variants with "expert review" should have correct classifications, but I also don't have a good feel for how often a "pathogenic" 3 star variant is later found VOUS or benign:

   bcftools view -H -r chr7:117509120-117509150 Helix_ExomePlus_variants_1550345278.vcf.gz
chr7    117508942   .   T   <NON_REF>   .   PASS    END=117509126   GT:DP:GL:GQ:MIN_DP  0/0:86:-0,-10.8,-147:99:44
**chr7  117509127   .   CTT C   1199.73 PASS    .   GT:AD:DP:GL:GQ:VAR_TYPE 0/1:42,36:78:-123.7,-0,-150.9:99:DELETION**
chr7    117509130   .   T   <NON_REF>   .   PASS    END=117509152   GT:DP:GL:GQ:MIN_DP  0/0:77:-0,-10.2,-153:99:57

It is in the Genos VCF:

genos=hu832966_20170311194858.gz
zcat $genos | grep CFTR
chr7    117149181   .   CTT C   244.73  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=1.11;ClippingRankSum=-1.022e+00;DP=19;ExcessHet=3.0103;FS=8.193;MLEAC=1;MLEAF=0.500;MQ=60.57;MQRankSum=0.844;QD=13.60;ReadPosRankSum=-3.370e-01;SOR=2.833;set=variant   GT:AD:DP:GQ:PL:CGIANN_VARNAME:CGIANN_1000GAF:CGIANN_ESP6500AF   0/1:10,8:18:99:282,0,361:-,NM_000492.3(CFTR) c.259T[3] (p.Leu88Ilefs*22):-,NOT_FOUND:-,NOT_FOUND

ADD REPLY • link 5.2 years ago by Robert Sicko ▴ 630

0

Entering edit mode

Awesome - I am glad that this interested you enough to look into it more!

I didn't previously know about the CFTR2 resource - I think that is something really good to have. I used it to look up more information about 394delTT, and that also said that I should be considered a carrier.

I am also testing out updating the GET-Evidence report (to point out the dbSNP and CFTR2 entries, and I also added 4 PubMed entrires), and I noticed that it is getting a prioritization of 4. However, it didn't automatically connect to dbSNP, and (as far as I could tell) I couldn't link that myself (but I sent an e-mail that hopefully gets that sorted out).

Thanks again for following up!

ADD REPLY • link 5.2 years ago by Charles Warden 8.2k

0

Entering edit mode

Also, I'm not sure if people who otherwise get sequencing from Veritas for WGS have a different report (I got data from them, but through the PGP, with PGP reports). However, I did have an on-line physican sign-off for Veritas, Genos, GeneGuide, and Color (although I never talked to any of them, so I assume everybody gets approved). GeneGuide and Color also include support from a genetic counselor in their pricing.

However, Mayo GeneGuide specifically tests disease status for 4 genes (one of which is CFTR). They do Exome sequencing that is capable of detecting my variant (even though I could tell they weren't checking that in the formal report). ClinVar, CFTR2, and the genetic counselors all agreed that I should be considered a carrier, so I suggested GeneGuide support should look into changing what they include in their report (since I was reported as not being a carrier in their test).

Within the genetic testing registry, I would find "Whole Genome Sequencing" as a "clinical test" for Veritas, but I am not entirely clear what is counted under the "hereditary disease" portion of their test. Color is also in the GTR (and, if I remember correctly, I think they did provide the most detailed approval e-mail), but cystic fibrosis is not among the 26 conditions they official test for (although I hope they can update their options to provide raw data while I wait for the additional testing to complete, especially since I picked them to test because of the data.color.com interface). "Mayo Clinic Genetic Testing Laboratories" is in the GTR, but not the Mayo GeneGuide (or Helix). I could not find Genos as a "lab" in the GTR.

23andMe is also not in the GTR, even though they have FDA approval for some reports, and they were the only one that correctly identified me as being a cystic fibrosis carrier (even though I could have identified myself as a carrier with re-analysis of the raw data from Veritas, Genos, or GeneGuide/Helix). I know they have CLIA certification. I thought the cystic fibrosis result was FDA-approved because I currently see it, and they had to remove a lot of results for a period of time; however, I don't think the cystic fibrosis status is actually approved by the FDA because I don't see 23andMe listed for cystic fibrosis on this page from the FDA.

Update (4/20): I previously didn't think my variant was on the V5 chip (I had a V3 chip), but this was a formatting issue. I was looking for rs121908769, but I needed to use 23andMe's identifer.

So, this is what my cystic fibrosis variant looks like in 23andMe's raw data that you can download:

i4000313    7   117149185   DI

(carriers will be "DI," people that get cystic fibrosis from having two copies of this same variant will be "DD", even though this is not the most common cystic fibrosis variant)

I apologize for the confusion: I no longer have the specific concern about there being informative probes removed from the later array (at least for this variant).

Also, I figured this out when a relative informed me about their result. So, maybe the cystic fibrosis status is approved by the FDA to be returned to customers now? I admittedly still don't seem them on the page linked above, but I'm not otherwise sure about the explanation.

ADD REPLY • link 5.0 years ago by Charles Warden 8.2k

score 1 · Accepted Answer · 2019-02-27

Just to wrap this up. I found this paper. I contacted Dr. Pagin and he explained their script used grep on fastqs and counted different alleles to make a call per sample. I used that idea in this script. If anyone uses this script in the future, the HET_CUTOFF (line 554) will likely vary depending on the panel you use to sequence CFTR; adjust according to your data.