Question

Interesting discrepancy between NGS (Illumina) and Sanger seq.

1

16 months ago

Manuel ▴ 40

I have an interesting case I would like to share with you.

From our NGS bioinformatics pipeline we got the following 4 variants (top of the following image), and then from Sanger sequencing we got a single variant.

enter image description here

DNA from blood

Pipeline: BWA-MEM and HaplotypeCaller

I have a couple of questions. First, why is the first variant unphased when it sits on the same reads as the other variants in many of the sequenced reads?

enter image description here

Second, I understand the limitations of short-read technology, but this delins is small and is covered by many reads, so why is there such a significant difference between the two approaches?

NGS sanger Illumina variant • 1.9k views

ADD COMMENT • link updated 16 months ago by barslmn ★ 2.1k • written 16 months ago by Manuel ▴ 40

1

Post the sequences as text so that people can align them.

Your alignments also look a little weird. Sometimes a mismatch is represented as a gap symbol. Why is that?

In general, though, this is a problem with low information repetitive regions that have variation in them. It is a well-known problem.

What helps here is if you manually run the alignments and see how and why the math resolves them a certain way.

I would do it myself, but I don't want to type up the sequences.

ADD REPLY • link 16 months ago by Istvan Albert 100k

0

Thanks for your comment.

This is one read carrying the last three variants from the table above:

Mapping = Primary @ MAPQ 60
Reference span = chr15:48,740,946-48,741,090 (-) = 145bp
Cigar = 43M3I14M3I88M
Clipping = None
Mate is mapped = yes
Mate start = chr15:48740924 (+)
Insert size = -166
Second in pair
Pair orientation = F1R2
MC = 64M3I14M3I67M
NM = 8
AS = 117
XS = 20
Hidden tags: MD, RG
Location = chr15:48,741,010
Base = A @ QV 40
Alignment start position = chr15:48740946

ATAATAATTGCATACTTACCCAAGCACATGGTTTGGTCATCATTTGTTGTTTTAAAACAAATGATGTGGCAAAGGCAATAAAAGCTTCCAACTGTGTCAATGCACTGCCCATGACTGCATATATTGGGGATTTCTTGACATTCATTACGAT
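
As a quick sanity check on the read details above, the CIGAR string can be decoded to confirm the reference span and to locate the two insertions. This is a minimal sketch in plain Python, not part of any pipeline mentioned in this thread; the helper name `summarize_cigar` is made up for illustration.

```python
import re

# CIGAR operations that consume reference bases / read bases (SAM specification).
REF_CONSUMING = set("MDN=X")
READ_CONSUMING = set("MIS=X")


def summarize_cigar(cigar, ref_start):
    """Return (reference span, read length, insertions) for a CIGAR string.

    ref_start is the 1-based leftmost mapped position of the read.
    Each insertion is reported as (reference position just before it, length).
    """
    ref_pos = ref_start  # next reference base to be consumed
    ref_span = 0
    read_len = 0
    insertions = []
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op == "I":
            insertions.append((ref_pos - 1, length))
        if op in REF_CONSUMING:
            ref_pos += length
            ref_span += length
        if op in READ_CONSUMING:
            read_len += length
    return ref_span, read_len, insertions


span, read_len, ins = summarize_cigar("43M3I14M3I88M", 48740946)
print(span, read_len, ins)
```

For this read it gives a reference span of 145 bp, which matches what IGV reports, a read length of 151 bases, and two 3-bp insertions placed after chr15:48,740,988 and chr15:48,741,002.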

and this is the reference

tttattttgt atatagcaaa aatactacta aaagacttag tattaaattt 48740895
tatccatatt tagaatcaaa tgaagctttc aacagcatat gaaaaaaata 48740945
ATAATAATTG CATACTTACC CAAGCACATG GTTTGGTCAT CATTTGTTTT 48740995
AAAACcagTG TGGCAAAGGC AATAAAAGCT TCCAACTGTG TCAATGCACT 48741045
GCCCATGACT GCATATATTG GGGATTTCTT GACATTCATT ACGATctgta 48741095
aataagaagc atcttaagtg agaacttaga agacaaaata taattgaata 48741145
acttacttct agctatcatt ctcaggagta atcctagctc taaac.

If this is not what you asked for or need something else, please let me know.

ADD REPLY • link 16 months ago by Manuel ▴ 40

1

The sequences are as follows:

Ref:  ATTTGTTTTAAAACCAGTG
Read: ATTTGTTGTTTAAACAAATGATG
SS:   ATTTGTTGTTTTAAAACAATGATAAAACTAAAACCAGTG

Personally it looks to my eye like some of the reads are soft-clipped. HaplotypeCaller still uses soft-clipped bases in graph-based local assembly, but perhaps your caller did not?
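
To see how the scoring "math" decides where gaps and mismatches go in sequences like the ones above, a small global aligner can be run with different gap penalties. The following is a minimal Needleman-Wunsch sketch with a linear gap penalty; the scoring values are illustrative assumptions, not the parameters BWA-MEM or HaplotypeCaller use, and the function name `global_align` is made up for this example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Only one of the possibly
    many optimal alignments is reported.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Sequences as posted above; try several gap penalties and compare.
ref = "ATTTGTTTTAAAACCAGTG"
read = "ATTTGTTGTTTAAACAAATGATG"
for gap in (-1, -2, -4):
    s, x, y = global_align(ref, read, gap=gap)
    print(f"gap={gap} score={s}")
    print(x)
    print(y)
```

In a repetitive, low-complexity stretch like this one, several placements of the gaps can score equally or nearly equally, so small changes in the penalties may move an indel or turn it into mismatches.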

ADD REPLY • link 16 months ago by LChart 3.9k

1

The reference should not be shorter than the longest alignment.

You should include the reference from the leftmost to the rightmost aligned position (and probably 10 more bases before and after as well), otherwise we can't recreate the alignments.

ADD REPLY • link 16 months ago by Istvan Albert 100k

1

Sorry Istvan, I can only provide the sequences in the OP's post. Probably this is the OP's responsibility.

ADD REPLY • link 16 months ago by LChart 3.9k

0

I have checked and there are several soft-clipped reads.

ADD REPLY • link 16 months ago by Manuel ▴ 40

score 1 · Answer 1 · 2022-12-16

1

16 months ago

barslmn ★ 2.1k

This looks like an HGVS nomenclature issue. Look especially at the notes section of the delins page: https://varnomen.hgvs.org/recommendations/DNA/variant/delins/

Also, did you create the pairwise alignments yourself? Mutalyzer gives sequence changes based on HGVS: https://mutalyzer.nl/
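
To illustrate how several adjacent calls can describe the same change as a single delins, here is a minimal sketch that trims the bases shared by a reference window and an alternate haplotype and reports what is left. It is not an HGVS implementation (no 3' shifting, no duplication detection), so Mutalyzer remains the tool for real naming; the function name `collapse_to_delins` and the toy sequences are made up for illustration.

```python
def collapse_to_delins(ref, alt, ref_start=1):
    """Trim the shared prefix and suffix of two sequences.

    Returns (start, end, deleted, inserted), where start/end are 1-based
    reference positions of the deleted bases. For a pure insertion nothing
    is deleted and end == start - 1.
    """
    # Length of the common prefix.
    p = 0
    while p < len(ref) and p < len(alt) and ref[p] == alt[p]:
        p += 1
    # Length of the common suffix, not allowed to overlap the prefix.
    s = 0
    while (
        s < len(ref) - p
        and s < len(alt) - p
        and ref[len(ref) - 1 - s] == alt[len(alt) - 1 - s]
    ):
        s += 1
    deleted = ref[p:len(ref) - s]
    inserted = alt[p:len(alt) - s]
    start = ref_start + p
    end = ref_start + len(ref) - s - 1
    return start, end, deleted, inserted


# Toy example: a C>T substitution at position 3 plus a CC insertion after
# position 6, both on the same haplotype, described as one change.
print(collapse_to_delins("AACGTTTGA", "AATGTTCCTGA"))
```

On the toy input this reports positions 3 to 6 with `CGTT` replaced by `TGTTCC`, that is, one delins instead of a substitution and an insertion listed separately. Whether calls should be merged like this or reported individually is exactly what the notes on the HGVS delins page address.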

ADD COMMENT • link 16 months ago by barslmn ★ 2.1k

0

Thanks for your answer, both comments helped me to see what is going on here. I have used Mutalyzer to create a more accurate pairwise alignment, as shown below.

enter image description here

You mentioned that this looks like an HGVS issue, and I partially agree with you. But I was wondering if you agree with me that this is also a variant caller issue. The first and the last variants match the fifth variant (the one called with Sanger sequencing), but the second and the third are artefacts, right? The second variant, A>C, is an artefact based on the Sanger data, and the fourth is a partially duplicated variant?

Are these comments correct? I am relatively new to bioinformatics and I would just like to confirm that what I am seeing is OK.

ADD REPLY • link 16 months ago by Manuel ▴ 40

1

Your variants look much cleaner. I don't think this is a variant caller issue. The variant caller has no genome context, so it has no idea about naming the variants according to HGVS. Alignments can get funky around indels, which is why you can get false positives around them. GATK has lots of resources on how to filter and analyze the data: https://gatk.broadinstitute.org/hc/en-us/articles/360035531112--How-to-Filter-variants-either-with-VQSR-or-by-hard-filtering

However, I wouldn't comment on what is false positive or not without inspecting IGV and chromatograms.

ADD REPLY • link 16 months ago by barslmn ★ 2.1k

0

A final question: is the distribution of the allelic reads something I should be concerned about? Why is the distribution not close to 50-50?

ADD REPLY • link 16 months ago by Manuel ▴ 40

1

If you're using amplicon sequencing there might be PCR bias or there might be strand bias.

https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-666

https://www.sciencedirect.com/science/article/pii/S1532046422002398
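
One rough way to judge whether an allele split is further from 50-50 than sampling noise alone would explain is an exact binomial test on the read counts. The sketch below uses only the Python standard library; the counts are hypothetical, not taken from this thread, and the function name `allele_balance_pvalue` is made up for illustration.

```python
from math import comb


def allele_balance_pvalue(alt_reads, depth, expected=0.5):
    """Two-sided exact binomial p-value for alt_reads out of depth.

    Sums the probability of every outcome that is no more likely than
    the observed one under Binomial(depth, expected).
    """
    def pmf(k):
        return comb(depth, k) * expected**k * (1 - expected) ** (depth - k)

    observed = pmf(alt_reads)
    # Small tolerance so ties are not lost to floating-point rounding.
    total = sum(
        pmf(k) for k in range(depth + 1) if pmf(k) <= observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Hypothetical (alt reads, depth) pairs.
for alt, depth in [(48, 100), (30, 100), (12, 100)]:
    print(alt, depth, allele_balance_pvalue(alt, depth))
```

A small p-value only says the split is unlikely under independent 50-50 sampling. With amplicon data the reads are not independent (PCR duplicates), and reads carrying an indel can align less well than reference reads, so a skew here does not by itself mean the variant is false.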

ADD REPLY • link 16 months ago by barslmn ★ 2.1k