I was performing variant analysis for a plasmid sample sequenced on MiSeq with read length PE300. More than one million reads were generated. The mapping rate of the reads to the reference was 98%. I used samtools to detect variants and generate consensus sequence. 3 variants were detected for the sample as follows:
6354 ACGC>ACGCGC 100% 6108X 10384 CCGGGCGACCTAAGCCCGGGCGTCGGGCGACCT>CCGGGCGACCT 10% 2334X 10469 GACGT>G 100% 4875X
The variant at 10384 is present at 10% variant frequency.
The coverage at this position is low. I visualized the variant in IGV and sorted it based on base position.
The restriction analysis at this site indicates that the variant is present at a much higher frequency as the Sma1 is not cutting through this site.
Sanger sequencing performed for this site was inconclusive.
To analyse this deletion I performed de-novo assembly using SPAdes. I got a several contigs, from which I selected the one with more than 10,000bp and 100X coverage. I performed a BLAST of reference for this sample and the contig generated. The Blast output only shows the variant present at position 6354 and not the other two.
Query 6301 CCGATTCGCAGCGCATCGCCTTCTATCGCCTTCTTGACGAGTTCTTCTGAATTACGC--C 6358 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||| | Sbjct 4140 CCGATTCGCAGCGCATCGCCTTCTATCGCCTTCTTGACGAGTTCTTCTGAATTACGCGCC 4081
The length of the reference is 12089bp. The length of the contig is 11526bp.
I also performed variant calling again on the merged overlapping R1 and R2 reads. The variant calls are as follows:
6354 ACGC>ACGCGC 100.0 1057X 6357 CCAACGCGTTTG>CGGCAACGCGTTTG 94.59 1032X 10469 GACGT>G 100.0 608X
I am very confused by these results. How can I troubleshoot this and find the accurate variant frequency? I thought of some possible scenarios why VF % at position 10384 is less.
- The low coverage at the desired position.
- The length of the INDEL is greater than 50bp resulting in aligner not aligning reads at this position.
- There is a presence of secondary structure present at these position in plasmids resulting in inconclusive Sanger sequencing analysis, low number of reads generated at this position or incorrect INDEL call at the position.
Can anyone suggest me any other way to affirm this and how to interpret such results using NGS data?
Thanks in advance!!!