Hello everyone,
This is my first attempt at working with MIP data, and I noticed something unexpected. Unfortunately, the person who provided the data is not available, so I hope you can help me with it.
I have run a script to remove duplicates from the MIPseq fastq files.
Briefly, the script functions by:
- matching the sequences from each read to the extension and ligation arms of a probe in the provided design file, allowing < 2 bp of mismatches;
- for each probe, comparing reads with the same UMI (i.e. PCR duplicates) and outputting only the read pair with the highest average base quality.
This process generates probe-matched read-pairs without PCR duplicates, which represent unique DNA target capture events, and gives a summary of the total number of reads, unmatched reads, unique matched reads, PCR duplicates and so on.
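To make the two steps concrete, here is a minimal Python sketch of the logic as described above. It is not the actual script: the function names, the read layout (dicts with `umi`, `ext`, `lig`, `mean_qual`) and the interpretation of "< 2 bp" as at most one mismatch per arm are all assumptions made for illustration.

```python
def hamming(a, b):
    """Count mismatched positions; arms of different length never match."""
    if len(a) != len(b):
        return float("inf")
    return sum(x != y for x, y in zip(a, b))


def match_probe(ext, lig, probes, max_mismatch=1):
    """Return the name of the first probe whose two arms both match, else None.

    probes: dict mapping probe name -> (extension arm, ligation arm).
    Assumption: the mismatch limit applies to each arm separately.
    """
    for name, (p_ext, p_lig) in probes.items():
        if hamming(ext, p_ext) <= max_mismatch and hamming(lig, p_lig) <= max_mismatch:
            return name
    return None


def dedup(reads, probes):
    """Keep one read pair per (probe, UMI): the one with the highest mean quality."""
    best = {}
    stats = {"total": 0, "unmatched": 0, "duplicates": 0}
    for read in reads:
        stats["total"] += 1
        probe = match_probe(read["ext"], read["lig"], probes)
        if probe is None:
            stats["unmatched"] += 1
            continue
        key = (probe, read["umi"])
        if key in best:
            # Same probe and same UMI: treated as a PCR duplicate.
            stats["duplicates"] += 1
            if read["mean_qual"] > best[key]["mean_qual"]:
                best[key] = read
        else:
            best[key] = read
    stats["unique"] = len(best)
    return list(best.values()), stats
```

With this layout, the unmatched fraction for a sample is simply `stats["unmatched"] / stats["total"]`.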
Question 1: Half of my samples have >20% unmatched reads, and certain samples have >80% unmatched reads.
Is this expected for the MIPseq approach?
Question 2: I also noticed that the unmatched reads appear to contain MIP probes, but these probes are not in the design file we were provided.
Please see an example below. The read pair UMI:TAGAAC EXT:GGGGGTGGTGGGACCG LIG(rev&compl):CCCGGTCCCACCACCCCC (the first row in each screenshot) was removed because its EXT and LIG arms did not match the design file. However, I see many reads with EXT GGGGGTGGTGGGACCG and LIG (rev & compl) CCCGGTCCCACCACCCCC in the corresponding FASTQ files.
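One way to quantify how often such off-design arm pairs occur is to tally the (EXT, LIG) pairs among the unmatched reads. The sketch below assumes the unmatched reads have already been split into arms; the function name `tally_arm_pairs` and the dict keys `ext` and `lig` are hypothetical.

```python
from collections import Counter


def tally_arm_pairs(unmatched_reads, top_n=10):
    """Return the top_n most frequent (EXT, LIG) arm pairs with their counts."""
    counts = Counter((read["ext"], read["lig"]) for read in unmatched_reads)
    return counts.most_common(top_n)
```

If a handful of pairs account for most of the unmatched reads, those pairs can then be compared against the design file by hand.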
How would you explain this? Were we given the wrong design file? Or, considering the high proportion of unmatched reads, could there be a technical issue?
Thank you!
Maria.
[Screenshots: fastq.R2, fastq.R1]