I am looking for a tool that can identify longer insertions from genome resequencing data (Illumina paired-end reads).
Basically, I want something that does what the following methods section describes:
Initially, Illumina paired-end sequencing reads with average Phred scores ≥20 were retained, and duplicate sequences were removed using FastQC. These qualified reads were classified into three groups: 1) reads derived from rice endogenous genomic regions; 2) reads derived from a plasmid sequence containing transfer DNA; and 3) reads derived from the location of transgene integration sites that spanned the junction between plant and transgene sequences. To obtain the third group of reads, reads were first mapped back to transformation plasmid vector sequences (pPZP200 including T-DNA) using the Burrows-Wheeler Aligner with maximum exact matches (BWA-MEM) with a minimum seed length = 50 and band width = 2 while keeping the other default parameters. Mapped reads were then used as queries against the rice reference genome (Oryza sativa version 7.0) using BLAST (version 2.6.0), and reads were classified as false-positive if they aligned to rice endogenous gene rbcS3 (Os12g0291100) with an e-value of 1 × 10^-5. The remaining reads were aligned against the entire transformation plasmid sequences and visualized in the Integrative Genomic Viewer (IGV). From the IGV results, reads that matched against both ends of the T-DNA were collected and subjected to multiple sequence alignments to identify the insert junction location on the rice chromosome. The inserted junction location was identified using NCBI-BLAST against the rice reference genome (O. sativa).
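To make the junction-read step concrete, here is a rough sketch of how I imagine group 3 reads could be pulled out programmatically instead of by eye in IGV. It is not the paper's method, only my own assumption of an equivalent step: take the SAM output from mapping reads to the plasmid, keep reads whose alignment is soft-clipped at one end, and report the clipped part, which could then be BLASTed against the host genome. The function names and the `min_clip=20` threshold are made up for illustration.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) tuples, ignoring hard clips."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar) if op != "H"]

def junction_candidates(sam_lines, min_clip=20):
    """Yield (read, plasmid_pos, side, clipped_seq) for soft-clipped reads.

    plasmid_pos is the 1-based plasmid coordinate of the aligned base
    next to the clip. min_clip is an arbitrary, hypothetical threshold.
    """
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        f = line.rstrip("\n").split("\t")
        name, flag, pos, cigar, seq = f[0], int(f[1]), int(f[3]), f[5], f[9]
        # skip unmapped (0x4), secondary (0x100), supplementary (0x800)
        if flag & (0x4 | 0x100 | 0x800) or cigar == "*" or seq == "*":
            continue
        ops = parse_cigar(cigar)
        if len(ops) < 2:                  # no clipping possible
            continue
        if ops[0][1] == "S" and ops[0][0] >= min_clip:
            yield name, pos, "left", seq[:ops[0][0]]
        if ops[-1][1] == "S" and ops[-1][0] >= min_clip:
            # reference bases consumed by the alignment
            ref_len = sum(n for n, op in ops if op in "MDN=X")
            yield name, pos + ref_len - 1, "right", seq[-ops[-1][0]:]
```

The clipped sequences could be written to FASTA and searched against the host reference; clips that cluster at the T-DNA borders and hit a single host locus would be the candidate integration sites. Chimeric read pairs (one mate on the plasmid, the other on the host) are not handled here.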
Are there tools out there that can do this?
Or is the approach in the paper above a good way to tackle this kind of problem?