I am working on some data but need a little bit of help with a bit of an unusual task. We are looking at where lentiviral DNA has inserted itself in our host genome, and to do this we need to find the boundaries of the junctions between viral/human DNA. I need to be able to look at only the soft/hard clipped reads within my bam file. Does anyone know a way to do this with awk or some other tool? Thanks! What I have tried:
Have you considered using a SV breakpoint calling tool (using a reference that includes both host and viral sequence), or a viral integration detection tool? Both of the above approaches should work if your integration site is clonally expanded. If not, there's other software designed for this purpose, although they usually require special sequencing protocol (e.g HIV integration site detection).
To actually answer you question, you can use the ExtractSVReads tool within gridss to do this natively within bam: