I want to identify HBV (hepatitis B virus) integration sites in the human genome.
I have single-end CAGE-seq data (reads ~30 nucleotides long) from HBV patients. I mapped the reads to the human genome (Bowtie2 version 2.29), and then mapped the unmapped reads to the HBV genome.
From the reads that remain unmapped, I want to find those that align partially to human and partially to HBV. If the data were paired-end, this would have been slightly easier. Can you please suggest how to do this systematically (the logic is enough; I can implement it) so that I can find the integration sites?
I am thinking of fragmenting the unmapped reads (keeping the same FASTQ ID), mapping the fragments to both human and HBV, and identifying which IDs map to both.
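To make the fragmenting idea concrete, here is a minimal sketch in Python. It assumes each read is cut once at its midpoint and that a `|L` or `|R` suffix is appended to the original FASTQ ID so the two halves can be paired up again after mapping. The function names, the suffix convention, and the `min_len` cutoff are all hypothetical choices for illustration. A single midpoint cut will miss junctions that lie close to a read end, so a sliding cut position may be needed in practice.

```python
from collections import defaultdict


def split_record(read_id, seq, qual, min_len=12):
    """Cut one FASTQ record at its midpoint into a left and a right fragment.

    The original ID is kept and only a '|L' / '|R' suffix is added
    (a hypothetical convention). Reads too short to give two fragments
    of at least min_len bases are skipped.
    """
    if len(seq) < 2 * min_len:
        return []
    mid = len(seq) // 2
    return [
        (f"{read_id}|L", seq[:mid], qual[:mid]),
        (f"{read_id}|R", seq[mid:], qual[mid:]),
    ]


def candidate_ids(human_mapped, hbv_mapped):
    """Return original read IDs with one fragment on human and the other on HBV.

    human_mapped / hbv_mapped are iterables of fragment names that aligned
    to the respective genome (for example, read names pulled from each SAM file).
    """
    sides = defaultdict(lambda: {"human": set(), "hbv": set()})
    for genome, names in (("human", human_mapped), ("hbv", hbv_mapped)):
        for name in names:
            base, _, side = name.rpartition("|")
            sides[base][genome].add(side)

    result = set()
    for base, hit in sides.items():
        # Require that different halves support the two genomes, so a single
        # fragment that maps to both does not count as a junction.
        if any(h != v for h in hit["human"] for v in hit["hbv"]):
            result.add(base)
    return result
```

For example, a 30-nt read `r1` becomes `r1|L` and `r1|R` (15 nt each); if `r1|L` maps to human and `r1|R` maps to HBV, `candidate_ids` reports `r1`.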
Any suggestions on how I could do this efficiently? Would using BWA help in this case?
Thank you !!
Thank you, Pierre! So you suggest merging the human and virus genome sequences, then mapping the unmapped reads with BWA against the combined assembly. Reads that map partially to human and partially to the virus will be flagged as discordant. I think it is a neat idea. What is the flag for discordant reads?
There is no specific SAM flag; look at the link: Extracting chimeric reads from mapping
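As one possible way to pull such reads out of a SAM file from the combined human + HBV reference, the sketch below relies on the `SA:Z:` optional tag, which the SAM specification defines for chimeric (split) alignments and which BWA-MEM writes. It is only an illustration, not necessarily the method described at the link: the function name `chimeric_reads` and the contig name `HBV` are assumptions, and the viral contig must be named however it appears in your merged FASTA.

```python
def chimeric_reads(sam_lines, virus_contigs):
    """Find reads whose primary and supplementary alignments are on
    different sides of the human / virus divide.

    sam_lines: iterable of SAM text lines (headers are skipped).
    virus_contigs: names of the viral sequences in the combined reference.
    Returns tuples of
    (read_id, primary_contig, primary_pos, primary_cigar,
     other_contig, other_pos, other_cigar).
    """
    virus = set(virus_contigs)
    hits = []
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800)
        # records; the primary record carries the SA tag for the other parts.
        if flag & 0x4 or flag & 0x900:
            continue
        sa = next((f[5:] for f in fields[11:] if f.startswith("SA:Z:")), None)
        if sa is None:
            continue
        primary_is_virus = fields[2] in virus
        # SA entries look like: rname,pos,strand,CIGAR,mapQ,NM;
        for entry in sa.rstrip(";").split(";"):
            rname, pos, _strand, cigar, _mapq, _nm = entry.split(",")
            if (rname in virus) != primary_is_virus:
                hits.append(
                    (fields[0], fields[2], int(fields[3]), fields[5],
                     rname, int(pos), cigar)
                )
    return hits
```

Usage would be along the lines of `chimeric_reads(open("combined.sam"), ["HBV"])`, where `combined.sam` is a placeholder file name. The positions and soft-clip lengths in the two CIGAR strings then indicate where the human/virus junction falls within the read.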