Hi, I have a BAM file from WGS sequencing in which I am trying to identify the integration site(s) of a transgene in the mouse genome.
I mapped the paired-end reads with bwa against the indexed transgene sequence. The goal was to identify pairs in which only one read maps to the transgene (while its mate maps to the mouse genome), and reads that are soft-clipped because only part of the read maps to the transgene.
I know that the SA tag marks such reads, but I am not sure how to extract them from the BAM file. When I run grep SA: file.bam, do I extract both reads of the pair or only the one that mapped to the transgene?
Is it better to use samtools to extract reads with a specific flag? Would both reads of the pair be extracted then?
Is there a better way to identify such integration sites of exogenous sequences?
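On the question of whether both reads of a pair come out: filtering on a tag or a flag works record by record, so the mate is not carried along automatically. Note also that BAM is compressed binary, so grep on the file itself will generally not match the record text; the records need to be converted to SAM text first (for example with samtools view -h). A minimal sketch of a two-pass selection, assuming the input is SAM text lines and that holding them in memory is acceptable (the function names and the demo records are made up for illustration):

```python
import re

def has_sa_tag(fields):
    """True if any optional field (column 12 onward) is an SA:Z: tag."""
    return any(f.startswith("SA:Z:") for f in fields[11:])

def is_soft_clipped(cigar):
    """True if the CIGAR string contains a soft-clip (S) operation."""
    return re.search(r"\d+S", cigar) is not None

def select_pairs(sam_lines):
    """Return all records whose read name has a split or soft-clipped alignment."""
    records = [
        line.rstrip("\n").split("\t")
        for line in sam_lines
        if line.strip() and not line.startswith("@")  # skip header lines
    ]
    # Pass 1: collect the names of reads with an SA tag or a soft clip.
    wanted = {r[0] for r in records if has_sa_tag(r) or is_soft_clipped(r[5])}
    # Pass 2: keep every record with one of those names, so mates come along.
    return [r for r in records if r[0] in wanted]

# Made-up records: r1 is split between the transgene and chr5, r2 is an ordinary pair.
demo_sam = [
    "@SQ\tSN:transgene\tLN:5000",
    "\t".join(["r1", "65", "transgene", "100", "60", "60M40S", "chr5", "2000", "0",
               "*", "*", "SA:Z:chr5,2000,+,60S40M,60,0;"]),
    "\t".join(["r1", "129", "chr5", "2000", "60", "100M", "transgene", "100", "0", "*", "*"]),
    "\t".join(["r2", "99", "transgene", "300", "60", "100M", "=", "500", "300", "*", "*"]),
    "\t".join(["r2", "147", "transgene", "500", "60", "100M", "=", "300", "-300", "*", "*"]),
]

selected = select_pairs(demo_sam)
for rec in selected:
    print(rec[0], rec[1], rec[2], rec[5])
```

For a real WGS file the same idea would more sensibly be done by streaming the file twice, or by writing the read names to a file and filtering on them, rather than loading every record into a list.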
see Extracting chimeric reads from mapping
Thanks Pierre, but I am not sure how to adapt this snippet to my needs. The BAM file I have is the result of mapping against a combined index of the mouse genome and the transgene. How does it know which chromosome belongs to the mouse (e.g. host) and which to the transgene (e.g. virus)? Do I need to change the script to something like this?
But a lot of the reads I have extracted don't have any SA tags. Any ideas why this is?
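One likely reason, offered as a sketch rather than a diagnosis and assuming bwa mem was used: the SA tag is written only on reads that have a split (supplementary) alignment. A pair in which one mate maps entirely to the transgene and the other entirely to a mouse chromosome contains no split read, so neither record carries SA; such pairs show up instead as a mismatch between the RNAME (column 3) and RNEXT (column 7) fields. With a combined reference, the transgene is recognised simply by its contig name. The example below assumes that name is "transgene", which is hypothetical and should be replaced by whatever the FASTA header says; the demo records are made up.

```python
TRANSGENE = "transgene"  # hypothetical contig name; use the one from your FASTA header

def classify(line, transgene=TRANSGENE):
    """Label one SAM record as 'split', 'discordant', 'other' or 'unmapped'."""
    f = line.rstrip("\n").split("\t")
    flag, rname, rnext = int(f[1]), f[2], f[6]
    if flag & 0x4:  # 0x4: this read is unmapped
        return "unmapped"
    # Split read: the SA tag lists the other part(s) of a chimeric alignment.
    sa = next((x[5:] for x in f[11:] if x.startswith("SA:Z:")), None)
    if sa is not None:
        contigs = {rname} | {aln.split(",")[0] for aln in sa.split(";") if aln}
        if transgene in contigs and len(contigs) > 1:
            return "split"
    # Discordant pair: exactly one of the two mates lies on the transgene.
    paired_and_mate_mapped = bool(flag & 0x1) and not (flag & 0x8)
    mate_contig = rname if rnext == "=" else rnext  # '=' means same contig as the read
    if paired_and_mate_mapped and (rname == transgene) != (mate_contig == transgene):
        return "discordant"
    return "other"

demo_records = [
    "\t".join(["r1", "65", "transgene", "100", "60", "60M40S", "chr5", "2000", "0",
               "*", "*", "SA:Z:chr5,2000,+,60S40M,60,0;"]),
    "\t".join(["r1", "129", "chr5", "2000", "60", "100M", "transgene", "100", "0", "*", "*"]),
    "\t".join(["r2", "99", "transgene", "300", "60", "100M", "=", "500", "300", "*", "*"]),
    "\t".join(["r3", "77", "*", "0", "0", "*", "*", "0", "0", "*", "*"]),
]

labels = [classify(rec) for rec in demo_records]
for rec, label in zip(demo_records, labels):
    print(rec.split("\t")[0], label)
```

Records labelled "discordant" are the ones that would be missed by looking for SA alone. The mouse-side coordinates of both classes (the SA entry for split reads, RNEXT and PNEXT for discordant pairs) point towards candidate integration sites.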
What's the point of giving a standard reply if you're not willing to help me understand the answer you give? I have tried to understand your tool and the comments in the linked answer, but I am not sure whether this is the correct method. I would appreciate further help.