extracting orphan reads from bam file
Entering edit mode
5.1 years ago
Assa Yeroslaviz ★ 1.8k

Hi, I have a bam file from WGS sequencing, wehre I try to identify the integration site(s) of a transgene insertion in the mouse genome.

I have run the mapping of the paired-end samples with bwa against the indexed transgene sequence. The goal was to identify those reads which were mapped with only one read of the pair to the transgene (while the other read of the pair will map to the mouse genome) or such reads, which were soft-clipped, as only part of the read was mapped to the transgene.

I know that the flag SA stands for such reads, but I am not sure how to extract such reads from the bam file. When I do grep SA: file.bam do I extract both pairs or only the one which was mapped to the transgene.

Is it better to use samtools to extract reads with a specific flag. Are both reads of the pair being extracted then?

Is there a better way to identify such integration sites of exogenous sequences?



bam samtools bwa • 2.1k views
Entering edit mode
Entering edit mode

Thanks Pierre, I am not sure how to use this snippet to my needs. The bam file I have is the results of the combined indexed genome from mouse and transgene. How does it knows which chromosome belongs to mouse and (e.g. host) which to transgene (e.g. virus)? Do I need to change the script to something like this-

java -jar ~/software/jvarkit/dist/samjdk.jar -e '
final Set<String> virus_chrom = new HashSet<>(Arrays.asList("transgene"));
if(record.getReadUnmappedFlag()) return false;
if(record.getReadPairedFlag() && !record.getMateUnmappedFlag())
return false;'  combined.sorted.bam > combined.sorted.ExtractedReads.bam

But a lot of the reads I have extracted don't have any SA flags to them

head combined.sorted.ExtractedReads.bam | head
K00289:70:HL7LCBBXX:3:1202:28351:39084  145     10      115415155       60      71M     transgene       4310    0       TACT...CCTT 7...A MC:Z:125M       MD:Z:23A47      NM:i:1  AS:i:66 XS:i:19
K00289:70:HL7LCBBXX:5:1216:29731:36939  145     12      113427284       41      2S149M  transgene       4297    0       TCG...AATT 7...A   XA:Z:transgene,-4640,16M1I117M1I16M,3;  MC:Z:151M       MD:Z:149        NM:i:0  AS:i:149        XS:i:130
K00289:70:HL7LCBBXX:5:1126:7141:9473    81      12      113427289       45      151M    transgene       4281    0       CTAA...CAGG J...A   XA:Z:transgene,-4647,9M1I117M1I23M,3;   MC:Z:150M       MD:Z:151        NM:i:0  AS:i:151        XS:i:130
K00289:70:HL7LCBBXX:5:1220:14255:6150   81      12      113428791       47      96M55S  transgene       5791    0       TCAC...AGA 7...A   SA:Z:7,36210485,+,44M107S,47,0; XA:Z:transgene,-6148,26M1I59M65S,1;     MC:Z:151M       MD:Z:96 NM:i:0  AS:i:96 XS:i:78

Any ideas why this is?

thanks, Assa

Entering edit mode

What's the point of giving a standard reply, if you're not willing to help understanding the answer you give? I have tried to understand your tool and the comments in the linked answer, but I am not sure if this is the correct method. I would appreciate further help.


Login before adding your answer.

Traffic: 1009 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6