Splitting a bam file by unique optional TAG field
1
3
Entering edit mode
6.8 years ago
dccroucher ▴ 30

I have a large bam file that I need to subset/split into separate bam files by unique entries of one of the optional TAG:TYPE:VALUE fields. As a simplified example, say these are the optional fields in my example.bam file (i.e. fields following the mandatory 1-11 fields):

NH:i:6  HI:i:3  AS:i:94 nM:i:1  NM:i:1  CR:Z:TGCCCATCACTCAGGC   CY:Z:CCCBBFGGGGGGGGGG   CB:Z:TGCCCATCACTCAGGC-1  UR:Z:ACGCCCAGGT  UY:Z:GGGGGGGGGG  UB:Z:ACGCCCAGGT  BC:Z:TGCGCAGC   QT:Z:CCCCCGGG   RG:Z:XXXXXX

NH:i:6  HI:i:3  AS:i:94 nM:i:1  NM:i:1  CR:Z:TGCCCATCACTCAGGC   CY:Z:CCCBBFGGGGGGGGGG   CB:Z:TGCCCATCACTCAGGC-1 UR:Z:ACGCCCAGGT 

NH:i:7  HI:i:3  AS:i:96 nM:i:0  NM:i:0  CR:Z:TGCCCATCACTCAGGC   CY:Z:BCBBCFGGGGGGGGGG   CB:Z:TGCCCATCACTCAGGC-1 UR:Z:ACGCCCAGGT 

NH:i:7  HI:i:3  AS:i:96 nM:i:0  NM:i:0  CR:Z:TGCCCATCACTCAGGC   CY:Z:CCCCCGGGGGGGGGGG   CB:Z:TGCCCATCACTCAGGC-1 UR:Z:ACGCCCAGGT 

NH:i:3  HI:i:2  AS:i:96 nM:i:0  NM:i:0  CR:Z:TGCCCATCACTCAGGC   CY:Z:CCCCCGGGGGGGGGGG   CB:Z:TGCCCATCACTCAGGC-1 UR:Z:ACGCCCAGGT 

NH:i:3  HI:i:2  AS:i:96 nM:i:0  NM:i:0  CR:Z:TGCCCATCACTCAGGC   CY:Z:CCCCCGGGGGGGGGGG   CB:Z:TGCCCATCACTCAGGC-1 UR:Z:ACGCCCAGGT 

NH:i:6  HI:i:4  AS:i:96 nM:i:0  NM:i:0  CR:Z:CCGGTAGGTCATACTG   CY:Z:CCCCCGGGGGGGGGGG   CB:Z:CCGGTAGGTCATACTG-1 UR:Z:GCCGCCTTCT 

NH:i:6  HI:i:4  AS:i:96 nM:i:0  NM:i:0  CR:Z:CCGGTAGGTCATACTG   CY:Z:CCCCCGGGGGGGGGGG   CB:Z:CCGGTAGGTCATACTG-1 UR:Z:GCCGCCTTCT 

NH:i:6  HI:i:4  AS:i:96 nM:i:0  NM:i:0  CR:Z:CCGGTAGGTCATACTG   CY:Z:CCCCCGGGGGGGGGGG   CB:Z:CCGGTAGGTCATACTG-1 UR:Z:GCCGCCTTCT 

NH:i:6  HI:i:4  AS:i:96 nM:i:0  NM:i:0  CR:Z:CCGGTAGGTCATACTG   CY:Z:BBBCBGGEGGGGGGGG   CB:Z:CCGGTAGGTCATACTG-1 UR:Z:GCCGCCTTCT 

NH:i:6  HI:i:4  AS:i:96 nM:i:0  NM:i:0  CR:Z:CCGGTAGGTCATACTG   CY:Z:CCCCCGGGGGGGGGGG   CB:Z:CCGGTAGGTCATACTG-1 UR:Z:GCCGCCTTCT

How would I split this bam file by unique CB:Z: entries, such that the results would be 2 bam files as follows:

TGCCCATCACTCAGGC-1.bam (contains 6 reads) CCGGTAGGTCATACTG-1.bam (contains 5 reads)

Thanks! D

sequencing alignment merge bam • 9.4k views
ADD COMMENT
0
Entering edit mode

See this past thread: Sorting .bam file by tag
looks like bamtools (https://github.com/pezmaster31/bamtools/wiki ) may be able to do this.

ADD REPLY
0
Entering edit mode

bamtools split -in test.bam -tag CB

Try this but the problem is there are too many files and bamtools can't open the bam file for writing.

See https://github.com/pezmaster31/bamtools/issues/135

ADD REPLY
6
Entering edit mode
6.8 years ago

run a loop to find the distinct tag and pipe this into a samtools+awk to filter. Something like (not tested)

samtools view  in.bam | cut -f 12- | tr "\t" "\n"  | grep  "^CR:Z:"  | cut -d ':' -f 3 | sort | uniq | while read S; do samtools view -h in.bam |  awk -v tag="CR:Z:$S" '($0 ~ /^@/ || index($0,tag)>0)' > ${S}.sam ; done
ADD COMMENT
1
Entering edit mode

Shouldn't this be cut -f 12- rather than cut -f 12?

ADD REPLY
0
Entering edit mode

yes, you're right ! thanks, I will fix it

ADD REPLY
0
Entering edit mode

Thanks so much for your quick response Pierre! Would you be able to walk me through your loop command, I'm not sure I understand exactly what it's doing...

Thanks :)

ADD REPLY
3
Entering edit mode
  • samtools view show the bam lines
  • cut after column 12 after the QUAL
  • tr convert tab to newline
  • grep : keep lines starting with CR:Z (oppps there was an error , option -F was not needed)
  • cut after 3rd colon to get the value
  • sort / uniq to et the distinct tags

for each tag

  • print sam with header
  • use awk to keep the header and the SAM line containing the tag
ADD REPLY

Login before adding your answer.

Traffic: 1679 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6