Question: Splitting a bam file by unique optional TAG field
2
gravatar for dccroucher
21 months ago by
dccroucher20
dccroucher20 wrote:

I have a large bam file that I need to subset/split into separate bam files by unique entries of one of the optional TAG:TYPE:VALUE fields. As a simplified example, say these are the optional fields in my example.bam file (i.e. fields following the mandatory 1-11 fields):

NH:i:6  HI:i:3  AS:i:94 nM:i:1  NM:i:1  CR:Z:TGCCCATCACTCAGGC   CY:Z:CCCBBFGGGGGGGGGG   CB:Z:TGCCCATCACTCAGGC-1  UR:Z:ACGCCCAGGT  UY:Z:GGGGGGGGGG  UB:Z:ACGCCCAGGT  BC:Z:TGCGCAGC   QT:Z:CCCCCGGG   RG:Z:XXXXXX

NH:i:6  HI:i:3  AS:i:94 nM:i:1  NM:i:1  CR:Z:TGCCCATCACTCAGGC   CY:Z:CCCBBFGGGGGGGGGG   CB:Z:TGCCCATCACTCAGGC-1 UR:Z:ACGCCCAGGT 

NH:i:7  HI:i:3  AS:i:96 nM:i:0  NM:i:0  CR:Z:TGCCCATCACTCAGGC   CY:Z:BCBBCFGGGGGGGGGG   CB:Z:TGCCCATCACTCAGGC-1 UR:Z:ACGCCCAGGT 

NH:i:7  HI:i:3  AS:i:96 nM:i:0  NM:i:0  CR:Z:TGCCCATCACTCAGGC   CY:Z:CCCCCGGGGGGGGGGG   CB:Z:TGCCCATCACTCAGGC-1 UR:Z:ACGCCCAGGT 

NH:i:3  HI:i:2  AS:i:96 nM:i:0  NM:i:0  CR:Z:TGCCCATCACTCAGGC   CY:Z:CCCCCGGGGGGGGGGG   CB:Z:TGCCCATCACTCAGGC-1 UR:Z:ACGCCCAGGT 

NH:i:3  HI:i:2  AS:i:96 nM:i:0  NM:i:0  CR:Z:TGCCCATCACTCAGGC   CY:Z:CCCCCGGGGGGGGGGG   CB:Z:TGCCCATCACTCAGGC-1 UR:Z:ACGCCCAGGT 

NH:i:6  HI:i:4  AS:i:96 nM:i:0  NM:i:0  CR:Z:CCGGTAGGTCATACTG   CY:Z:CCCCCGGGGGGGGGGG   CB:Z:CCGGTAGGTCATACTG-1 UR:Z:GCCGCCTTCT 

NH:i:6  HI:i:4  AS:i:96 nM:i:0  NM:i:0  CR:Z:CCGGTAGGTCATACTG   CY:Z:CCCCCGGGGGGGGGGG   CB:Z:CCGGTAGGTCATACTG-1 UR:Z:GCCGCCTTCT 

NH:i:6  HI:i:4  AS:i:96 nM:i:0  NM:i:0  CR:Z:CCGGTAGGTCATACTG   CY:Z:CCCCCGGGGGGGGGGG   CB:Z:CCGGTAGGTCATACTG-1 UR:Z:GCCGCCTTCT 

NH:i:6  HI:i:4  AS:i:96 nM:i:0  NM:i:0  CR:Z:CCGGTAGGTCATACTG   CY:Z:BBBCBGGEGGGGGGGG   CB:Z:CCGGTAGGTCATACTG-1 UR:Z:GCCGCCTTCT 

NH:i:6  HI:i:4  AS:i:96 nM:i:0  NM:i:0  CR:Z:CCGGTAGGTCATACTG   CY:Z:CCCCCGGGGGGGGGGG   CB:Z:CCGGTAGGTCATACTG-1 UR:Z:GCCGCCTTCT

How would I split this bam file by unique CB:Z: entries, such that the results would be 2 bam files as follows:

TGCCCATCACTCAGGC-1.bam (contains 6 reads) CCGGTAGGTCATACTG-1.bam (contains 5 reads)

Thanks! D

sequencing alignment merge bam • 2.1k views
ADD COMMENTlink modified 21 months ago by Pierre Lindenbaum118k • written 21 months ago by dccroucher20

See this past thread: Sorting .bam file by tag
looks like bamtools (https://github.com/pezmaster31/bamtools/wiki ) may be able to do this.

ADD REPLYlink modified 21 months ago • written 21 months ago by genomax65k

bamtools split -in test.bam -tag CB

Try this but the problem is there are too many files and bamtools can't open the bam file for writing.

See https://github.com/pezmaster31/bamtools/issues/135

ADD REPLYlink written 15 months ago by marong05110
4
gravatar for Pierre Lindenbaum
21 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

run a loop to find the distinct tag and pipe this into a samtools+awk to filter. Something like (not tested)

samtools view  in.bam | cut -f 12- | tr "\t" "\n"  | grep  "^CR:Z:"  | cut -d ':' -f 3 | sort | uniq | while read S; do samtools view -h in.bam |  awk -v tag="CR:Z:$S" '($0 ~ /^@/ || index($0,tag)>0)' > ${S}.sam ; done
ADD COMMENTlink modified 5 months ago • written 21 months ago by Pierre Lindenbaum118k
1

Shouldn't this be cut -f 12- rather than cut -f 12?

ADD REPLYlink modified 5 months ago • written 5 months ago by WouterDeCoster38k

yes, you're right ! thanks, I will fix it

ADD REPLYlink written 5 months ago by Pierre Lindenbaum118k

Thanks so much for your quick response Pierre! Would you be able to walk me through your loop command, I'm not sure I understand exactly what it's doing...

Thanks :)

ADD REPLYlink written 21 months ago by dccroucher20
2
  • samtools view show the bam lines
  • cut after column 12 after the QUAL
  • tr convert tab to newline
  • grep : keep lines starting with CR:Z (oppps there was an error , option -F was not needed)
  • cut after 3rd colon to get the value
  • sort / uniq to et the distinct tags

for each tag

  • print sam with header
  • use awk to keep the header and the SAM line containing the tag
ADD REPLYlink written 21 months ago by Pierre Lindenbaum118k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1356 users visited in the last hour