Question: Sorting .bam file by tag
0
gravatar for Srw
22 months ago by
Srw10
Charlottesville, VA
Srw10 wrote:

Hi Everyone, I'm trying to split a .bam file by a specific tag but am running into memory issues because the .bam file is position sorted and not tag sorted.

I'd like to sort by a tag called 'BX' and cannot find any information in the samtools or bamtools documentation on this. If anyone has any suggestions it would be greatly appreciated.

Thanks!

sequencing bamtools samtools • 1.8k views
ADD COMMENTlink modified 19 months ago by cif0770 • written 22 months ago by Srw10

split or sort, what is your problem? Please specify your need by an example giving a couple of initial lines from the bam-files

ADD REPLYlink written 22 months ago by Santosh Anand4.4k

Ultimately I want to split by tag but I believe I am running into memory problems because it's not sorted by the tag itself.

Here is the first line of my .bam file

ST-K00126:303:HCJNFBBXX:5:2111:5030:25791 65 chr1 9999 0 11S66M50S chr5 18606706 0 TTCTTTTTCTGGATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAGAGATAACTATTGATACAACACCTTCATGACCCTAAGGTACTATCATAG JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJFJJAJFJFJJJJJFFJJAF7JJJJFJJJJAJJJFJJJA<F<A<- NM:i:0 MD:Z:66 AS:i:66 XS:i:65 SA:Z:chr5,18606884,-,52M75S,0,0; RX:Z:TCCACACGTCCGGTTG QX:Z:AAFFFJJJJJJJJAFJ BX:Z:TCCACACGTCCGGTTG-1 BC:Z:AACCGTAA QT:Z:AAFFFFFJ RG:Z:35127:35127:1:HCJNFBBXX:5

ADD REPLYlink modified 22 months ago • written 22 months ago by Srw10

What are all possible values of the tag BX? AFAIR, you can split only on RG tag using samtools, not on any arbitrary tag. But it should not be difficult as I guess you can easily "grep" each of the tags individually if they are not a lot of them, and direct output to different files. BTW, what commands have you tried?

ADD REPLYlink written 22 months ago by Santosh Anand4.4k

There are many values for BX so a grep approach would be very slow. I've done bamtools split -in myfile.bam -tag BX

This splits the original .bam but only writes to a few of them because I'm running into this memory issue which I think is because the file is not correctly sorted by 'BX'

ADD REPLYlink written 22 months ago by Srw10

If that works, wouldnt it be a good idea to split your data file into different sub-parts. Then join all your split files later

ADD REPLYlink written 22 months ago by sridhar56100

I'm unable to understand your reasoning :-( So how did you deduce that 1) it's a memory issue and 2) that the bam needs to be TAG-sorted as you mentioned previously. The doc doesnt say anything about the sorting requirement AFAI can see.

ADD REPLYlink written 22 months ago by Santosh Anand4.4k

Sorry this seems to be unclear. I thought I might be running into memory issues because I'm able to make the files to write to but only very few get written before the error "unable to open file tagxxx.bam for writting". My thought was that sorting could be one way to overcome the issue but I may be wrong.

ADD REPLYlink modified 22 months ago • written 22 months ago by Srw10

Are you sure that some of rhe split bam files are generated and are correct. Normally a 'file not open for writing' problem is related to file permissions, not memory. Do you have permissions to create file in the directory where bams are supposed to be created? My hunch is that there is some silly mistake and it's not a memory issue

ADD REPLYlink written 22 months ago by Santosh Anand4.4k

Yes, I have permissions and a small number of the files are being written.

ADD REPLYlink written 22 months ago by Srw10

I believe that his happens as you correctly mention because the file is not sorted by your tag but not because of memory. It seems to try to open a new file when the tag is encountered. When you run on your unsorted (by tag) file, it tries to open for writing a file that has an already open file handler and breaks. So sorting might fix your issue (or a change in the code to append, instead of opening a new file every time).

ADD REPLYlink written 21 months ago by IV1.2k
1
gravatar for Pierre Lindenbaum
22 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum116k wrote:

Using awk and the tag 'NM' (insert a column containing the sort key, void for headers)

samtools view -h input.bam  |\
awk -F '\t' 'NR==1 {printf("\t@HD\t%s\tSO:unsorted\n",$2);next} /^@/ {printf("\t%s\n",$0);next;} {v="";for(i=12;i<=NR;++i) {split($i,a,/[\:]/);if(a[1]=="NM") {v=a[3];break;} } printf("%s\t%s\n",v,$0);}' |\
LC_ALL=C sort -t $'\t' -k1,1g  |\
cut -f 2-|\
samtools view -S -b -o out.bam
ADD COMMENTlink written 22 months ago by Pierre Lindenbaum116k

Thanks for the answer,

Would you be able to sort by tag 'XB' but then also by chromosome and position ? I tried :

samtools view -h input.bam  |\
awk -F '\t' 'NR==1 {printf("\t@HD\t%s\tSO:unsorted\n",$2);next} /^@/ {printf("\t%s\n",$0);next;} {v="";for(i=12;i<=NR;++i) {split($i,a,/[\:]/);if(a[1]=="NM") {v=a[3];break;} } printf("%s\t%s\n",v,$0);}' |\
LC_ALL=C sort -t $'\t' -k1,1g -k4.4,1n -k5,1n  |\
cut -f 2-|\
samtools view -S -b -o out.bam

Adding -k4.4,1n -k5,1n didn't do anything. The output bam is correctly sorted by tag ('XB') but not by chromosome and pos.

My bam line :

SN7001339:1083:HT3L3BCX2:1:1106:18349:82021 83  chr10   131390268   255 101M    =   131390102   -267    TTGGGCCTTGACTTCATCATGGGCTGCTTTCATCTTGTCTTCCTAATCCATTTCCACCCACCTGGACTCCTGAGCACCATAGTCCTCTGGCCTGGAGCCCC   IIIHH?GIIHFIIIHHIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIHIHIIIHHG?IIIHIHIIIHIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIDDDDD   XA:i:0  MD:Z:101    NM:i:0  XS:i:131390102  XB:Z:BC100069
ADD REPLYlink modified 10 days ago • written 10 days ago by pacome.pr0

I managed to make it work in the end -k1,1g -k4.4,1n -k5,1n must be -k1,1 -k4.4,4g -k5,5n. I don't know why I had to remove the "g" in front of -k1,1 to make this work. The chromosomes are not sorted numerically but this is not really important in my case.

ADD REPLYlink written 7 days ago by pacome.pr0
0
gravatar for cif077
19 months ago by
cif0770
Taiwan
cif0770 wrote:

Split by BX tag

samtools view input.bam |\
awk -F "\t" 'match($0, /BX:Z:[ACGT]{16}-1/) {if (RSTART>0){OUTPUT=substr($0,RSTART+5,18); print $0 >> OUTPUT".sam"}}'

Get original header

samtools view -H input.bam > header.sam

Create new bams

for f in BX*.sam
do
  b=`basename $f .sam`
  cat header.sam $f.sam | samtools view -b - > $b.bam
done
ADD COMMENTlink modified 19 months ago • written 19 months ago by cif0770

There is no BX*.sam, because OUTPUT=substr($0,RSTART+5,18)

ADD REPLYlink written 18 months ago by verne9110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1305 users visited in the last hour