Sorting .bam file by tag
2
0
Entering edit mode
7.1 years ago
Srw ▴ 60

Hi Everyone, I'm trying to split a .bam file by a specific tag but am running into memory issues because the .bam file is position sorted and not tag sorted.

I'd like to sort by a tag called 'BX' and cannot find any information in the samtools or bamtools documentation on this. If anyone has any suggestions it would be greatly appreciated.

Thanks!

bamtools samtools sequencing • 6.8k views
ADD COMMENT
0
Entering edit mode

split or sort, what is your problem? Please specify your need by an example giving a couple of initial lines from the bam-files

ADD REPLY
0
Entering edit mode

Ultimately I want to split by tag but I believe I am running into memory problems because it's not sorted by the tag itself.

Here is the first line of my .bam file

ST-K00126:303:HCJNFBBXX:5:2111:5030:25791 65 chr1 9999 0 11S66M50S chr5 18606706 0 TTCTTTTTCTGGATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAGAGATAACTATTGATACAACACCTTCATGACCCTAAGGTACTATCATAG JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJFJJAJFJFJJJJJFFJJAF7JJJJFJJJJAJJJFJJJA<F<A<- NM:i:0 MD:Z:66 AS:i:66 XS:i:65 SA:Z:chr5,18606884,-,52M75S,0,0; RX:Z:TCCACACGTCCGGTTG QX:Z:AAFFFJJJJJJJJAFJ BX:Z:TCCACACGTCCGGTTG-1 BC:Z:AACCGTAA QT:Z:AAFFFFFJ RG:Z:35127:35127:1:HCJNFBBXX:5

ADD REPLY
0
Entering edit mode

What are all possible values of the tag BX? AFAIR, you can split only on RG tag using samtools, not on any arbitrary tag. But it should not be difficult as I guess you can easily "grep" each of the tags individually if they are not a lot of them, and direct output to different files. BTW, what commands have you tried?

ADD REPLY
0
Entering edit mode

There are many values for BX so a grep approach would be very slow. I've done bamtools split -in myfile.bam -tag BX

This splits the original .bam but only writes to a few of them because I'm running into this memory issue which I think is because the file is not correctly sorted by 'BX'

ADD REPLY
0
Entering edit mode

If that works, wouldnt it be a good idea to split your data file into different sub-parts. Then join all your split files later

ADD REPLY
0
Entering edit mode

I'm unable to understand your reasoning :-( So how did you deduce that 1) it's a memory issue and 2) that the bam needs to be TAG-sorted as you mentioned previously. The doc doesnt say anything about the sorting requirement AFAI can see.

ADD REPLY
0
Entering edit mode

Sorry this seems to be unclear. I thought I might be running into memory issues because I'm able to make the files to write to but only very few get written before the error "unable to open file tagxxx.bam for writting". My thought was that sorting could be one way to overcome the issue but I may be wrong.

ADD REPLY
0
Entering edit mode

Are you sure that some of rhe split bam files are generated and are correct. Normally a 'file not open for writing' problem is related to file permissions, not memory. Do you have permissions to create file in the directory where bams are supposed to be created? My hunch is that there is some silly mistake and it's not a memory issue

ADD REPLY
0
Entering edit mode

Yes, I have permissions and a small number of the files are being written.

ADD REPLY
0
Entering edit mode

I believe that his happens as you correctly mention because the file is not sorted by your tag but not because of memory. It seems to try to open a new file when the tag is encountered. When you run on your unsorted (by tag) file, it tries to open for writing a file that has an already open file handler and breaks. So sorting might fix your issue (or a change in the code to append, instead of opening a new file every time).

ADD REPLY
1
Entering edit mode
7.1 years ago

Using awk and the tag 'NM' (insert a column containing the sort key, void for headers)

samtools view -h input.bam  |\
awk -F '\t' 'NR==1 {printf("\t@HD\t%s\tSO:unsorted\n",$2);next} /^@/ {printf("\t%s\n",$0);next;} {v="";for(i=12;i<=NR;++i) {split($i,a,/[\:]/);if(a[1]=="NM") {v=a[3];break;} } printf("%s\t%s\n",v,$0);}' |\
LC_ALL=C sort -t $'\t' -k1,1g  |\
cut -f 2-|\
samtools view -S -b -o out.bam
ADD COMMENT
0
Entering edit mode

Thanks for the answer,

Would you be able to sort by tag 'XB' but then also by chromosome and position ? I tried :

samtools view -h input.bam  |\
awk -F '\t' 'NR==1 {printf("\t@HD\t%s\tSO:unsorted\n",$2);next} /^@/ {printf("\t%s\n",$0);next;} {v="";for(i=12;i<=NR;++i) {split($i,a,/[\:]/);if(a[1]=="NM") {v=a[3];break;} } printf("%s\t%s\n",v,$0);}' |\
LC_ALL=C sort -t $'\t' -k1,1g -k4.4,1n -k5,1n  |\
cut -f 2-|\
samtools view -S -b -o out.bam

Adding -k4.4,1n -k5,1n didn't do anything. The output bam is correctly sorted by tag ('XB') but not by chromosome and pos.

My bam line :

SN7001339:1083:HT3L3BCX2:1:1106:18349:82021 83  chr10   131390268   255 101M    =   131390102   -267    TTGGGCCTTGACTTCATCATGGGCTGCTTTCATCTTGTCTTCCTAATCCATTTCCACCCACCTGGACTCCTGAGCACCATAGTCCTCTGGCCTGGAGCCCC   IIIHH?GIIHFIIIHHIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIHIHIIIHHG?IIIHIHIIIHIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIDDDDD   XA:i:0  MD:Z:101    NM:i:0  XS:i:131390102  XB:Z:BC100069
ADD REPLY
0
Entering edit mode

I managed to make it work in the end -k1,1g -k4.4,1n -k5,1n must be -k1,1 -k4.4,4g -k5,5n. I don't know why I had to remove the "g" in front of -k1,1 to make this work. The chromosomes are not sorted numerically but this is not really important in my case.

ADD REPLY
0
Entering edit mode
6.8 years ago
cif077 • 0

Split by BX tag

samtools view input.bam |\
awk -F "\t" 'match($0, /BX:Z:[ACGT]{16}-1/) {if (RSTART>0){OUTPUT=substr($0,RSTART+5,18); print $0 >> OUTPUT".sam"}}'

Get original header

samtools view -H input.bam > header.sam

Create new bams

for f in BX*.sam
do
  b=`basename $f .sam`
  cat header.sam $f.sam | samtools view -b - > $b.bam
done
ADD COMMENT
0
Entering edit mode

There is no BX*.sam, because OUTPUT=substr($0,RSTART+5,18)

ADD REPLY

Login before adding your answer.

Traffic: 2793 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6