Question: Sorting .bam file by tag
0
gravatar for Srw
17 months ago by
Srw10
Charlottesville, VA
Srw10 wrote:

Hi Everyone, I'm trying to split a .bam file by a specific tag but am running into memory issues because the .bam file is position sorted and not tag sorted.

I'd like to sort by a tag called 'BX' and cannot find any information in the samtools or bamtools documentation on this. If anyone has any suggestions it would be greatly appreciated.

Thanks!

sequencing bamtools samtools • 1.3k views
ADD COMMENTlink modified 14 months ago by cif0770 • written 17 months ago by Srw10

split or sort, what is your problem? Please specify your need by an example giving a couple of initial lines from the bam-files

ADD REPLYlink written 17 months ago by Santosh Anand3.7k

Ultimately I want to split by tag but I believe I am running into memory problems because it's not sorted by the tag itself.

Here is the first line of my .bam file

ST-K00126:303:HCJNFBBXX:5:2111:5030:25791 65 chr1 9999 0 11S66M50S chr5 18606706 0 TTCTTTTTCTGGATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAGAGATAACTATTGATACAACACCTTCATGACCCTAAGGTACTATCATAG JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJFJJAJFJFJJJJJFFJJAF7JJJJFJJJJAJJJFJJJA<F<A<- NM:i:0 MD:Z:66 AS:i:66 XS:i:65 SA:Z:chr5,18606884,-,52M75S,0,0; RX:Z:TCCACACGTCCGGTTG QX:Z:AAFFFJJJJJJJJAFJ BX:Z:TCCACACGTCCGGTTG-1 BC:Z:AACCGTAA QT:Z:AAFFFFFJ RG:Z:35127:35127:1:HCJNFBBXX:5

ADD REPLYlink modified 17 months ago • written 17 months ago by Srw10

What are all possible values of the tag BX? AFAIR, you can split only on RG tag using samtools, not on any arbitrary tag. But it should not be difficult as I guess you can easily "grep" each of the tags individually if they are not a lot of them, and direct output to different files. BTW, what commands have you tried?

ADD REPLYlink written 17 months ago by Santosh Anand3.7k

There are many values for BX so a grep approach would be very slow. I've done bamtools split -in myfile.bam -tag BX

This splits the original .bam but only writes to a few of them because I'm running into this memory issue which I think is because the file is not correctly sorted by 'BX'

ADD REPLYlink written 17 months ago by Srw10

If that works, wouldnt it be a good idea to split your data file into different sub-parts. Then join all your split files later

ADD REPLYlink written 17 months ago by sridhar56100

I'm unable to understand your reasoning :-( So how did you deduce that 1) it's a memory issue and 2) that the bam needs to be TAG-sorted as you mentioned previously. The doc doesnt say anything about the sorting requirement AFAI can see.

ADD REPLYlink written 17 months ago by Santosh Anand3.7k

Sorry this seems to be unclear. I thought I might be running into memory issues because I'm able to make the files to write to but only very few get written before the error "unable to open file tagxxx.bam for writting". My thought was that sorting could be one way to overcome the issue but I may be wrong.

ADD REPLYlink modified 17 months ago • written 17 months ago by Srw10

Are you sure that some of rhe split bam files are generated and are correct. Normally a 'file not open for writing' problem is related to file permissions, not memory. Do you have permissions to create file in the directory where bams are supposed to be created? My hunch is that there is some silly mistake and it's not a memory issue

ADD REPLYlink written 17 months ago by Santosh Anand3.7k

Yes, I have permissions and a small number of the files are being written.

ADD REPLYlink written 17 months ago by Srw10

I believe that his happens as you correctly mention because the file is not sorted by your tag but not because of memory. It seems to try to open a new file when the tag is encountered. When you run on your unsorted (by tag) file, it tries to open for writing a file that has an already open file handler and breaks. So sorting might fix your issue (or a change in the code to append, instead of opening a new file every time).

ADD REPLYlink written 16 months ago by IV1.2k
0
gravatar for Pierre Lindenbaum
17 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum111k wrote:

Using awk and the tag 'NM' (insert a column containing the sort key, void for headers)

samtools view -h input.bam  |\
awk -F '\t' 'NR==1 {printf("\t@HD\t%s\tSO:unsorted\n",$2);next} /^@/ {printf("\t%s\n",$0);next;} {v="";for(i=12;i<=NR;++i) {split($i,a,/[\:]/);if(a[1]=="NM") {v=a[3];break;} } printf("%s\t%s\n",v,$0);}' |\
LC_ALL=C sort -t $'\t' -k1,1g  |\
cut -f 2-|\
samtools view -S -b -o out.bam
ADD COMMENTlink written 17 months ago by Pierre Lindenbaum111k
0
gravatar for cif077
14 months ago by
cif0770
Taiwan
cif0770 wrote:

Split by BX tag

samtools view input.bam |\
awk -F "\t" 'match($0, /BX:Z:[ACGT]{16}-1/) {if (RSTART>0){OUTPUT=substr($0,RSTART+5,18); print $0 >> OUTPUT".sam"}}'

Get original header

samtools view -H input.bam > header.sam

Create new bams

for f in BX*.sam
do
  b=`basename $f .sam`
  cat header.sam $f.sam | samtools view -b - > $b.bam
done
ADD COMMENTlink modified 14 months ago • written 14 months ago by cif0770

There is no BX*.sam, because OUTPUT=substr($0,RSTART+5,18)

ADD REPLYlink written 13 months ago by verne9110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 660 users visited in the last hour