unsuccessful split of bam file by samtools and bamtools
1
0
Entering edit mode
5.6 years ago

Hi everyone.

I am trying to split a bam file in order to obtain only a bam file of a chromosome 19. I tried with both samtools and bamtools.

With samtools:

 samtools view  -b /dbfs/mnt/vg/wgs_bam%2FNA12878_20k_hg38%2FNA12878.bam chr19 > /dbfs/mnt/vg/wgs_bam%2FNA12878_20k_hg38%2FNA12878.chr19.bam

And bamtools:

bamtools split -in /dbfs/mnt/vg/wgs_bam%2FNA12878_20k_hg38%2FNA12878.bam  -reference

In both cases a header of split file includes all chromosomes. I tried also with other bam files the same thing and always get the same split output which header contains all chromosome instead of only chromosome 19. Would be very grateful for any advice.

Thank you a lot.

samtools bam bamtools • 4.0k views
ADD COMMENT
0
Entering edit mode

What do you mean by :

In both cases a header of split file includes all chromosomes

Can we have a visual representation of this ?

ADD REPLY
0
Entering edit mode

Thank you a lot! This is exactly what I overlooked. For a specific variant caller, I need to modify a header of bam file in order to contain only information of chromosome 19. Any advice on how to do it? Thank you a lot.

ADD REPLY
0
Entering edit mode

This is AFAIK not necessary. Keep in mind that most variant callers offer options to only use reads of a certain regions or chromosome, so there is often no need to filter the BAM file. Which one do you use?

ADD REPLY
0
Entering edit mode

It was recently developed, based on neural network, NeuSomatic. It seems to me that header of reference and bam files must match completely.

ADD REPLY
1
Entering edit mode

I think that should never be a requirement since it forces people to mess with the header of their bam files. Maybe it is worth to ask the authors of the package to change this. Anyway, if you really want to do this, here's a way (at your own risk):

samtools view -H in.bam | awk 'BEGIN { FS = OFS = "\t"; } {if ($1 == "@SQ" && $2 != "SN:chr19") { } else print; }' | samtools reheader - in.bam > out.bam
ADD REPLY
3
Entering edit mode
5.6 years ago
thomaskuilman ▴ 850

You seem to be doing things alright; if you perform

samtools view X.bam 19 | head -n 2
X.66615581     163     19      60000   0       76M     =       60051   127     AGATCACAGAGGCTGGGCTGCTCCCCACCCTCTGCACACCTCCTGCTTCTAACAGCAGAGCTGCCAGGCCAGGCCC    AAFFFJJJJFFJ7FAF77<<JJAJJJJJFJJJJJFJJFJJFJJJJJJFJJJA<JJF-FFJJ7--AAJF7FAAJJ##    XT:A:R  NM:i:0  XN:i:1  SM:i:0  AM:i:0  X0:i:4  X1:i:5  XM:i:0  XO:i:0  XG:i:0  MD:Z:76
X.46228        163     19      60001   0       76M     =       60048   123     GATCACAGAGGCTGGGCTGCTCCCCACCCTCTGCACACCTCCTGCTTCTAACAGCAGAGCTGCCAGGCCAGGCCCT    AA<FFJJJJJJJ<FJJJFFJJJJJJJJJJJJJJJJJJJJFJJJ7JJJJJJJJJJJJJJJJJAJJJJJJAAF<JJJ<    XT:A:R  NM:i:0  SM:i:0  AM:i:0  X0:i:4  X1:i:5  XM:i:0  XO:i:0  XG:i:0  MD:Z:76

you will only get the output of chromosome 19 (in my case there is no "chr" prefix in the chromosome names). In the header, however, all the chromosomes will be mentioned:

samtools view -h X.bam 19 | head -n 6
@HD     VN:1.3  SO:coordinate
@SQ     SN:1    LN:248956422
@SQ     SN:10   LN:133797422
@SQ     SN:11   LN:135086622
@SQ     SN:12   LN:133275309
@SQ     SN:13   LN:114364328

You might be confused by the fact that all original chromosomes are still represented in the header (albeit NOT in the actual read data), but there is nothing wrong with this since the header is not part of the actual data (i.e., read data).

ADD COMMENT
2
Entering edit mode

In addition to that, simply index the BAM and run samtools idxstats. You'll see that the counts for all chromosomes except 19 will be 0.

ADD REPLY

Login before adding your answer.

Traffic: 2652 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6