Question: How To Split A Bam File By Chromosome
6
8.5 years ago by
GPR320
Mexico
GPR320 wrote:

Hello, I am having a hard time opening a very large bedgraph. It was suggested that I split my bam file by chromosome with chrom-bed.py, but it didn't work. Is there an alternative? Thanks, GP.

bam split chromosome • 63k views
modified 2.8 years ago by hewm200840 • written 8.5 years ago by GPR320
1

How are you having a hard time opening the file? What happens? How did your script to split the bam file "not work"? These details may help people answer your question.

written 8.5 years ago by Alex Paciorkowski3.4k

I am mostly having trouble with the indexing and the header of my bam file. I have indexed it in various ways but keep running into problems. I have tried chrom-bed.py and samtools view chr1 > chr1.bam. I have indexed with both samtools and bamtools, and I even tried sorting the file before indexing. Any suggestions?

written 8.5 years ago by GPR320
1

bam != bedgraph, fyi.

written 8.5 years ago by Madelaine Gogol5.2k

I have indexed the file, but when running samtools view I get a "fail to read the header" error for the bam file. Any tips?

@Madelaine: yes, I have converted my bam files to bedgraph successfully. The problem is their size, which is why I want to split them by chromosome.

written 8.5 years ago by GPR320

Try samtools reheader to get the header back.

written 8.5 years ago by Michael Dondrup48k

@genaro Oh, sorry.

You can also use picard CreateSequenceDictionary on the genome fasta and add that to the top of the sam file if the sam file is missing a header.

written 8.5 years ago by Madelaine Gogol5.2k
20
7.6 years ago by
Jorge Amigo12k
Santiago de Compostela, Spain
Jorge Amigo12k wrote:

in this other answer, Aaron Quinlan stated:

bamtools has a "split" command for exactly this purpose

I can only add that I've just tried it with this simple command

bamtools split -in file.bam -reference

and it works like a charm. The bam file gets split into one bam file per reference, and the outputs are suffixed with .REF_xxx.bam by default, which is very convenient.

written 7.6 years ago by Jorge Amigo12k

This has the advantage of only generating a bamfile for the reference that you want.

written 3.1 years ago by russ.bainer0

Is it possible to specify an output directory, and to choose which specific chromosomes are extracted (e.g., just chr5 and chr22)?

modified 2.8 years ago • written 2.8 years ago by mg100
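One way to get both an output directory and a chosen set of chromosomes is to loop over samtools view instead of using bamtools split. The following is only a sketch: it assumes samtools is on the PATH and that the input is coordinate-sorted and indexed, and the function name extract_chroms and the output naming scheme are made up for illustration.

```shell
# Hypothetical helper: write one BAM per requested chromosome into OUTDIR.
# Assumes in.bam is coordinate-sorted and indexed (samtools index in.bam).
# Usage: extract_chroms in.bam OUTDIR chr5 chr22
extract_chroms() {
    local bam="$1" outdir="$2" base chrom
    shift 2
    base=$(basename "$bam" .bam)
    mkdir -p "$outdir"
    for chrom in "$@"; do
        # -b writes BAM; without it samtools view prints SAM text
        samtools view -b -o "$outdir/$base.$chrom.bam" "$bam" "$chrom"
    done
}
```

For example, extract_chroms file.bam split_out chr5 chr22 should leave split_out/file.chr5.bam and split_out/file.chr22.bam.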
14
8.5 years ago by
Bergen, Norway
Michael Dondrup48k wrote:

Try samtools: samtools view -?

A region should be presented in one of the following formats: 'chr1', 'chr2:1,000' and 'chr3:1000-2,000'. When a region is specified, the input alignment file must be an indexed BAM file.

something like samtools view in.bam chr1 > chr1.bam should work

modified 13 months ago by RamRS30k • written 8.5 years ago by Michael Dondrup48k
7

You're missing the -b flag (bam output). So something like:

samtools view -b in.bam chr1 > in_chr1.bam

NOTE: The sequence dictionary (@SQ header lines) will still contain entries for everything. This can cause problems if the tools you're feeding those split bam files into use that header information. For example, picard CollectWgsMetrics will still assume that the bam is supposed to cover the whole genome, not just a single chrom. I'm certain lots of other tools will have similar problems.

modified 13 months ago by RamRS30k • written 6.1 years ago by travcollier160
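The single-chromosome command above can be repeated for every reference listed in the index. A minimal sketch, assuming samtools is on the PATH and the BAM is coordinate-sorted and indexed; split_by_chrom is a hypothetical name, and each output still carries the full @SQ dictionary as noted above.

```shell
# Hypothetical helper: split in.bam into in.<chrom>.bam for each reference.
# Usage: split_by_chrom in.bam
split_by_chrom() {
    local bam="$1" base chrom
    base="${bam%.bam}"
    # idxstats prints one line per reference: name, length, mapped, unmapped.
    # The final "*" line counts unplaced reads and is skipped here.
    samtools idxstats "$bam" | cut -f 1 | grep -v '^\*$' |
    while read -r chrom; do
        samtools view -b -o "$base.$chrom.bam" "$bam" "$chrom"
    done
}
```

References with no reads still get a (header-only) output file in this version; filtering on the third idxstats column would be one way to skip them.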
1

Is there a way to prevent each bam file from having entries for everything in the sequence dictionary (@SQ)? Or is there a way to recreate/parse/filter the header for each specific bam?

written 2.3 years ago by uribe.convers10

I would try the bamtools way (Jorge Amigo's answer) instead. It might be possible to parse and filter the entries going through text format, but it can also easily mess up everything.

written 17 months ago by Michael Dondrup48k
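For anyone who still wants to try the text-format route, here is a rough sketch of filtering the @SQ lines with awk. The function name filter_sq is made up, and the caveat is real: reads whose mate maps to another chromosome will then name a reference that is no longer in the header, which downstream tools may warn about or reject.

```shell
# Hypothetical filter: read SAM text on stdin and drop every @SQ line
# except the one whose SN: field equals the requested reference name.
# All other header lines and all alignment lines pass through unchanged.
filter_sq() {
    awk -F'\t' -v sn="SN:$1" '
        /^@SQ/ {
            keep = 0
            for (i = 2; i <= NF; i++) if ($i == sn) keep = 1
            if (keep) print
            next
        }
        { print }
    '
}
# Intended use (assumes samtools is installed, in.bam sorted and indexed):
#   samtools view -h in.bam chrY | filter_sq chrY | samtools view -b -o in.chrY.bam -
```

The field comparison is exact, so asking for chr1 does not also keep chr10.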

One solution would be to convert the bam to fastq and then map the fastq reads to the specific chromosome.

written 4.6 years ago by Chen1.0k

This is silly as it's slow and the file wouldn't be indexed.

written 4.4 years ago by SmallChess540
12
7.3 years ago by
SHI Quan120
shenzhen
SHI Quan120 wrote:
samtools view in.bam chr1 -b > out.bam

Use -b to output bam format

modified 13 months ago by RamRS30k • written 7.3 years ago by SHI Quan120
6
7.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

I wrote a java tool to split a BAM per chromosome; see http://code.google.com/p/jvarkit/wiki/SplitBam

It also creates an empty BAM (filled with a pair of mock SAMRecords) for each chromosome in the Reference, if no SAMRecord was found for the chromosome.

written 7.6 years ago by Pierre Lindenbaum131k
2
4.9 years ago by
Shicheng Guo8.5k
Shicheng Guo8.5k wrote:

You can use the following pipeline to extract the chrY reads, together with the header, from a raw bam file:

# sort by coordinate and index so that a region can be queried
samtools sort A.bam -o A.sort.bam
samtools index A.sort.bam
# write the header, then append the chrY alignments as SAM text
samtools view -H A.sort.bam > output.extraction.sam
samtools view A.sort.bam chrY >> output.extraction.sam
# convert the SAM text back to BAM and print its header as a check
samtools view -hb output.extraction.sam > output.extraction.bam
samtools view -H output.extraction.bam

output.extraction.bam is the bam file containing the extracted chrY reads.

modified 13 months ago by RamRS30k • written 4.9 years ago by Shicheng Guo8.5k
3

This isn't really an answer to the original question.

Plus if you want to extract chrY reads you only need a single command:

samtools view -b in.bam chrY > out.bam
modified 13 months ago by RamRS30k • written 4.9 years ago by Jorge Amigo12k

This is true but still requires a sorted and indexed bam file. From the samtools manual:

Use of region specifications requires a coordinate-sorted and indexed input file (in BAM or CRAM format).

samtools sort A.bam -o A.sort.bam
samtools index A.sort.bam
samtools view -b A.sort.bam chrY > output.extraction.bam
modified 2.6 years ago • written 2.6 years ago by goodez480
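If the file may already be sorted, the sort step can be skipped by looking at the header first. This is only a sketch: the names is_coordinate_sorted and prepare_bam are hypothetical, and it trusts the SO: tag on the @HD line, which some files lack even when they are in fact sorted.

```shell
# Succeed if the SAM header on stdin declares coordinate sort order.
is_coordinate_sorted() {
    grep -q '^@HD.*SO:coordinate'
}

# Hypothetical helper: sort only when the header does not claim coordinate
# order, index the result, and print the name of the file to query.
# Usage: sorted=$(prepare_bam A.bam)
prepare_bam() {
    local bam="$1" sorted
    if samtools view -H "$bam" | is_coordinate_sorted; then
        sorted="$bam"
    else
        sorted="${bam%.bam}.sort.bam"
        samtools sort -o "$sorted" "$bam"
    fi
    samtools index "$sorted"
    printf '%s\n' "$sorted"
}
```

The printed name can then be passed to samtools view -b with a region such as chrY.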

Using -b produces a smaller file than omitting it. Why is that? For chrX, for example, I got a 2.1 GB file with the -b option but a 12 GB file without it. The larger file can also be viewed with less or head, which is quite unusual for a BAM.

written 8 weeks ago by rohitsatyam102200
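The size difference is what one would expect: without -b, samtools view writes uncompressed SAM text, whatever the output file is called, which is also why less and head can read it. A quick way to check which kind of file you have is to look at the first two bytes, since BAM files are BGZF (gzip-compatible) compressed. The helper name is_gzip_like below is made up for this sketch.

```shell
# Succeed if FILE starts with the gzip magic bytes 1f 8b, as BGZF-compressed
# BAM files do; plain SAM text will not match.
# Usage: is_gzip_like out.bam && echo "compressed" || echo "plain text"
is_gzip_like() {
    [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}
```

This only tests for gzip-style compression, not that the content is a valid BAM.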
1
2.8 years ago by
hewm200840
hewm200840 wrote:

This software may help:

https://github.com/BGI-shenzhen/BamSplit

written 2.8 years ago by hewm200840

well done. xiao-ming

written 9 months ago by Lhl730