Question

How to subset bamfile based on chromosome IDs

0

Entering edit mode

4.5 years ago

mernster • 0

Hi,

I would like to subset a large bam file so that I can do the variant calling in parallel (using freebayes).

I have tried to subset the bamfile based on subsets of the reference sequence. First, I subset the fasta reference sequence into fasta files containing the equal amount of chromosomes. Then, I used the following command samtools view file.bam -T subset.fasta -b > subset.bam to subset the bam file so that it extracts only the reads that map to the chromosomes included in the respective reference subset:

When I view subset.bam in IGV it only shows the chromosomes that are included in the respective reference sequence which tells me that it worked. However, when I do the variant calling, I get an error saying that freebayes was unable to find a FASTA index entry for the chromosomes that don't belong to that reference subset. Obviously, they are not there because I removed them when I subset the reference.

Any suggestions on what might have gone wrong? If not, does anybody know a different way to subset a bam file into equal parts based on chromosome IDs?

bam subset freebayes • 1.4k views

ADD COMMENT • link updated 4.5 years ago by ATpoint 81k • written 4.5 years ago by mernster • 0

0

Entering edit mode

might be due to the header present ?

I usually manually add the header to only include those of the sequences in the subset because indeed some software complains on the fact they see sequences being listed in the header for which there is no data in the bam/sam file

extract the headers from the subset bam file and inspect which ones are present .

Personally I loop over all sequences I want to have included in the subset and add data for each one by one. I'm not sure if the -T option is doing what you think it might

ADD REPLY • link 4.5 years ago by lieven.sterck 15k

0

Entering edit mode

Why not just use freebayes-parallel?

Check this github thread: https://github.com/ekg/freebayes/issues/465

ADD REPLY • link 4.5 years ago by harish ▴ 450

score 0 · Answer 1 · 2019-10-11

The command should be e.g. samtools view -b -o subset_chr1.bam file.bam chr1 The purpose of -T is to include the reference fasta sequence into the file, e.g. for writing CRAM format. I think what you did was to include the reference sequence of a specific chromosome into the fasta and that messed things up.