Question: How to subset bamfile based on chromosome IDs
0
gravatar for mernster
8 months ago by
mernster0
mernster0 wrote:

Hi,

I would like to subset a large bam file so that I can do the variant calling in parallel (using freebayes).

I have tried to subset the bamfile based on subsets of the reference sequence. First, I subset the fasta reference sequence into fasta files containing the equal amount of chromosomes. Then, I used the following command samtools view file.bam -T subset.fasta -b > subset.bam to subset the bam file so that it extracts only the reads that map to the chromosomes included in the respective reference subset:

When I view subset.bam in IGV it only shows the chromosomes that are included in the respective reference sequence which tells me that it worked. However, when I do the variant calling, I get an error saying that freebayes was unable to find a FASTA index entry for the chromosomes that don't belong to that reference subset. Obviously, they are not there because I removed them when I subset the reference.

Any suggestions on what might have gone wrong? If not, does anybody know a different way to subset a bam file into equal parts based on chromosome IDs?

subset freebayes bam • 236 views
ADD COMMENTlink modified 8 months ago by ATpoint36k • written 8 months ago by mernster0

might be due to the header present ?

I usually manually add the header to only include those of the sequences in the subset because indeed some software complains on the fact they see sequences being listed in the header for which there is no data in the bam/sam file

extract the headers from the subset bam file and inspect which ones are present .

Personally I loop over all sequences I want to have included in the subset and add data for each one by one. I'm not sure if the -T option is doing what you think it might

ADD REPLYlink modified 8 months ago • written 8 months ago by lieven.sterck8.0k

Why not just use freebayes-parallel?

Check this github thread: https://github.com/ekg/freebayes/issues/465

ADD REPLYlink written 8 months ago by harish290
0
gravatar for ATpoint
8 months ago by
ATpoint36k
Germany
ATpoint36k wrote:

The command should be e.g. samtools view -b -o subset_chr1.bam file.bam chr1 The purpose of -T is to include the reference fasta sequence into the file, e.g. for writing CRAM format. I think what you did was to include the reference sequence of a specific chromosome into the fasta and that messed things up.

ADD COMMENTlink written 8 months ago by ATpoint36k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1116 users visited in the last hour