Question

Filter bam file AND header

0

Entering edit mode

4.0 years ago

quentin54520 ▴ 120

Hello all,

First I'm sorry if this is a question for which there is already an answer but I haven't really been able to come up with a clear answer.

To align my reads I use a grch38 reference which contains lots of chrUn or decoy type contig.

After alignment I would like to remove these reads from the bam but also remove any reference to these contig in the header (because it seems that just having these contig in the header slows down some gatk tools). I had thought of ReorderSam but the problem is that it will sort the bam by coordinates, but for the rest, and in particular markduplicatespark the bam must be sorted by queryname so I would like to avoid it being sorted by coordinates to then have to re-sort it by queryname ...

Thank you in advance for any suggestions.

Quentin

filter header BAM • 1.5k views

ADD COMMENT • link 4.0 years ago by quentin54520 ▴ 120

0

Entering edit mode

After alignment I would like to remove these reads from the bam but also remove any reference to these contig in the header (because it seems that just having these contig in the header slows down some gatk tools).

this is usually a very bad idea. You should always keep the very same reference along all your workflow starting with the mapping of the reads.

to remove some contigs, just use samtools view -L bed_of_contigs_you_want_to_keep.bed -O BAM -o out.bam in.bam

ADD REPLY • link 4.0 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thank you for the answer.

this is usually a very bad idea

This is actually what I thought (I had used samtools view to remove unnecessary reads that mapped on these contigs but without touching the header) but GATK said "Yes, with more contigs, this can certainly slow down the job. I would recommend if possible removing those lines in the header. "

I don't really understand how this can slow down knowing that there is no longer any data on these contig (not only have I removed the reads, but in addition the variant calling is done by interval ...).

For info the tools that are very slow are GenomicsDBImport (run in more than 2 days for 6 gvcf on a sbatch jobs with 5 threads and 8Go of mem) and GenotypeGVCFs.

I will retry GenomicsDBImport with output database writing on the /tmp in the compute node instead of the /shared storage volume maybe it will be sufficient to speed up...

ADD REPLY • link 4.0 years ago by quentin54520 ▴ 120

1

Entering edit mode

For info the tools that are very slow are GenomicsDBImport (run in more than 2 days for 6 gvcf on a sbatch jobs with 5 threads and 8Go of mem) and GenotypeGVCFs.

split your jobs by region/chromosome

ADD REPLY • link 4.0 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Putting the output on the /tmp disk of the compute node instead of the /shared of the storage changes everything ... for both tools I went from 6 days to 15 hours. but for the next ones I keep the option to do by interval

ADD REPLY • link 4.0 years ago by quentin54520 ▴ 120