Extract ONLY chromosomes 1-22 from bam file - removing extraneous chr annotations
Entering edit mode
3.7 years ago
camerond ▴ 170

I've wrote this multi-facted question to solve a specific coding problem, but also to bring together information that is covered partially across multiple existing biostars posts that others may find useful.

Part 1 - Edit the bam file

I have several bam files that have reads mapped to chromosomes that I'm not interested in i.e.

  • 'chrUn_gl0000230' - over 40 of these
  • 'chr9_gl000198_random' - over ten of these
  • 'chr6_cox_hap2' - around 10 of these
  • 'chrM'

The unique identifiers for the first three groups are 'chrUn', 'random', 'hap'.

I tried this to get rid of everything except chr 1-22 and chr X-Y by cobbling together code from here and elsewhere:

samtools idxstats my.srtd.bam | grep -v chrUn* | grep -v *random | grep -v chrM | grep -v *hap* | xargs samtools view -b my.bam > my.only1to22andXY.srtd.bam

However, this eliminated only the chrM reads and the chrUn reads. I also tried this which gave the same result:

samtools idxstats my.srtd.bam | grep -v chrUn_* | grep -v chr*_random | grep -v chrM | grep -v chr*_hap* | xargs samtools view -b my.srtd.bam > my.only1to22andXY.srtd.bam

I suspect I'm having an issue with the * character but not sure why some chromosome reads are omitted and others are not.

What I really need is a one liner that retains only reads mapped chr1-22 (I don't need the chr X-Y either). Is this possible? It doesn't matter if it involves removing what I don't need, or keeping what I do need. Ideally this would be in bash.

Part 2 - Edit the Header

This leads to the next question of how to update the headers of the bam files. I need to omit the extraneous chromosome reads for a downstream application (ALTRE) as it crashes when they are present. There is a chance the package may also crash if the extra chromosome names are quoted in the header (even if the reads have been removed) and they read 0. Is it possible to clean the header to remove lines pertaining to extraneous chromosomes? Such as:

@SQ SN:chr11_gl000202_random    LN:40103

Part 3 - Edit the fasta file

I'm aligning to hg19. I just occurred to me that the best course of action (particularly for future analyses) is to remove these extraneous chromosomes at source, i.e. from the fasta files or the genome index files that I use for the alignment. Is that a valid thing to do? Could anyone suggest how to go about this?

genome next-gen sequencing bam file filtering • 14k views
Entering edit mode

Bash shell expansion works with samtools CLI:

$ samtools view -b input.bam chr{1..22} >output.bam
Entering edit mode

Part 2 - Edit the Header

never do this. you should awlays stick to your reference genome.

Entering edit mode

Extract ONLY chromosomes 1-22 from bam file

anyway, would do you want to do this ?....

Entering edit mode
3.7 years ago

create a BED for chr1 to chr22 from the fasta index of your reference genome.

awk '/^chr[0-9]*\t/ {printf("%s\t0\t%s\n",$1,$2);}' human_g1k_v37_prefix.fasta.fai  > your.bed

use this bed file to filter with samtools.

samtools view -L your.bed -o out.bam input.bam

another idea:

samtools view -o out.bam input.bam `seq 1 22 | sed 's/^/chr/'`
Entering edit mode

Thanks for your responses. I implemented the second suggestion as it had fewer steps and worked fine, all reads were removed.

I can use that one liner in my pipeline, thanks again.

I'll have a go using the downstream package and see if it likes the header or not.


Login before adding your answer.

Traffic: 2794 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6