Question: Extract ONLY chromosomes 1-22 from bam file - removing extraneous chr annotations
0
gravatar for camerond
9 months ago by
camerond10
camerond10 wrote:

I've wrote this multi-facted question to solve a specific coding problem, but also to bring together information that is covered partially across multiple existing biostars posts that others may find useful.

Part 1 - Edit the bam file

I have several bam files that have reads mapped to chromosomes that I'm not interested in i.e.

  • 'chrUn_gl0000230' - over 40 of these
  • 'chr9_gl000198_random' - over ten of these
  • 'chr6_cox_hap2' - around 10 of these
  • 'chrM'

The unique identifiers for the first three groups are 'chrUn', 'random', 'hap'.

I tried this to get rid of everything except chr 1-22 and chr X-Y by cobbling together code from here and elsewhere:

samtools idxstats my.srtd.bam | grep -v chrUn* | grep -v *random | grep -v chrM | grep -v *hap* | xargs samtools view -b my.bam > my.only1to22andXY.srtd.bam

However, this eliminated only the chrM reads and the chrUn reads. I also tried this which gave the same result:

samtools idxstats my.srtd.bam | grep -v chrUn_* | grep -v chr*_random | grep -v chrM | grep -v chr*_hap* | xargs samtools view -b my.srtd.bam > my.only1to22andXY.srtd.bam

I suspect I'm having an issue with the * character but not sure why some chromosome reads are omitted and others are not.

What I really need is a one liner that retains only reads mapped chr1-22 (I don't need the chr X-Y either). Is this possible? It doesn't matter if it involves removing what I don't need, or keeping what I do need. Ideally this would be in bash.

Part 2 - Edit the Header

This leads to the next question of how to update the headers of the bam files. I need to omit the extraneous chromosome reads for a downstream application (ALTRE) as it crashes when they are present. There is a chance the package may also crash if the extra chromosome names are quoted in the header (even if the reads have been removed) and they read 0. Is it possible to clean the header to remove lines pertaining to extraneous chromosomes? Such as:

@SQ SN:chr11_gl000202_random    LN:40103

Part 3 - Edit the fasta file

I'm aligning to hg19. I just occurred to me that the best course of action (particularly for future analyses) is to remove these extraneous chromosomes at source, i.e. from the fasta files or the genome index files that I use for the alignment. Is that a valid thing to do? Could anyone suggest how to go about this?

ADD COMMENTlink modified 9 months ago by Pierre Lindenbaum115k • written 9 months ago by camerond10
1

Bash shell expansion works with samtools CLI:

$ samtools view -b input.bam chr{1..22} >output.bam
ADD REPLYlink modified 9 months ago • written 9 months ago by cpad011210k

Part 2 - Edit the Header

never do this. you should awlays stick to your reference genome.

ADD REPLYlink modified 9 months ago • written 9 months ago by Pierre Lindenbaum115k

Extract ONLY chromosomes 1-22 from bam file

anyway, would do you want to do this ?....

ADD REPLYlink modified 9 months ago • written 9 months ago by Pierre Lindenbaum115k
2
gravatar for Pierre Lindenbaum
9 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum115k wrote:

create a BED for chr1 to chr22 from the fasta index of your reference genome.

awk '/^chr[0-9]*\t/ {printf("%s\t0\t%s\n",$1,$2);}' human_g1k_v37_prefix.fasta.fai  > your.bed

use this bed file to filter with samtools.

samtools view -L your.bed -o out.bam input.bam

another idea:

samtools view -o out.bam input.bam `seq 1 22 | sed 's/^/chr/'`
ADD COMMENTlink modified 9 months ago • written 9 months ago by Pierre Lindenbaum115k

Thanks for your responses. I implemented the second suggestion as it had fewer steps and worked fine, all reads were removed.

I can use that one liner in my pipeline, thanks again.

I'll have a go using the downstream package and see if it likes the header or not.

ADD REPLYlink modified 9 months ago • written 9 months ago by camerond10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1761 users visited in the last hour