I've wrote this multi-facted question to solve a specific coding problem, but also to bring together information that is covered partially across multiple existing biostars posts that others may find useful.
Part 1 - Edit the bam file
I have several bam files that have reads mapped to chromosomes that I'm not interested in i.e.
- 'chrUn_gl0000230' - over 40 of these
- 'chr9_gl000198_random' - over ten of these
- 'chr6_cox_hap2' - around 10 of these
- 'chrM'
The unique identifiers for the first three groups are 'chrUn', 'random', 'hap'.
I tried this to get rid of everything except chr 1-22 and chr X-Y by cobbling together code from here and elsewhere:
samtools idxstats my.srtd.bam | grep -v chrUn* | grep -v *random | grep -v chrM | grep -v *hap* | xargs samtools view -b my.bam > my.only1to22andXY.srtd.bam
However, this eliminated only the chrM reads and the chrUn reads. I also tried this which gave the same result:
samtools idxstats my.srtd.bam | grep -v chrUn_* | grep -v chr*_random | grep -v chrM | grep -v chr*_hap* | xargs samtools view -b my.srtd.bam > my.only1to22andXY.srtd.bam
I suspect I'm having an issue with the * character but not sure why some chromosome reads are omitted and others are not.
What I really need is a one liner that retains only reads mapped chr1-22 (I don't need the chr X-Y either). Is this possible? It doesn't matter if it involves removing what I don't need, or keeping what I do need. Ideally this would be in bash.
Part 2 - Edit the Header
This leads to the next question of how to update the headers of the bam files. I need to omit the extraneous chromosome reads for a downstream application (ALTRE) as it crashes when they are present. There is a chance the package may also crash if the extra chromosome names are quoted in the header (even if the reads have been removed) and they read 0. Is it possible to clean the header to remove lines pertaining to extraneous chromosomes? Such as:
@SQ SN:chr11_gl000202_random LN:40103
Part 3 - Edit the fasta file
I'm aligning to hg19. I just occurred to me that the best course of action (particularly for future analyses) is to remove these extraneous chromosomes at source, i.e. from the fasta files or the genome index files that I use for the alignment. Is that a valid thing to do? Could anyone suggest how to go about this?
Bash shell expansion works with samtools CLI:
see also: Remove mitochondrial reads from BAM files ; How to remove a chr from a bam file?
never do this. you should awlays stick to your reference genome.
anyway, would do you want to do this ?....