I routinely cleanse my SAM files of chrM and the unassembled "random" contigs before running ChIP-seq analysis. I use 'sed' on the SAM file, although you could be clever and do this via 'samtools view' without creating an intermediate SAM file :)
sed '/chrM/d;/random/d;/chrUn/d' < file.sam > file_filtered.sam
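One caveat with the sed line above: it deletes any line that contains the pattern anywhere, which includes the @SQ header lines and any read whose mate maps to chrM. If that is not what you want, a minimal sketch that tests only the RNAME column (column 3) and leaves the header alone could look like the following. The function name drop_contigs and the toy input are made up for illustration, and the patterns assume UCSC-style contig names (chrM, chrUn_*, *_random).

```shell
# Reads SAM text on stdin, writes SAM text on stdout.
# Header lines (starting with "@") pass through untouched; alignments are
# dropped only when their RNAME (column 3) is chrM, a chrUn contig, or a
# *_random contig.
drop_contigs() {
    awk -F'\t' '/^@/ || $3 !~ /^chrM$|^chrUn|_random$/'
}

# Tiny hand-made SAM input, for illustration only.
printf '@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chrM\tLN:16571\nr1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\nr2\t0\tchrM\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\nr3\t0\tchrUn_gl000220\t5\t60\t4M\t*\t0\t0\tACGT\tIIII\n' | drop_contigs
```

On a real file this would sit between two samtools calls, e.g. samtools view -h file.bam | drop_contigs | samtools view -b -o file_filtered.bam - which also avoids the intermediate SAM file.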
Surprisingly, there's no way to do that with the stock version of samtools. You can do this with bedtools by intersecting with a BED file containing all of the chromosomes except MT. That might be faster than awk, depending on how bedtools implemented that.
BTW, on the off chance that you need to do this with a very large number of samples such that any awk/bedtools-based solution is too slow, this would be pretty easy to code in C with HTSlib (just let me know, I could write such a program in a couple of minutes).
I've been struggling with this and don't want to repeat a question, so I'm posting here.
I've tried all the solutions and none of them are working for me. Every time I try these, if I then run samtools idxstats to see if it works, nothing changes. I even tried the long solution from matted of samtools view and typing out every single chromosome I want to keep (1-22 & X/Y) only to find that I just can't get rid of the chrM, all the chrUn etc etc. I also tried this
which also didn't work to get rid of chrUn. So if starting with a bam file with all these chrUn_g* and chrM etc etc in, how can I simply get rid of them? I'll happily take a dirty messy bash one liner if it gets rid of them.
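One possible explanation, though I can't tell from the post whether it applies here: samtools idxstats prints a row for every @SQ line in the BAM header, so contigs whose reads have been removed still show up, just with a mapped count of 0. If the goal is for chrM and chrUn_* to disappear from idxstats entirely, the header has to be filtered too. Below is a rough sketch of that, assuming chr-prefixed names and that only chr1-22, chrX and chrY should stay. The function name keep_canonical is hypothetical.

```shell
# Reads SAM text (with header) on stdin, writes filtered SAM text on stdout.
keep_canonical() {
    awk -F'\t' '
        BEGIN { keep = "^chr([1-9]|1[0-9]|2[0-2]|X|Y)$" }
        # @SQ header lines: keep only those naming a wanted contig.
        /^@SQ/ {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/ && substr($i, 4) ~ keep) { print; break }
            next
        }
        # All other header lines pass through unchanged.
        /^@/ { print; next }
        # Alignments: RNAME (column 3) must be a wanted contig, and the mate
        # (RNEXT, column 7) must be "=", "*", or also a wanted contig, so no
        # record refers to a contig that is no longer in the header.
        # Unmapped reads (RNAME "*") are dropped as a side effect.
        $3 ~ keep && ($7 == "=" || $7 == "*" || $7 ~ keep)
    '
}

# Intended use on a real file (not run here):
#   samtools view -h input.bam | keep_canonical | samtools view -b -o output.bam -
#   samtools index output.bam && samtools idxstats output.bam

# Tiny hand-made SAM input, for illustration only.
printf '@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chrM\tLN:16571\nr1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\nr2\t0\tchrM\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n' | keep_canonical
```

Dropping reads whose mate sits on a removed contig is a design choice made to keep the output consistent; the surviving mates will still carry their paired flags.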
I know I'm probably doing something stupid, as I'm a wet lab scientist and really don't know how to do bioinformatics properly. Unfortunately, like others, I'm being thrown into it with no time to learn properly and with no one to teach me. I'm sure you real bioinformaticians out there are sick of newbies not having a clue.
I noticed this answer after I added mine - much neater!
I have a 10 GB BAM file of a human transcriptome. I split it chromosome-wise, and the combined size of all the split BAM files is less than 10 GB, which suggests some data has been lost. Can you please explain this?
Check the number of reads per file, not the size, as it depends on the compression level.
I'd count the lines, instead of looking at file sizes.
It's also possible your original BAM kept the unmapped reads, and when you split the BAM into chromosomes the unmapped reads had nowhere to go.
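To compare read counts rather than file sizes, one option is to total the output of samtools idxstats (which needs an indexed BAM) for the original file and for each split file. Its columns are contig name, contig length, mapped count and unmapped count, and the final "*" row holds unmapped reads that have no position at all, which are the ones with nowhere to go in a per-chromosome split. A small sketch of the totalling step follows; the function name summarise_idxstats and the numbers in the toy input are invented.

```shell
# Reads `samtools idxstats` output on stdin and prints three totals:
#   mapped   = sum of column 3 over all rows
#   unmapped = sum of column 4 over all rows
#   unplaced = column 4 of the "*" row (unmapped reads with no contig)
summarise_idxstats() {
    awk -F'\t' '
        { mapped += $3; unmapped += $4 }
        $1 == "*" { unplaced = $4 }
        END { printf "mapped=%d unmapped=%d unplaced=%d\n", mapped, unmapped, unplaced }
    '
}

# Intended use on a real file (not run here):
#   samtools idxstats original.bam | summarise_idxstats

# Made-up idxstats output, for illustration only.
printf 'chr1\t1000\t120\t3\nchr2\t900\t80\t1\n*\t0\t0\t25\n' | summarise_idxstats
```

If the mapped totals of the split files add up to the mapped total of the original, the size difference is probably down to compression and the unplaced reads.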
Hi, is there a way to remove all entries after chromosome Y? I am trying to do this in a bedgraph file but the same will suffice in a BAM file.
Thanks in advance.
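For the bedGraph case, a possible sketch is to filter on the first column. This assumes that "after chromosome Y" means everything other than chromosomes 1-22, X and Y, with or without a "chr" prefix; the function name keep_main_chroms is made up, and track/browser/comment lines are passed through.

```shell
# Reads bedGraph text on stdin, writes filtered bedGraph text on stdout.
# Keeps track/browser/comment lines, plus data lines whose first column is
# chromosome 1-22, X or Y (optionally prefixed with "chr").
keep_main_chroms() {
    awk '/^(track|browser|#)/ || $1 ~ /^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y)$/'
}

# Intended use on a real file (not run here):
#   keep_main_chroms < input.bedgraph > filtered.bedgraph

# Made-up bedGraph input, for illustration only.
printf 'track type=bedGraph\nchr1\t0\t100\t1.5\nchrY\t0\t50\t2\nchrM\t0\t10\t3\nchrUn_gl000220\t0\t10\t4\n' | keep_main_chroms
```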
So does this (samtools view -b input.bam chr1 chr2 chr3 chr4 > output.bam) remove the listed chromosomes, or does it create a new file containing only those chromosomes?