Question: Remove mitochondrial reads from BAM files
4
gravatar for enricoferrero
4.6 years ago by
enricoferrero770
United Kingdom
enricoferrero770 wrote:

Hi,

I have a somewhat high content of mitochondrial RNA in my RNA-seq experiment. Is there a way to use samtools to remove alignments to the 'MT' chromosome and keep all the rest?

I'm considering using samtools view in combination with awk but perhaps there's a better/cleaner solution?

Thanks!

awk rna-seq samtools bam • 16k views
ADD COMMENTlink modified 23 months ago by Simply Bioinformatics150 • written 4.6 years ago by enricoferrero770
20
gravatar for matted
4.6 years ago by
matted7.1k
Boston, United States
matted7.1k wrote:

You can use samtools to do it by selecting particular chromosomes and never leaving the binary format, for example:

samtools view -b input.bam chr1 chr2 chr3 chr4 > output.bam

That's a bit unwieldy to list all non-mitochondrial chromosomes manually, so you could try an ugly bash one-liner like:

samtools idxstats input.bam | cut -f 1 | grep -v MT | xargs samtools view -b input.bam > output.bam

 

ADD COMMENTlink written 4.6 years ago by matted7.1k

I noticed this answer after I added mine - much neater!

 

ADD REPLYlink written 4.6 years ago by Ian5.5k

I have a 10GB BAM file of human transcriptome, I split it into chromosome-wise and all the splited BAM file s' size is less than 10GB it means some data has been lost. Can you please explain it

ADD REPLYlink written 2.0 years ago by qudrat70

Check the number of reads per file, not the size, as it depends on the compression level.

ADD REPLYlink written 23 months ago by ATpoint23k

I'd count the lines, instead of looking at file sizes.

ADD REPLYlink written 23 months ago by swbarnes26.5k

It's also possible your original BAM kept the unmapped reads, and when you split the BAM into chromosomes the unmapped reads had nowhere to go.

ADD REPLYlink written 16 months ago by nick.carleson10

Hi, is there a way to remove all entries after chromosome Y? I am trying to do this in a bedgraph file but the same will suffice in a BAM file.

Thanks in advance.

ADD REPLYlink written 6 months ago by checkyodna20
6
gravatar for Ian
4.6 years ago by
Ian5.5k
University of Manchester, UK
Ian5.5k wrote:

I routinely cleanse my SAM files of chrM, and unassembled "random" contigs before running ChIP-seq analysis.  I use 'sed' on the SAM file.  Although you could be clever and do this via 'samtools view' without the need for creating an intermediate SAM file :)

sed '/chrM/d;/random/d;/chrUn/d' < file.sam > file_filtered.sam

 

 

 

ADD COMMENTlink written 4.6 years ago by Ian5.5k

this approach worked the best for me. thanks!

ADD REPLYlink written 3.7 years ago by mrs.hope10
3
gravatar for Pierre Lindenbaum
4.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

Using my tool samjs: https://github.com/lindenb/jvarkit/wiki/SamJS

java -jar samjs.jar  -e 'record.getReadUnmappedFlag() || !record.getReferenceName().equals("MT")' in.bam

 

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Pierre Lindenbaum123k

Pierre, I tried using your tool (well the Picardtools FilterSamReads) and it keeps failing with the error: ERROR 2016-08-31 12:31:56 FilterSamReads Failed to filter Normal.sort.rmdup.bam java.lang.IllegalStateException: Records B05AGABXX110313:4:1207:12805:19638 (chr1:10,001) should come after B0590ABXX110313:6:2208:18151:161276 (chr1:10,002) when sorting with htsjdk.samtools.SAMRecordQueryNameComparator

These bam files were sorted with Picardtools SortSam and duplicates removed. Any recommendations? Should I just try the tool on its own and not the Picardtools version? Thanks.

ADD REPLYlink written 3.1 years ago by Vanish00710

what's the version of picard ?

ADD REPLYlink written 3.1 years ago by Pierre Lindenbaum123k

Thanks for the reply, It's Picard 2.1.1

ADD REPLYlink written 3.1 years ago by Vanish00710

strange. It looks like your bam is wrongly sorted. No idea.

ADD REPLYlink written 3.1 years ago by Pierre Lindenbaum123k

Thanks. Yeah any idea on how or why a bam would be sorted incorrectly? I mean I used Picardtools to sort it. I suppose I could try sorting it with samtools...

ADD REPLYlink written 3.1 years ago by Vanish00710
3
gravatar for Simply Bioinformatics
23 months ago by
WashingtonDC
Simply Bioinformatics150 wrote:

This solution will keep the headers intact so you can extract fastq file

samtools view -h this.bam | awk '{if($3 != "chrM" && $3 != "chrUn"){print $0}}' | samtools view -Shb - > this.filter.bam
ADD COMMENTlink modified 6 months ago • written 23 months ago by Simply Bioinformatics150

shorter:

awk  '($3 != "chrM" && $3 != "chrUn")'
ADD REPLYlink written 23 months ago by Pierre Lindenbaum123k

faster:

mawk  '($3 != "chrM" && $3 != "chrUn")'

=)

ADD REPLYlink written 23 months ago by ATpoint23k

can i do it for multiple bam files in a folder using parallel ?

"samtools view -h this.bam | awk '{if($3 != "chrM" && $3 != "chrUn"){print $0}}' | samtools view -Shb - > this.filter.bam
ADD REPLYlink written 5 months ago by krushnach80580
2
gravatar for Devon Ryan
4.6 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

Surprisingly, there's no way to do that with the stock version of samtools. You can do this with bedtools by intersecting with a BED file containing all of the chromosomes except MT. That might be faster than awk, depending on how bedtools implemented that.

BTW, in the off chance that you need to do this with a very large number of samples such that any awk/bedtools-based solution is too slow, this would be pretty easy to code in C with HTSlib (just let me know, I could write such a program in a couple minutes).

ADD COMMENTlink written 4.6 years ago by Devon Ryan92k

Of course, the bedtools solution is essentially just same as giving samtools a BED file, so there's probably no gain there.

ADD REPLYlink written 4.6 years ago by Devon Ryan92k
0
gravatar for nash.claire
3.7 years ago by
nash.claire350
Canada
nash.claire350 wrote:

Hi,

I've been struggling with this and don't want to repeat a question so therefore posting here.

I've tried all the solutions and none of them are working for me. Everytime I try these, if I then run samtools idxstats to see if it works, nothing changes. I even tried the long solution from matted of samtools view and typing out every single chromosome I want to keep (1-22 & X/Y) only to find that I just can't get rid of the chrM, all the chrUn etc etc. I also tried this 

samtools idxstats input.bam | grep -v chrUn_* | samtools view -b input.bam > output.bam 

which also didn't work to get rid of chrUn. So if starting with a bam file with all these chrUn_g*** and chrM etc etc in, how can I simply get rid of them? I'll happily take a dirty messy bash one liner if it gets rid of them. 

I know I'm probably doing something stupid as I'm a wet lab scientist and really don't know how to do bioinformatics properly. Unfortunately like others i'm being thrown into it with no time to learn properly and with no-one to teach me. I'm sure you real bioinformaticians out there are sick of newbies not having a clue. 

ADD COMMENTlink written 3.7 years ago by nash.claire350

you should ask this as a new question or explain what 'didn't work'

also what do you expect from:

samtools idxstats input.bam | grep -v chrUn_* | samtools view -b input.bam > output.bam 

as far as I know , the output of this is just the output of

samtools view -b input.bam > output.bam 

which is just a copy of input.bam to output.bam

 

 

 

 

 

 

ADD REPLYlink written 3.7 years ago by Pierre Lindenbaum123k

I don't really know what I expect as I clearly don't know what the hell I'm doing. This is what happens through teaching myself from forums. I can ask a new question but I know it annoys people when questions get repeated. What I really want is a command to use if I have a bam file that contains chrM, chrUn_gl***, chr*_gl****_random alignments in so as to get rid of said alignments and keep the "good stuff" I need. This appears to be the same as the question here but these solutions aren't working for me. And when I said it didn't work, I meant that every time I tried the solutions and then went back to samtools idxstats on the "new" bam files I'd created, absolutely nothing had changed from the original. It still contained all the chrM etc etc in. I appreciate your patience and thanks for your input. I'm afraid patience is something I've run out of with this stuff. 

ADD REPLYlink written 3.7 years ago by nash.claire350

looking at my answer A: Remove mitochondrial reads from BAM files

that would be:

java -jar samjs.jar  -e 'record.getReadUnmappedFlag() || !record.getReferenceName().equals("chrM") || !record.getReferenceName().contains("_gl")' in.bam > out.sam
ADD REPLYlink written 3.7 years ago by Pierre Lindenbaum123k

It looks like you forgot "xargs" before the second samtools command, in the example you posted.  It's an easy thing to get wrong, but it does matter here because it's what takes the valid chromosome names and uses them as filtering arguments in the samtools view call.  Otherwise, it's not filtering at all (since it doesn't get the chromosome name list), and the output will be the same as the input (as you see).

ADD REPLYlink written 3.7 years ago by matted7.1k

Hi thanks for the reply. I've managed to get the sed solution working but it's long and cumbersome so keen to use a quicker and easier way. So a couple of questions on your solution that uses xargs:-

-I've heard GNU parallel can replace xargs and I already have parallel. Can this be used instead of xargs in this solution somehow?

-If I wanted to remove chrM and chrUn etc etc at the same time in this command, how would it be written? I'm guessing grep -v something ???

-Is there any way to run this on multiple samples at the same time or do I have to run them one by one?

Thanks again. You guys might be saving me from throwing my mac out of a window. Along with my PI for piling bioinformatics on a cell biologist.

ADD REPLYlink written 3.7 years ago by nash.claire350

parallel vs. xargs: parallel is usually used to execute multiple commands at once on different parts of the input, which isn't what you want here.  There's probably a way to get it to work, but I'm not an expert with parallel.  On most systems, xargs is already installed (my Mac has xargs but not parallel, for instance).

multiple filtering: sure, you could combine multiple filtering steps with pipes (grep -v BAD1 | grep -v BAD2 | grep -v BAD3 ...).  It's probably cleaner to make a BED file with chromosomes (with coordinate ranges say 1 to 1000000000, to get everything) you want to keep, and use the samtools view -L option with that file.

You can definitely run this on multiple samples at once, and there are a lot of ways to do this (parallel would be good for this, or xargs, or read up on shell scripting with loops).

Overall, I would suggest building up what you're trying to do step by step, to make sure everything makes sense as you go along.  Unfortunately my original one-liner was ugly and complicated, and it fails in unexpected ways if you don't get it exactly right.  But basically the first part (samtools idxstats input.bam | cut -f 1 | grep -v MT) is only to build a list of chromosomes to keep, and the second part (with xargs) is just a trick to pass that list into samtools view. You can try simpler ways to make the list of valid chromosomes, and make sure that works first, perhaps on a small subset of the whole bam (see samtools view -s). Maybe also experiment with samtools view -c, which will just give the number of reads matching your filtering conditions. Then you can quickly see if your filtering is doing anything at all, or roughly matching what you'd expect.

ADD REPLYlink written 3.7 years ago by matted7.1k
1

Thank you so much for the advice. Much appreciated. Yep I think much reading is needed. Or a degree in bioinformatics would be great haha! 

ADD REPLYlink written 3.7 years ago by nash.claire350
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1985 users visited in the last hour