Question: Remove mitochondrial reads from BAM files
4
gravatar for enricoferrero
5.0 years ago by
enricoferrero780
United Kingdom
enricoferrero780 wrote:

Hi,

I have a somewhat high content of mitochondrial RNA in my RNA-seq experiment. Is there a way to use samtools to remove alignments to the 'MT' chromosome and keep all the rest?

I'm considering using samtools view in combination with awk but perhaps there's a better/cleaner solution?

Thanks!

awk rna-seq samtools bam • 17k views
ADD COMMENTlink modified 2.3 years ago by Simply Bioinformatics160 • written 5.0 years ago by enricoferrero780
21
gravatar for matted
5.0 years ago by
matted7.2k
Boston, United States
matted7.2k wrote:

You can use samtools to do it by selecting particular chromosomes and never leaving the binary format, for example:

samtools view -b input.bam chr1 chr2 chr3 chr4 > output.bam

That's a bit unwieldy to list all non-mitochondrial chromosomes manually, so you could try an ugly bash one-liner like:

samtools idxstats input.bam | cut -f 1 | grep -v MT | xargs samtools view -b input.bam > output.bam
ADD COMMENTlink modified 5 weeks ago by RamRS25k • written 5.0 years ago by matted7.2k

I noticed this answer after I added mine - much neater!

 

ADD REPLYlink written 5.0 years ago by Ian5.6k

I have a 10GB BAM file of human transcriptome, I split it into chromosome-wise and all the splited BAM file s' size is less than 10GB it means some data has been lost. Can you please explain it

ADD REPLYlink written 2.3 years ago by qudrat70

Check the number of reads per file, not the size, as it depends on the compression level.

ADD REPLYlink written 2.3 years ago by ATpoint28k

I'd count the lines, instead of looking at file sizes.

ADD REPLYlink written 2.3 years ago by swbarnes27.2k

It's also possible your original BAM kept the unmapped reads, and when you split the BAM into chromosomes the unmapped reads had nowhere to go.

ADD REPLYlink written 20 months ago by nick.carleson10

Hi, is there a way to remove all entries after chromosome Y? I am trying to do this in a bedgraph file but the same will suffice in a BAM file.

Thanks in advance.

ADD REPLYlink written 10 months ago by checkyodna20

so does this (samtools view -b input.bam chr1 chr2 chr3 chr4 > output.bam) remove the listed chrosomes or actually create a new file only with those chr?

ADD REPLYlink written 4 weeks ago by User000270
10
gravatar for Ian
5.0 years ago by
Ian5.6k
University of Manchester, UK
Ian5.6k wrote:

I routinely cleanse my SAM files of chrM, and unassembled "random" contigs before running ChIP-seq analysis.  I use 'sed' on the SAM file.  Although you could be clever and do this via 'samtools view' without the need for creating an intermediate SAM file :)

sed '/chrM/d;/random/d;/chrUn/d' < file.sam > file_filtered.sam

 

 

 

ADD COMMENTlink written 5.0 years ago by Ian5.6k

this approach worked the best for me. thanks!

ADD REPLYlink written 4.0 years ago by mrs.hope10
5
gravatar for Pierre Lindenbaum
5.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum125k wrote:

Using my tool samjs: https://github.com/lindenb/jvarkit/wiki/SamJS

java -jar samjs.jar  -e 'record.getReadUnmappedFlag() || !record.getReferenceName().equals("MT")' in.bam

 

ADD COMMENTlink modified 5.0 years ago • written 5.0 years ago by Pierre Lindenbaum125k

Pierre, I tried using your tool (well the Picardtools FilterSamReads) and it keeps failing with the error: ERROR 2016-08-31 12:31:56 FilterSamReads Failed to filter Normal.sort.rmdup.bam java.lang.IllegalStateException: Records B05AGABXX110313:4:1207:12805:19638 (chr1:10,001) should come after B0590ABXX110313:6:2208:18151:161276 (chr1:10,002) when sorting with htsjdk.samtools.SAMRecordQueryNameComparator

These bam files were sorted with Picardtools SortSam and duplicates removed. Any recommendations? Should I just try the tool on its own and not the Picardtools version? Thanks.

ADD REPLYlink written 3.4 years ago by Vanish00710

what's the version of picard ?

ADD REPLYlink written 3.4 years ago by Pierre Lindenbaum125k

Thanks for the reply, It's Picard 2.1.1

ADD REPLYlink written 3.4 years ago by Vanish00710

strange. It looks like your bam is wrongly sorted. No idea.

ADD REPLYlink written 3.4 years ago by Pierre Lindenbaum125k

Thanks. Yeah any idea on how or why a bam would be sorted incorrectly? I mean I used Picardtools to sort it. I suppose I could try sorting it with samtools...

ADD REPLYlink written 3.4 years ago by Vanish00710
3
gravatar for Simply Bioinformatics
2.3 years ago by
WashingtonDC
Simply Bioinformatics160 wrote:

This solution will keep the headers intact so you can extract fastq file

samtools view -h this.bam | awk '{if($3 != "chrM" && $3 != "chrUn"){print $0}}' | samtools view -Shb - > this.filter.bam
ADD COMMENTlink modified 10 months ago • written 2.3 years ago by Simply Bioinformatics160

shorter:

awk  '($3 != "chrM" && $3 != "chrUn")'
ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum125k

faster:

mawk  '($3 != "chrM" && $3 != "chrUn")'

=)

ADD REPLYlink written 2.3 years ago by ATpoint28k

can i do it for multiple bam files in a folder using parallel ?

"samtools view -h this.bam | awk '{if($3 != "chrM" && $3 != "chrUn"){print $0}}' | samtools view -Shb - > this.filter.bam
ADD REPLYlink written 9 months ago by krushnach80630
2
gravatar for Devon Ryan
5.0 years ago by
Devon Ryan93k
Freiburg, Germany
Devon Ryan93k wrote:

Surprisingly, there's no way to do that with the stock version of samtools. You can do this with bedtools by intersecting with a BED file containing all of the chromosomes except MT. That might be faster than awk, depending on how bedtools implemented that.

BTW, in the off chance that you need to do this with a very large number of samples such that any awk/bedtools-based solution is too slow, this would be pretty easy to code in C with HTSlib (just let me know, I could write such a program in a couple minutes).

ADD COMMENTlink written 5.0 years ago by Devon Ryan93k

Of course, the bedtools solution is essentially just same as giving samtools a BED file, so there's probably no gain there.

ADD REPLYlink written 5.0 years ago by Devon Ryan93k
0
gravatar for nash.claire
4.1 years ago by
nash.claire380
Canada
nash.claire380 wrote:

Hi,

I've been struggling with this and don't want to repeat a question so therefore posting here.

I've tried all the solutions and none of them are working for me. Every time I try these, if I then run samtools idxstats to see if it works, nothing changes. I even tried the long solution from matted of samtools view and typing out every single chromosome I want to keep (1-22 & X/Y) only to find that I just can't get rid of the chrM, all the chrUn etc etc. I also tried this

samtools idxstats input.bam | grep -v chrUn_* | samtools view -b input.bam > output.bam

which also didn't work to get rid of chrUn. So if starting with a bam file with all these chrUn_g* and chrM etc etc in, how can I simply get rid of them? I'll happily take a dirty messy bash one liner if it gets rid of them.

I know I'm probably doing something stupid as I'm a wet lab scientist and really don't know how to do bioinformatics properly. Unfortunately like others i'm being thrown into it with no time to learn properly and with no-one to teach me. I'm sure you real bioinformaticians out there are sick of newbies not having a clue.

ADD COMMENTlink modified 5 weeks ago by RamRS25k • written 4.1 years ago by nash.claire380

you should ask this as a new question or explain what 'didn't work'

also what do you expect from:

samtools idxstats input.bam | grep -v chrUn_* | samtools view -b input.bam > output.bam

as far as I know , the output of this is just the output of

samtools view -b input.bam > output.bam

which is just a copy of input.bam to output.bam

ADD REPLYlink modified 5 weeks ago by RamRS25k • written 4.1 years ago by Pierre Lindenbaum125k

I don't really know what I expect as I clearly don't know what the hell I'm doing. This is what happens through teaching myself from forums. I can ask a new question but I know it annoys people when questions get repeated. What I really want is a command to use if I have a bam file that contains chrM, chrUn_gl***, chr*_gl****_random alignments in so as to get rid of said alignments and keep the "good stuff" I need. This appears to be the same as the question here but these solutions aren't working for me. And when I said it didn't work, I meant that every time I tried the solutions and then went back to samtools idxstats on the "new" bam files I'd created, absolutely nothing had changed from the original. It still contained all the chrM etc etc in. I appreciate your patience and thanks for your input. I'm afraid patience is something I've run out of with this stuff.

ADD REPLYlink modified 5 weeks ago by RamRS25k • written 4.1 years ago by nash.claire380

looking at my answer A: Remove mitochondrial reads from BAM files

that would be:

java -jar samjs.jar  -e 'record.getReadUnmappedFlag() || !record.getReferenceName().equals("chrM") || !record.getReferenceName().contains("_gl")' in.bam > out.sam
ADD REPLYlink modified 5 weeks ago by RamRS25k • written 4.1 years ago by Pierre Lindenbaum125k

It looks like you forgot "xargs" before the second samtools command, in the example you posted. It's an easy thing to get wrong, but it does matter here because it's what takes the valid chromosome names and uses them as filtering arguments in the samtools view call. Otherwise, it's not filtering at all (since it doesn't get the chromosome name list), and the output will be the same as the input (as you see).

ADD REPLYlink modified 5 weeks ago by RamRS25k • written 4.1 years ago by matted7.2k

Hi thanks for the reply. I've managed to get the sed solution working but it's long and cumbersome so keen to use a quicker and easier way. So a couple of questions on your solution that uses xargs:-

  • I've heard GNU parallel can replace xargs and I already have parallel. Can this be used instead of xargs in this solution somehow?

  • If I wanted to remove chrM and chrUn etc etc at the same time in this command, how would it be written? I'm guessing grep -v <something>???

  • Is there any way to run this on multiple samples at the same time or do I have to run them one by one?

Thanks again. You guys might be saving me from throwing my mac out of a window. Along with my PI for piling bioinformatics on a cell biologist.

ADD REPLYlink modified 5 weeks ago by RamRS25k • written 4.1 years ago by nash.claire380

parallel vs. xargs: parallel is usually used to execute multiple commands at once on different parts of the input, which isn't what you want here. There's probably a way to get it to work, but I'm not an expert with parallel. On most systems, xargs is already installed (my Mac has xargs but not parallel, for instance).

multiple filtering: sure, you could combine multiple filtering steps with pipes (grep -v BAD1 | grep -v BAD2 | grep -v BAD3 ...). It's probably cleaner to make a BED file with chromosomes (with coordinate ranges say 1 to 1000000000, to get everything) you want to keep, and use the samtools view -L option with that file.

You can definitely run this on multiple samples at once, and there are a lot of ways to do this (parallel would be good for this, or xargs, or read up on shell scripting with loops).

Overall, I would suggest building up what you're trying to do step by step, to make sure everything makes sense as you go along. Unfortunately my original one-liner was ugly and complicated, and it fails in unexpected ways if you don't get it exactly right. But basically the first part (samtools idxstats input.bam | cut -f 1 | grep -v MT) is only to build a list of chromosomes to keep, and the second part (with xargs) is just a trick to pass that list into samtools view. You can try simpler ways to make the list of valid chromosomes, and make sure that works first, perhaps on a small subset of the whole bam (see samtools view -s). Maybe also experiment with samtools view -c, which will just give the number of reads matching your filtering conditions. Then you can quickly see if your filtering is doing anything at all, or roughly matching what you'd expect.

ADD REPLYlink modified 5 weeks ago by RamRS25k • written 4.1 years ago by matted7.2k
1

Thank you so much for the advice. Much appreciated. Yep I think much reading is needed. Or a degree in bioinformatics would be great haha!

ADD REPLYlink modified 5 weeks ago by RamRS25k • written 4.1 years ago by nash.claire380
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1174 users visited in the last hour