Tool to unmark duplicates
9.8 years ago
Mark Ebbert ▴ 90

Hi,

I received some BAMs where duplicates were marked using Picard. I'd like to unmark all duplicate reads for a study we're doing. Does anyone know of an existing tool? I haven't found anything in google land.

If there isn't an existing tool, could you provide some guidance on how to unmark them myself? I've read the SAM spec and understand I need to change the bit flag, but I'm not sure of the best way to approach it.

Thanks!

Mark

MarkDuplicates PicardTools PCR-duplicates SamTools • 6.7k views

Hello me.mark!

It appears that your post has been cross-posted to another site: SEQanswers

This is typically not recommended as it runs the risk of annoying people in both communities.


Thank you for the reminder. My colleague and I were both looking for a solution and it looks like he also posted a question about it. Please excuse the duplicate post.

Edit: Thank you everyone for your helpful responses!


That explains the very different usernames, no worries!


This post explains how to do this using Picard: Remove flags of MarkDuplicates (picard)
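If I remember right, that approach boils down to Picard's RevertSam with only the duplicate information removed. A rough sketch, hedged since option defaults differ between Picard versions (check the RevertSam documentation before running; file names are placeholders):

    java -jar picard.jar RevertSam \
        I=input.bam \
        O=input.unmarked.bam \
        REMOVE_DUPLICATE_INFORMATION=true \
        REMOVE_ALIGNMENT_INFORMATION=false \
        RESTORE_ORIGINAL_QUALITIES=false \
        SORT_ORDER=coordinate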

9.8 years ago

As Devon said, writing this tool is very easy ;-)

I wrote it and put it here: https://github.com/lindenb/jvarkit/wiki/Biostar106668

Example:

$ curl -s "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/technical/ncbi_varpipe_data/alignment/NA12751/NA12751.chrom20.ILLUMINA.mosaik.CEU.low_coverage.20101123.bam" |\
samtools view -h - |\
head -n 10000 |\
java -jar dist/biostar106668.jar -b > out.bam

I guess my "has written" should have been "will shortly write" :)

9.8 years ago

Here is something you can try. I haven't tested it, but I think it should work. Let me know if you get any errors when converting back to BAM.

samtools view -h input.bam | \
  awk 'BEGIN{FS=OFS="\t"}   # split and rejoin on tabs only, so spaces inside fields survive
    /^@/ {print; next}      # pass header lines through unchanged
    {
    printf "%s\t", $1;
    if(and($2,0x400)){t=$2-1024}   # clear the duplicate bit (0x400 = 1024)
      else{t=$2};
    printf "%s\t", t;
    for (i=3; i<NF; i++){printf "%s\t", $i};
    printf "%s\n",$NF}' | \
  samtools view -Sb - > input.unmarked.bam

It should be noted that the awk and() function doesn't always exist (e.g., on OS X).


It should also be noted that "you can't do binf on a mac"


I wouldn't go that far. OS X is essentially a BSD variant with a really nice user interface. The and() function is just a difference between gawk and mawk and one can always install the other one.
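For portability, here is a sketch that avoids and() altogether and tests the 0x400 bit with plain integer arithmetic, which should behave the same in gawk, mawk, and BSD awk (file names are placeholders):

    samtools view -h input.bam \
      | awk 'BEGIN{FS=OFS="\t"}
          /^@/ {print; next}                        # headers pass through untouched
          { if (int($2/1024) % 2 == 1) $2 -= 1024;  # bit 10 set -> clear it
            print }' \
      | samtools view -Sb - > input.unmarked.bam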


You could shorten this by just setting $2-=1024 in your if and then printing $0.


Section 6.1 of this tutorial from the GATK team uses a modified version of your code to remove the 0x1 flag, but I get various syntax errors. When I instead try your exact code, changing only the flag value, I get:

gawk: cmd. line:1: (FILENAME=- FNR=1) fatal: not enough arguments to satisfy format string
    `%s @HD'
     ^ ran out for this one
[main_samview] fail to read the header from "-".

How could I fix this?

9.8 years ago

I'm not familiar with any existing tool (though I wouldn't be surprised if Pierre Lindenbaum has written one), but this should be simple enough to code. Depending on the version of awk on your system, you may be able to do it directly: in some implementations of awk (not all of them), if(and($2,1024)){$2-=1024} will unmark a duplicate. You would simply run samtools view -h foo.bam | awk ...stuff... | samtools view -Sb - > foo.unmarked.bam.
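Spelled out, a sketch assuming gawk (which ships with and()); file names are placeholders:

    samtools view -h foo.bam \
      | awk 'BEGIN{FS=OFS="\t"}
          /^@/ {print; next}                        # leave header lines alone
          { if (and($2, 1024)) $2 -= 1024; print }' \
      | samtools view -Sb - > foo.unmarked.bam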

The other simple way to do this is with Python, using pysam. Pysam allows reading from and writing to SAM/BAM files from within Python, so you could perform a similar operation there.
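For example, a minimal sketch (untested, assuming a reasonably recent pysam; file names are placeholders):

    import pysam

    # Open the input BAM and an output BAM that reuses its header.
    inp = pysam.AlignmentFile("foo.bam", "rb")
    out = pysam.AlignmentFile("foo.unmarked.bam", "wb", template=inp)

    for read in inp:
        read.is_duplicate = False  # clears the 0x400 (duplicate) flag bit
        out.write(read)

    inp.close()
    out.close()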
