Question

Find common reads between two bam files

0

Entering edit mode

5.1 years ago

yp19 ▴ 70

Hi all! I'm trying to find common reads between two bam files. I first sorted the bam files by read name and then used comm (https://linux.die.net/man/1/comm). Here is my code:

samtools sort -n unmapped_reads1.bam -o unmapped_reads1_sorted.bam
samtools sort -n unmapped_reads2.bam -o unmapped_reads2_sorted.bam

comm -12 unmapped_reads1_sorted.bam unmapped_reads2_sorted.bam > common_seqs.bam

Comm spits out these two messages:

comm: file 2 is not in sorted order

comm: file 1 is not in sorted order

and the resulting output file (common_seqs.bam) is empty. Anyone experienced this? or recommend any other ways to do this task?

Thank you!

sequencing bam reads • 3.0k views

ADD COMMENT • link updated 5.1 years ago by Pierre Lindenbaum 161k • written 5.1 years ago by yp19 ▴ 70

score 2 · Accepted Answer · 2019-03-07

2

Entering edit mode

5.1 years ago

Pierre Lindenbaum 161k

the sort order of samtools sort is not the same as the linux sort / comm

https://github.com/samtools/samtools/blob/develop/bam_sort.c#L1840

samtools sort calls strnum_cmpwhich is not a simple comparaison of bytes :

static int strnum_cmp(const char *_a, const char *_b)
{
    const unsigned char *a = (const unsigned char*)_a, *b = (const unsigned char*)_b;
    const unsigned char *pa = a, *pb = b;
    while (*pa && *pb) {
        if (isdigit(*pa) && isdigit(*pb)) {
            while (*pa == '0') ++pa;
            while (*pb == '0') ++pb;
            while (isdigit(*pa) && isdigit(*pb) && *pa == *pb) ++pa, ++pb;
            if (isdigit(*pa) && isdigit(*pb)) {
                int i = 0;
                while (isdigit(pa[i]) && isdigit(pb[i])) ++i;
                return isdigit(pa[i])? 1 : isdigit(pb[i])? -1 : (int)*pa - (int)*pb;
            } else if (isdigit(*pa)) return 1;
            else if (isdigit(*pb)) return -1;
            else if (pa - a != pb - b) return pa - a < pb - b? 1 : -1;
        } else {
            if (*pa != *pb) return (int)*pa - (int)*pb;
            ++pa; ++pb;
        }
    }
    return *pa? 1 : *pb? -1 : 0;
}

you want:

samtools view unmapped_reads1.bam |  cut -f1 | LC_ALL=C sort | uniq >   unmapped_reads1_sorted.bam
samtools view unmapped_reads2.bam |  cut -f1 | LC_ALL=C sort | uniq >   unmapped_reads2_sorted.bam

LC_ALL=C comm -12 unmapped_reads1_sorted.bam unmapped_reads2_sorted.bam > common_seqs.bam

ADD COMMENT • link 5.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

When I use the sorting commands, my resulting sorted bam files are a series of lines like this (not the typical bam format with sequences of reads)

MN00234:32:000H2LGW3:1:11101:10000:11153
MN00234:32:000H2LGW3:1:11101:10000:15424
MN00234:32:000H2LGW3:1:11101:10000:17282
MN00234:32:000H2LGW3:1:11101:10000:18247
MN00234:32:000H2LGW3:1:11101:10000:6127
MN00234:32:000H2LGW3:1:11101:10000:6852
MN00234:32:000H2LGW3:1:11101:10000:7439
MN00234:32:000H2LGW3:1:11101:10001:11174
MN00234:32:000H2LGW3:1:11101:10001:16085
MN00234:32:000H2LGW3:1:11101:10001:18520
MN00234:32:000H2LGW3:1:11101:10002:12632
MN00234:32:000H2LGW3:1:11101:10002:13716
MN00234:32:000H2LGW3:1:11101:10002:14294
MN00234:32:000H2LGW3:1:11101:10002:14998
MN00234:32:000H2LGW3:1:11101:10002:16729
MN00234:32:000H2LGW3:1:11101:10002:17066
MN00234:32:000H2LGW3:1:11101:10002:6528
MN00234:32:000H2LGW3:1:11101:10002:7756

ADD REPLY • link 5.1 years ago by yp19 ▴ 70

0

Entering edit mode

yes, that's what you want if you want to use 'comm'...

ADD REPLY • link 5.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Oh i see! I should be more clear. I want my output to still be .bam just with all of the common sequences

ADD REPLY • link 5.1 years ago by yp19 ▴ 70

1

Entering edit mode

once you have the common read names you can go back to the bam and get the reads: Extract reads from bam/sam files using read id

ADD REPLY • link 5.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I will try that thank you!!!

ADD REPLY • link 5.1 years ago by yp19 ▴ 70

0

Entering edit mode

define "common sequences"

ADD REPLY • link 5.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

my apologies for my bad explaining. I started off with paired end sequencing data. I mapped these sequences to two different reference genomes separately, and each time I extracted only the unmapped reads. (those two reference genomes represent possible contaminant species).

So now i have two unmapped reads files (unmapped_reads1.bam, unmapped_reads2.bam). I would need to keep sequences that are common to BOTH files as my final output, in bam format.

ADD REPLY • link 5.1 years ago by yp19 ▴ 70