Question: Find common reads between two bam files
gravatar for yp19
22 months ago by
yp1960 wrote:

Hi all! I'm trying to find common reads between two bam files. I first sorted the bam files by read name and then used comm ( Here is my code:

samtools sort -n unmapped_reads1.bam -o unmapped_reads1_sorted.bam
samtools sort -n unmapped_reads2.bam -o unmapped_reads2_sorted.bam

comm -12 unmapped_reads1_sorted.bam unmapped_reads2_sorted.bam > common_seqs.bam

Comm spits out these two messages:

comm: file 2 is not in sorted order

comm: file 1 is not in sorted order

and the resulting output file (common_seqs.bam) is empty. Anyone experienced this? or recommend any other ways to do this task?

Thank you!

sequencing bam reads • 1.0k views
ADD COMMENTlink modified 22 months ago by Pierre Lindenbaum133k • written 22 months ago by yp1960
gravatar for Pierre Lindenbaum
22 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum133k wrote:

the sort order of samtools sort is not the same as the linux sort / comm

samtools sort calls strnum_cmpwhich is not a simple comparaison of bytes :

static int strnum_cmp(const char *_a, const char *_b)
    const unsigned char *a = (const unsigned char*)_a, *b = (const unsigned char*)_b;
    const unsigned char *pa = a, *pb = b;
    while (*pa && *pb) {
        if (isdigit(*pa) && isdigit(*pb)) {
            while (*pa == '0') ++pa;
            while (*pb == '0') ++pb;
            while (isdigit(*pa) && isdigit(*pb) && *pa == *pb) ++pa, ++pb;
            if (isdigit(*pa) && isdigit(*pb)) {
                int i = 0;
                while (isdigit(pa[i]) && isdigit(pb[i])) ++i;
                return isdigit(pa[i])? 1 : isdigit(pb[i])? -1 : (int)*pa - (int)*pb;
            } else if (isdigit(*pa)) return 1;
            else if (isdigit(*pb)) return -1;
            else if (pa - a != pb - b) return pa - a < pb - b? 1 : -1;
        } else {
            if (*pa != *pb) return (int)*pa - (int)*pb;
            ++pa; ++pb;
    return *pa? 1 : *pb? -1 : 0;

you want:

samtools view unmapped_reads1.bam |  cut -f1 | LC_ALL=C sort | uniq >   unmapped_reads1_sorted.bam
samtools view unmapped_reads2.bam |  cut -f1 | LC_ALL=C sort | uniq >   unmapped_reads2_sorted.bam

LC_ALL=C comm -12 unmapped_reads1_sorted.bam unmapped_reads2_sorted.bam > common_seqs.bam
ADD COMMENTlink written 22 months ago by Pierre Lindenbaum133k

When I use the sorting commands, my resulting sorted bam files are a series of lines like this (not the typical bam format with sequences of reads)

ADD REPLYlink modified 22 months ago • written 22 months ago by yp1960

yes, that's what you want if you want to use 'comm'...

ADD REPLYlink written 22 months ago by Pierre Lindenbaum133k

Oh i see! I should be more clear. I want my output to still be .bam just with all of the common sequences

ADD REPLYlink written 22 months ago by yp1960

once you have the common read names you can go back to the bam and get the reads: Extract reads from bam/sam files using read id

ADD REPLYlink written 22 months ago by Pierre Lindenbaum133k

I will try that thank you!!!

ADD REPLYlink written 22 months ago by yp1960

define "common sequences"

ADD REPLYlink written 22 months ago by Pierre Lindenbaum133k

my apologies for my bad explaining. I started off with paired end sequencing data. I mapped these sequences to two different reference genomes separately, and each time I extracted only the unmapped reads. (those two reference genomes represent possible contaminant species).

So now i have two unmapped reads files (unmapped_reads1.bam, unmapped_reads2.bam). I would need to keep sequences that are common to BOTH files as my final output, in bam format.

ADD REPLYlink written 22 months ago by yp1960
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1935 users visited in the last hour