Question: Find common reads between two bam files
gravatar for yp19
5 months ago by
yp1930 wrote:

Hi all! I'm trying to find common reads between two bam files. I first sorted the bam files by read name and then used comm ( Here is my code:

samtools sort -n unmapped_reads1.bam -o unmapped_reads1_sorted.bam
samtools sort -n unmapped_reads2.bam -o unmapped_reads2_sorted.bam

comm -12 unmapped_reads1_sorted.bam unmapped_reads2_sorted.bam > common_seqs.bam

Comm spits out these two messages:

comm: file 2 is not in sorted order

comm: file 1 is not in sorted order

and the resulting output file (common_seqs.bam) is empty. Anyone experienced this? or recommend any other ways to do this task?

Thank you!

sequencing bam reads • 297 views
ADD COMMENTlink modified 5 months ago by Pierre Lindenbaum122k • written 5 months ago by yp1930
gravatar for Pierre Lindenbaum
5 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

the sort order of samtools sort is not the same as the linux sort / comm

samtools sort calls strnum_cmpwhich is not a simple comparaison of bytes :

static int strnum_cmp(const char *_a, const char *_b)
    const unsigned char *a = (const unsigned char*)_a, *b = (const unsigned char*)_b;
    const unsigned char *pa = a, *pb = b;
    while (*pa && *pb) {
        if (isdigit(*pa) && isdigit(*pb)) {
            while (*pa == '0') ++pa;
            while (*pb == '0') ++pb;
            while (isdigit(*pa) && isdigit(*pb) && *pa == *pb) ++pa, ++pb;
            if (isdigit(*pa) && isdigit(*pb)) {
                int i = 0;
                while (isdigit(pa[i]) && isdigit(pb[i])) ++i;
                return isdigit(pa[i])? 1 : isdigit(pb[i])? -1 : (int)*pa - (int)*pb;
            } else if (isdigit(*pa)) return 1;
            else if (isdigit(*pb)) return -1;
            else if (pa - a != pb - b) return pa - a < pb - b? 1 : -1;
        } else {
            if (*pa != *pb) return (int)*pa - (int)*pb;
            ++pa; ++pb;
    return *pa? 1 : *pb? -1 : 0;

you want:

samtools view unmapped_reads1.bam |  cut -f1 | LC_ALL=C sort | uniq >   unmapped_reads1_sorted.bam
samtools view unmapped_reads2.bam |  cut -f1 | LC_ALL=C sort | uniq >   unmapped_reads2_sorted.bam

LC_ALL=C comm -12 unmapped_reads1_sorted.bam unmapped_reads2_sorted.bam > common_seqs.bam
ADD COMMENTlink written 5 months ago by Pierre Lindenbaum122k

When I use the sorting commands, my resulting sorted bam files are a series of lines like this (not the typical bam format with sequences of reads)

ADD REPLYlink modified 5 months ago • written 5 months ago by yp1930

yes, that's what you want if you want to use 'comm'...

ADD REPLYlink written 5 months ago by Pierre Lindenbaum122k

Oh i see! I should be more clear. I want my output to still be .bam just with all of the common sequences

ADD REPLYlink written 5 months ago by yp1930

once you have the common read names you can go back to the bam and get the reads: Extract reads from bam/sam files using read id

ADD REPLYlink written 5 months ago by Pierre Lindenbaum122k

I will try that thank you!!!

ADD REPLYlink written 5 months ago by yp1930

define "common sequences"

ADD REPLYlink written 5 months ago by Pierre Lindenbaum122k

my apologies for my bad explaining. I started off with paired end sequencing data. I mapped these sequences to two different reference genomes separately, and each time I extracted only the unmapped reads. (those two reference genomes represent possible contaminant species).

So now i have two unmapped reads files (unmapped_reads1.bam, unmapped_reads2.bam). I would need to keep sequences that are common to BOTH files as my final output, in bam format.

ADD REPLYlink written 5 months ago by yp1930
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2094 users visited in the last hour