Question: Find common reads between two bam files
0
gravatar for yl2019
6 weeks ago by
yl201910
yl201910 wrote:

Hi all! I'm trying to find common reads between two bam files. I first sorted the bam files by read name and then used comm (https://linux.die.net/man/1/comm). Here is my code:

samtools sort -n unmapped_reads1.bam -o unmapped_reads1_sorted.bam
samtools sort -n unmapped_reads2.bam -o unmapped_reads2_sorted.bam

comm -12 unmapped_reads1_sorted.bam unmapped_reads2_sorted.bam > common_seqs.bam

Comm spits out these two messages:

comm: file 2 is not in sorted order

comm: file 1 is not in sorted order

and the resulting output file (common_seqs.bam) is empty. Anyone experienced this? or recommend any other ways to do this task?

Thank you!

sequencing bam reads • 139 views
ADD COMMENTlink modified 6 weeks ago by Pierre Lindenbaum118k • written 6 weeks ago by yl201910
2
gravatar for Pierre Lindenbaum
6 weeks ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

the sort order of samtools sort is not the same as the linux sort / comm

https://github.com/samtools/samtools/blob/develop/bam_sort.c#L1840

samtools sort calls strnum_cmpwhich is not a simple comparaison of bytes :

static int strnum_cmp(const char *_a, const char *_b)
{
    const unsigned char *a = (const unsigned char*)_a, *b = (const unsigned char*)_b;
    const unsigned char *pa = a, *pb = b;
    while (*pa && *pb) {
        if (isdigit(*pa) && isdigit(*pb)) {
            while (*pa == '0') ++pa;
            while (*pb == '0') ++pb;
            while (isdigit(*pa) && isdigit(*pb) && *pa == *pb) ++pa, ++pb;
            if (isdigit(*pa) && isdigit(*pb)) {
                int i = 0;
                while (isdigit(pa[i]) && isdigit(pb[i])) ++i;
                return isdigit(pa[i])? 1 : isdigit(pb[i])? -1 : (int)*pa - (int)*pb;
            } else if (isdigit(*pa)) return 1;
            else if (isdigit(*pb)) return -1;
            else if (pa - a != pb - b) return pa - a < pb - b? 1 : -1;
        } else {
            if (*pa != *pb) return (int)*pa - (int)*pb;
            ++pa; ++pb;
        }
    }
    return *pa? 1 : *pb? -1 : 0;
}

you want:

samtools view unmapped_reads1.bam |  cut -f1 | LC_ALL=C sort | uniq >   unmapped_reads1_sorted.bam
samtools view unmapped_reads2.bam |  cut -f1 | LC_ALL=C sort | uniq >   unmapped_reads2_sorted.bam

LC_ALL=C comm -12 unmapped_reads1_sorted.bam unmapped_reads2_sorted.bam > common_seqs.bam
ADD COMMENTlink written 6 weeks ago by Pierre Lindenbaum118k

When I use the sorting commands, my resulting sorted bam files are a series of lines like this (not the typical bam format with sequences of reads)

MN00234:32:000H2LGW3:1:11101:10000:11153
MN00234:32:000H2LGW3:1:11101:10000:15424
MN00234:32:000H2LGW3:1:11101:10000:17282
MN00234:32:000H2LGW3:1:11101:10000:18247
MN00234:32:000H2LGW3:1:11101:10000:6127
MN00234:32:000H2LGW3:1:11101:10000:6852
MN00234:32:000H2LGW3:1:11101:10000:7439
MN00234:32:000H2LGW3:1:11101:10001:11174
MN00234:32:000H2LGW3:1:11101:10001:16085
MN00234:32:000H2LGW3:1:11101:10001:18520
MN00234:32:000H2LGW3:1:11101:10002:12632
MN00234:32:000H2LGW3:1:11101:10002:13716
MN00234:32:000H2LGW3:1:11101:10002:14294
MN00234:32:000H2LGW3:1:11101:10002:14998
MN00234:32:000H2LGW3:1:11101:10002:16729
MN00234:32:000H2LGW3:1:11101:10002:17066
MN00234:32:000H2LGW3:1:11101:10002:6528
MN00234:32:000H2LGW3:1:11101:10002:7756
ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by yl201910

yes, that's what you want if you want to use 'comm'...

ADD REPLYlink written 6 weeks ago by Pierre Lindenbaum118k

Oh i see! I should be more clear. I want my output to still be .bam just with all of the common sequences

ADD REPLYlink written 6 weeks ago by yl201910
1

once you have the common read names you can go back to the bam and get the reads: Extract reads from bam/sam files using read id

ADD REPLYlink written 6 weeks ago by Pierre Lindenbaum118k

I will try that thank you!!!

ADD REPLYlink written 6 weeks ago by yl201910

define "common sequences"

ADD REPLYlink written 6 weeks ago by Pierre Lindenbaum118k

my apologies for my bad explaining. I started off with paired end sequencing data. I mapped these sequences to two different reference genomes separately, and each time I extracted only the unmapped reads. (those two reference genomes represent possible contaminant species).

So now i have two unmapped reads files (unmapped_reads1.bam, unmapped_reads2.bam). I would need to keep sequences that are common to BOTH files as my final output, in bam format.

ADD REPLYlink written 6 weeks ago by yl201910
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1135 users visited in the last hour