Question

Bam To Bed With Read Nucleotide Sequences

2

Entering edit mode

10.5 years ago

rhasbunz ▴ 30

Dear All,

I have a relative simple question but i don't know how to solve this. I want to make a BED file from SAM or BAM incluiding nucleotide sequence of each mapped reads.

For example the BAM ﬁles contain genomic regions from chromosome 18 covered by short reads obtained from NGS-Illumina experiment (MRE-seq).

I need to create a BED file containing the following columns:

the ﬁrst column is of type character and contains the chromosome of the region (e.g. chr1)
the second column is of type numeric and contains the start position of the mapped read
the third column is of type numeric and contains the stop position of the mapped read
the fourth column contains the nucleotide sequence of the mapped read

An example BED file is:

chr18 9954 10104 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTACCCTAACCC 0 -
chr18 10053 10203 TAACCCTAACCCTAACCCTAAACCCTAACCCTGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT 0 -
chr18 10084 10234 CCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCCTAACCA 0 +
chr18 10112 10262 ACCCTAACCCTAACCCTAACCCTAACCCTTAACCTTAACCCTAACCCTTAACCCTAACCCTAACCCTAACCCTAA 0 +
chr18 10114 10264 CCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCCTAACCCTTAACCCTAACCCTAACCCTAACCCTAACC 0 +
chr18 10116 10266 TAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCT 0 +
chr18 10126 10276 CCTAACCCTAACCCTAACCCTAAACCCTAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC 0 +
chr18 10139 10289 CAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA 0 -
chr18 10148 10298 AACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAA 0 +

bam bed sam • 6.7k views

ADD COMMENT • link updated 10.5 years ago by Istvan Albert 100k • written 10.5 years ago by rhasbunz ▴ 30

score 3 · Answer 1 · 2013-11-08

And just because I've never played around with pysam before (this will also handle indels and splicing):

#!/usr/bin/env python
import pysam
import sys

samfile = pysam.Samfile(sys.argv[1], "rb")
for read in samfile.fetch() :
    strand = '+'
    if(read.is_reverse) :
        strand = '-'
    print("%s\t%i\t%i\t%s\t%c" % (samfile.getrname(read.tid), read.pos, read.aend, read.seq, strand))

samfile.close()

score 3 · Answer 2 · 2013-11-08

3

Entering edit mode

10.5 years ago

Istvan Albert 100k

You can use bedtools bamtobed to do the first transform then some combination of simple scripts to iterate on both the sam and the bed file to replace the name column with the sequence. For example:

bedtools bamtobed -i data.bam > data.bed 
samtools view data.bam > data.sam
paste data.bed data.sam | awk -v "OFS=\t" '{print $1, $2, $3, $16, $6}' | head

ADD COMMENT • link 10.5 years ago by Istvan Albert 100k

0

Entering edit mode

Hi Istvan, I applied your instructions and I was able to create the file I needed. Thank you very much for the help

ADD REPLY • link 10.5 years ago by rhasbunz ▴ 30

0

Entering edit mode

samtools view -F 4 data.bam.

some reads in the samfile may not map to ref ,but bamtobed ignore it.

NC_000915.1 10 88 readname2 0 - readname1 117 NC_000915.1 11 0 *

NC_000915.1 10 65 readname3 0 - readname2 185 NC_000915.1 11 0 21S78M

ADD REPLY • link 4.4 years ago by tsy19900929 • 0

score 1 · Answer 3 · 2013-11-08

Because no one else will decide to post a solution using the samtools C API (this has the benefit of supporting splicing, but there's probably a simpler solution in python or perl):

#include <bam.h>
#include <stdio.h>
#include <inttypes.h>

int main(int argc, char *argv[]) {
    char strand;
    bamFile fp;
    bam_header_t *header;
    bam1_t *read = bam_init1();
    int i, base;

    fp = bam_open(argv[1], "rb");
    header = bam_header_read(fp);
    while(bam_read1(fp, read) > 1) {
        strand = (read->core.flag & BAM_FREVERSE) ? '-' : '+';
        fprintf(stdout, "%s\t%"PRId32"\t%"PRId32"\t", header->target_name[read->core.tid], read->core.pos, bam_calend(&(read->core), bam1_cigar(read)));
        //Get the sequence
        for(i=0; i<read->core.l_qseq; i++) fprintf(stdout, "%c", bam_nt16_rev_table[bam1_seqi(bam1_seq(read), i)]);
        fprintf(stdout, "\t%c\n", strand);
    }
    bam_destroy1(read);
    bam_close(fp);
    return 0;
}

I fully expect this can be done in a single line with the javascript interface that Pierre wrote :)

Istvan Albert · Answer 4 · 2013-11-08

Well the solution is straightforward for majority of the reads but there are some reads that go through soft clipping and it is hard to identify their start and end position (theoretically it is possible if someone is a great programmer) from a bam file. Same is with reads that have insertions and deletions in them. The CIGAR sequence contains the information about all these events but as I said its hard to use CIGAR sequence. If you ignore these things then the solution should be straightforward.

samtools view input.bam |  awk '{print  $3"\t"$4"\t"$4+length($10)-1"\t"$10}'

I know this is not the perfect solution but I am just trying to help.

score 0 · Answer 5 · 2013-11-08

0

Entering edit mode

10.5 years ago

Erik Garrison ★ 2.4k

Are you looking for anything more than:

samtools view file.bam | awk '{ print $3, $4, $4+length($10), $10 }' >sam.bed

? It has everything but the strand of the read.

ADD COMMENT • link 10.5 years ago by Erik Garrison ★ 2.4k

0

Entering edit mode

this works only if the sequence is an exact match

ADD REPLY • link 10.5 years ago by Istvan Albert 100k