Bam To Bed With Read Nucleotide Sequences
6
2
Entering edit mode
8.5 years ago
rhasbunz ▴ 20

Dear All,

I have a relative simple question but i don't know how to solve this. I want to make a BED file from SAM or BAM incluiding nucleotide sequence of each mapped reads.

For example the BAM files contain genomic regions from chromosome 18 covered by short reads obtained from NGS-Illumina experiment (MRE-seq).

I need to create a BED file containing the following columns:

  • the first column is of type character and contains the chromosome of the region (e.g. chr1)
  • the second column is of type numeric and contains the start position of the mapped read
  • the third column is of type numeric and contains the stop position of the mapped read
  • the fourth column contains the nucleotide sequence of the mapped read

An example BED file is:

chr18 9954 10104 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTACCCTAACCC 0 -
chr18 10053 10203 TAACCCTAACCCTAACCCTAAACCCTAACCCTGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT 0 -
chr18 10084 10234 CCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCCTAACCA 0 +
chr18 10112 10262 ACCCTAACCCTAACCCTAACCCTAACCCTTAACCTTAACCCTAACCCTTAACCCTAACCCTAACCCTAACCCTAA 0 +
chr18 10114 10264 CCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCCTAACCCTTAACCCTAACCCTAACCCTAACCCTAACC 0 +
chr18 10116 10266 TAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCT 0 +
chr18 10126 10276 CCTAACCCTAACCCTAACCCTAAACCCTAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC 0 +
chr18 10139 10289 CAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA 0 -
chr18 10148 10298 AACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAA 0 +
bam bed sam • 4.8k views
ADD COMMENT
3
Entering edit mode
8.5 years ago

And just because I've never played around with pysam before (this will also handle indels and splicing):

#!/usr/bin/env python
import pysam
import sys

samfile = pysam.Samfile(sys.argv[1], "rb")
for read in samfile.fetch() :
    strand = '+'
    if(read.is_reverse) :
        strand = '-'
    print("%s\t%i\t%i\t%s\t%c" % (samfile.getrname(read.tid), read.pos, read.aend, read.seq, strand))

samfile.close()
ADD COMMENT
3
Entering edit mode
8.5 years ago

You can use bedtools bamtobed to do the first transform then some combination of simple scripts to iterate on both the sam and the bed file to replace the name column with the sequence. For example:

bedtools bamtobed -i data.bam > data.bed 
samtools view data.bam > data.sam
paste data.bed data.sam | awk -v "OFS=\t" '{print $1, $2, $3, $16, $6}' | head
ADD COMMENT
0
Entering edit mode

Hi Istvan, I applied your instructions and I was able to create the file I needed. Thank you very much for the help

ADD REPLY
0
Entering edit mode

samtools view -F 4 data.bam.

some reads in the samfile may not map to ref ,but bamtobed ignore it.

NC_000915.1 10 88 readname2 0 - readname1 117 NC_000915.1 11 0 *

NC_000915.1 10 65 readname3 0 - readname2 185 NC_000915.1 11 0 21S78M

ADD REPLY
1
Entering edit mode
8.5 years ago

Because no one else will decide to post a solution using the samtools C API (this has the benefit of supporting splicing, but there's probably a simpler solution in python or perl):

#include <bam.h>
#include <stdio.h>
#include <inttypes.h>

int main(int argc, char *argv[]) {
    char strand;
    bamFile fp;
    bam_header_t *header;
    bam1_t *read = bam_init1();
    int i, base;

    fp = bam_open(argv[1], "rb");
    header = bam_header_read(fp);
    while(bam_read1(fp, read) > 1) {
        strand = (read->core.flag & BAM_FREVERSE) ? '-' : '+';
        fprintf(stdout, "%s\t%"PRId32"\t%"PRId32"\t", header->target_name[read->core.tid], read->core.pos, bam_calend(&(read->core), bam1_cigar(read)));
        //Get the sequence
        for(i=0; i<read->core.l_qseq; i++) fprintf(stdout, "%c", bam_nt16_rev_table[bam1_seqi(bam1_seq(read), i)]);
        fprintf(stdout, "\t%c\n", strand);
    }
    bam_destroy1(read);
    bam_close(fp);
    return 0;
}

I fully expect this can be done in a single line with the javascript interface that Pierre wrote :)

ADD COMMENT
0
Entering edit mode
8.5 years ago

Well the solution is straightforward for majority of the reads but there are some reads that go through soft clipping and it is hard to identify their start and end position (theoretically it is possible if someone is a great programmer) from a bam file. Same is with reads that have insertions and deletions in them. The CIGAR sequence contains the information about all these events but as I said its hard to use CIGAR sequence. If you ignore these things then the solution should be straightforward.

samtools view input.bam |  awk '{print  $3"\t"$4"\t"$4+length($10)-1"\t"$10}'

I know this is not the perfect solution but I am just trying to help.

ADD COMMENT
0
Entering edit mode
8.5 years ago
Erik Garrison ★ 2.3k

Are you looking for anything more than:

samtools view file.bam | awk '{ print $3, $4, $4+length($10), $10 }' >sam.bed

? It has everything but the strand of the read.

ADD COMMENT
0
Entering edit mode

this works only if the sequence is an exact match

ADD REPLY

Login before adding your answer.

Traffic: 2217 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6