Question: Bam To Bed With Read Nucleotide Sequences
2
gravatar for rhasbunz
7.2 years ago by
rhasbunz20
Chile
rhasbunz20 wrote:

Dear All,

I have a relative simple question but i don't know how to solve this. I want to make a BED file from SAM or BAM incluiding nucleotide sequence of each mapped reads.

For example the BAM files contain genomic regions from chromosome 18 covered by short reads obtained from NGS-Illumina experiment (MRE-seq).

I need to create a BED file containing the following columns:

  • the first column is of type character and contains the chromosome of the region (e.g. chr1)
  • the second column is of type numeric and contains the start position of the mapped read
  • the third column is of type numeric and contains the stop position of the mapped read
  • the fourth column contains the nucleotide sequence of the mapped read

An example BED file is:

chr18 9954 10104 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTACCCTAACCC 0 -
chr18 10053 10203 TAACCCTAACCCTAACCCTAAACCCTAACCCTGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT 0 -
chr18 10084 10234 CCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCCTAACCA 0 +
chr18 10112 10262 ACCCTAACCCTAACCCTAACCCTAACCCTTAACCTTAACCCTAACCCTTAACCCTAACCCTAACCCTAACCCTAA 0 +
chr18 10114 10264 CCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCCTAACCCTTAACCCTAACCCTAACCCTAACCCTAACC 0 +
chr18 10116 10266 TAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCT 0 +
chr18 10126 10276 CCTAACCCTAACCCTAACCCTAAACCCTAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC 0 +
chr18 10139 10289 CAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA 0 -
chr18 10148 10298 AACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAA 0 +
bed bam sam • 4.0k views
ADD COMMENTlink modified 7.2 years ago by Istvan Albert ♦♦ 86k • written 7.2 years ago by rhasbunz20
3
gravatar for Devon Ryan
7.2 years ago by
Devon Ryan98k
Freiburg, Germany
Devon Ryan98k wrote:

And just because I've never played around with pysam before (this will also handle indels and splicing):

#!/usr/bin/env python
import pysam
import sys

samfile = pysam.Samfile(sys.argv[1], "rb")
for read in samfile.fetch() :
    strand = '+'
    if(read.is_reverse) :
        strand = '-'
    print("%s\t%i\t%i\t%s\t%c" % (samfile.getrname(read.tid), read.pos, read.aend, read.seq, strand))

samfile.close()
ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Devon Ryan98k
3
gravatar for Istvan Albert
7.2 years ago by
Istvan Albert ♦♦ 86k
University Park, USA
Istvan Albert ♦♦ 86k wrote:

You can use bedtools bamtobed to do the first transform then some combination of simple scripts to iterate on both the sam and the bed file to replace the name column with the sequence. For example:

bedtools bamtobed -i data.bam > data.bed 
samtools view data.bam > data.sam
paste data.bed data.sam | awk -v "OFS=\t" '{print $1, $2, $3, $16, $6}' | head
ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Istvan Albert ♦♦ 86k

Hi Istvan, I applied your instructions and I was able to create the file I needed. Thank you very much for the help

ADD REPLYlink written 7.2 years ago by rhasbunz20

samtools view -F 4 data.bam.

some reads in the samfile may not map to ref ,but bamtobed ignore it.

NC_000915.1 10 88 readname2 0 - readname1 117 NC_000915.1 11 0 *

NC_000915.1 10 65 readname3 0 - readname2 185 NC_000915.1 11 0 21S78M

ADD REPLYlink modified 14 months ago • written 14 months ago by tsy199009290
1
gravatar for Devon Ryan
7.2 years ago by
Devon Ryan98k
Freiburg, Germany
Devon Ryan98k wrote:

Because no one else will decide to post a solution using the samtools C API (this has the benefit of supporting splicing, but there's probably a simpler solution in python or perl):

#include <bam.h>
#include <stdio.h>
#include <inttypes.h>

int main(int argc, char *argv[]) {
    char strand;
    bamFile fp;
    bam_header_t *header;
    bam1_t *read = bam_init1();
    int i, base;

    fp = bam_open(argv[1], "rb");
    header = bam_header_read(fp);
    while(bam_read1(fp, read) > 1) {
        strand = (read->core.flag & BAM_FREVERSE) ? '-' : '+';
        fprintf(stdout, "%s\t%"PRId32"\t%"PRId32"\t", header->target_name[read->core.tid], read->core.pos, bam_calend(&(read->core), bam1_cigar(read)));
        //Get the sequence
        for(i=0; i<read->core.l_qseq; i++) fprintf(stdout, "%c", bam_nt16_rev_table[bam1_seqi(bam1_seq(read), i)]);
        fprintf(stdout, "\t%c\n", strand);
    }
    bam_destroy1(read);
    bam_close(fp);
    return 0;
}

I fully expect this can be done in a single line with the javascript interface that Pierre wrote :)

ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Devon Ryan98k
0
gravatar for Ashutosh Pandey
7.2 years ago by
Philadelphia
Ashutosh Pandey12k wrote:

Well the solution is straightforward for majority of the reads but there are some reads that go through soft clipping and it is hard to identify their start and end position (theoretically it is possible if someone is a great programmer) from a bam file. Same is with reads that have insertions and deletions in them. The CIGAR sequence contains the information about all these events but as I said its hard to use CIGAR sequence. If you ignore these things then the solution should be straightforward.

samtools view input.bam |  awk '{print  $3"\t"$4"\t"$4+length($10)-1"\t"$10}'

I know this is not the perfect solution but I am just trying to help.

ADD COMMENTlink modified 7.2 years ago by Istvan Albert ♦♦ 86k • written 7.2 years ago by Ashutosh Pandey12k
0
gravatar for Erik Garrison
7.2 years ago by
Erik Garrison2.3k
Napoli, IT / UCSC
Erik Garrison2.3k wrote:

Are you looking for anything more than:

samtools view file.bam | awk '{ print $3, $4, $4+length($10), $10 }' >sam.bed

? It has everything but the strand of the read.

ADD COMMENTlink written 7.2 years ago by Erik Garrison2.3k

this works only if the sequence is an exact match

ADD REPLYlink written 7.2 years ago by Istvan Albert ♦♦ 86k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1973 users visited in the last hour
_