Count 5'End Mapped To A Specific Genomic Position
7
6
Entering edit mode
10.6 years ago
ifreecell ▴ 220

I got several SAM/BAM files, and I am interested in 5'ends of the mapped reads. Is there any tools or scripts to count how many 5'ends are mapped at a specific genomic position?

N.B. I am not try to count the total reads number mapped to the specific genomic position.

For example, in the following window, I get 5 reads, 3 of them have a 5'ends at 14488, the other 2 have a 5'ends at 14487.

enter image description here

And I want produce a table like this.

enter image description here.

counts bam sam reads • 10k views
ADD COMMENT
4
Entering edit mode
9.7 years ago
ifreecell ▴ 220
bedtools genomecov -d -5 -ibam input.bam -g genome.fa > output.txt

More options see bedtools documents at http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html

ADD COMMENT
5
Entering edit mode
10.6 years ago
dario.garvan ▴ 530

Here is the simplest solution, using R.

library(Rsamtools)
allReads <- scanBam("your.bam")
firstBases <- resize(allReads, 1)
startCounts <- coverage(firstBases)

The resize function takes into account the strand of the reads.

ADD COMMENT
0
Entering edit mode

Hey Dario, I was working on a similar idea and I could not work out your way. I finally managed to make it work by changing the way you read your .bam file because resize() can only be applied to "GRanges" object:

allReads <- readGAlignments("your.bam")

allReads <- as(allReads, "GRanges")

firstBases <- resize(allReads, 1)

startCounts <- coverage(firstBases)

What I really want to find out however is a way to include another column that contains the nucleotide in each position, so we can have the count and also the reference nucleotide of the specific count? Is there any simple way to attach another column (ie. the reference in .fasta), or maintain the info while scanning the .bam file?

ADD REPLY
0
Entering edit mode

Yes, functions sometimes become defunct and other functions provide the same functionality. To get the TSS nucleotide, you can use getSeq from BSgenome with a relevant BSgenome object (eg. Hsapiens in BSgenome.Hsapiens.UCSC.hg19).

ADD REPLY
2
Entering edit mode
10.6 years ago
Dan D 7.4k

My initial thought is to use bedtools bamtobed to make a BED file from your BAMs. Then, it's easy to use coding strand data and the start/end positions to get what you want.

Here's my reasoning:

Let's say you have the following segment of a BED file:

chr7    127471196  127472363  Pos1  0  +
chr7    127472363  127473530  Pos2  0  +
chr7    127473530  127474697  Pos3  0  +
chr7    127474697  127475864  Pos4  0  +
chr7    127475864  127477031  Neg1  0  -
chr7    127477031  127478198  Neg2  0  -
chr7    127478198  127479365  Neg3  0  -
chr7    127479365  127480532  Pos5  0  +
chr7    127480532  127481699  Neg4  0  -

Pos1 has its 5' end at the number in column 2: 127471196

Neg1 has its 5' end at the number in column 3: 127477031

In that way it's pretty easy to create a table of 5' starting points based on the BED info. Let me know if you need coding help.

EDIT: Here's a script I busted out which will give you what you need. It takes a BED file through STDIN and spits a two-column tab-delimited table to STDOUT. Call it like this: perl scriptName.pl < [yourBed] > [yourTableFile]

#!/usr/bin/perl

use warnings;
use strict;
use Data::Dumper;

my $lineCounter = 1;
my %startPoints;

while(<>){
        my @columns = split /\t/, $_;
        my $startPoint;
        if($columns[5] eq '+'){
                $startPoint = $columns[1];
                ++$startPoints{$startPoint};
        } elsif ($columns[5] eq '-') {
                $startPoint = $columns[2];
                ++$startPoints{$startPoint};
        } else {
                die "Goofy BED Data on line $lineCounter:\n$_";
        }
        ++$lineCounter;
}

for my $key (sort {$a <=> $b} keys %startPoints) {
        print "$key\t$startPoints{$key}\n";
}

<3

ADD COMMENT
0
Entering edit mode

Thank you @Deedee, It would be really nice of you if you can paste the script. By the way, does the value of the fifth column in the bed file matter? In one of my bed file, only 53 reads have the value 0, while 15011 have the value 60.

ADD REPLY
0
Entering edit mode

The fifth column is the "score" column and its usage is context-sensitive. Some tools will put read depth here, or some sort of alignment certainty metric, or copy number, or the like. I'll write up a quick script for you and post it as an edit to my answer.

ADD REPLY
0
Entering edit mode

Hi Deedee, Thank you. Your script was exactly what I expected, except I need to tweak the BED files which were produced by "bedtools bamtobed -i input.bam > output.bed".

The script terminated at the first line, but it worked perfectly after I manually appended < TAB> to the last column of bed files.

I am using OS X 10.7, with perl v5.12.3, and BEDTools v2.17.0‎. BAM files were exported from CLC Genomics Workbench on Windows 7 machine.

ADD REPLY
1
Entering edit mode
10.6 years ago

as the location of the read is the 5' position of the clipped read, you could use the following command:

samtools view -F 4  sorted.bam  | cut -d '   ' -f3,4 | uniq -c

.

$ samtools view -F 4 examples/ex1_sorted.bam  | cut -d '   ' -f3,4 | uniq -c |head
      1 seq1    1
      1 seq1    3
      1 seq1    5
      1 seq1    6
      1 seq1    9
      2 seq1    13
      1 seq1    15
      1 seq1    18
      2 seq1    22
      1 seq1    24
ADD COMMENT
2
Entering edit mode

If the reads are on the reverse strand you won't have the 5' end but the 3' end. In that case you'll have to add the read length to the 3' end position to get the 5'end position

ADD REPLY
0
Entering edit mode

@Xtof I think you're wrong: http://samtools.sourceforge.net/SAMv1.pdf "POS : 1-based leftmost mapping POSition of the rst matching base. The rst base in a reference sequence has coordinate 1"

ADD REPLY
0
Entering edit mode

@Xtof ..ah ok... unless you're talking about the 5' end of the read itlself (!= 5' of alignment)

ADD REPLY
0
Entering edit mode

This is interesting. Let me see if I understand this: First you're using the -F flag to filter out any unmapped reads. You're piping the SAM data into the cut command, which specifies a tab delimiter and yanks out the contig (chromosome) name and start position from each SAM read. Then you're using uniq to count unique combinations of the two columns. Pretty rad!

ADD REPLY
1
Entering edit mode
9.7 years ago
Charles Plessy ★ 2.9k

Have a look to the python script level1.py in the PromoterPipeline written by Michiel de Hoon and released in the supporting data of our article in BMC Genomics (Harbers et al., 2014). It can take mutiple BAM files directly as input and make an expression table of 5′ ends. It is written in Python and needs Pysam, that you can find for instance as a package on Debian systems.

ADD COMMENT
1
Entering edit mode
8.8 years ago

Pileup from the BBMap package can do this:

pileup.sh in=file.sam basecov=coverage.txt startcov=t

Without the "startcov=t" flag, it will give the coverage at each position; with that flag, it just gives the count of reads starting at each position.

ADD COMMENT
0
Entering edit mode
10.6 years ago
ifreecell ▴ 220

I add a block of opendir code to @Deedee's code. I post it here in case I need it someday.

#! /usr/bin/perl
# cd to your BED files directory before run this script

use strict;
use warnings;

# It will read all files ended with bed under current directory.
opendir(DIR, ".");
my @files = sort(grep(/bed$/, readdir(DIR)));
closedir(DIR);

my $i = 0;

foreach (@files) {
      open(IN, "<$_") or die "fail to read file  $_\n";
      open(OUT, ">$_.txt") or die "fail to write file $_.out\n";

      print "Start working on $files[$i]...\n"; $i++;

      my $lineCounter = 1;
      my %startPoints;

      while(<IN>) {
              chomp;
              my @columns = split("\t", $_);
              my $startPoint;
              if($columns[5] eq '+') {
                      $startPoint = $columns[1];
                      ++$startPoints{$startPoint};
              } elsif($columns[5] eq '-') {
                      $startPoint = $columns[2];
                      ++$startPoints{$startPoint};
              } else{
                      die "Error $files[$i] on line $lineCounter:\n$_";
              }   
              ++$lineCounter;
      }   

      for my $key (sort {$a <=> $b} keys %startPoints) {
              print OUT "$key\t$startPoints{$key}\n";
      }   

      close IN; 
      close OUT;
}
ADD COMMENT

Login before adding your answer.

Traffic: 1915 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6