Sam To Bed File With Counter Column
3
1
Entering edit mode
8.8 years ago
lsvijfhuizen ▴ 90

Dear All,

I have a relative simple question but i don't know how to solve this. I want to change a SAM file to a BED file. The only thing is that I need to have a BED file with an extra column telling me how many times the tag mapped to a genomic position. The first column is the sequence, and the second column tells how many times a the sequence is present in the file. For example sequence GGGGGGGGG is present 6 times in the file on different locations.

#ID      locations   chromosome    strand  start   end        count
AAAAAAAAA 1          chr12   +          105579297      105579321      1
AAAAAAAAB 1          chr8     +          95642182        95642206        1
GGGGGGGGG 6          chr13   +          66975161        66975185        1
GGGGGGGGG 6          chr13   -           72592620        72592644        1
GGGGGGGGG 6          chr14   -           46332831        46332855        1
GGGGGGGGG 6          chr19   -           32540873        32540897        1
GGGGGGGGG 6          chr1     -           113777719      113777743      1
GGGGGGGGG 6          chr2     +          70297183        70297207        1

sam bed file sequence • 11k views
0
Entering edit mode

would you provide the patterns (e.g.: 'GGGGGGGGG') as an argument of the program ?

0
Entering edit mode

Dear Pierre Lindenbaum, Yes you are right. I think change SAM to BED is an useful first step. After that, I need a kind of counter that count how many of each #ID is present in the file and add that to the file.

6
Entering edit mode
8.8 years ago

The sam2bed helper script will get you halfway there:

$sam2bed < reads.sam > sorted-reads.bed  You'd then feed this BED file to a Perl or other script that keeps a hash table of counts, e.g.: #!/usr/bin/env perl use strict; use warnings; my$inFn = "sorted-reads.bed";
my $outFn = "sorted-reads-with-counts.bed"; my$countRef;
open IN, "< $inFn" or die "could not open$inFn\n";
while (<IN>) {
chomp;
my @elems = split('\t', $_); my$strand = $elems[5]; my$tag = $elems[9]; if (defined$countRef->{$strand}->{$tag}) {
$countRef->{$strand}->{$tag}++; } else {$countRef->{$strand}->{$tag} = 1;
}
}
close IN;
open IN, "< $inFn" or die "could not open$inFn\n";
open OUT, "> $outFn" or die "could not open handle to$outFn\n";
while (<IN>) {
chomp;
my @elems = split('\t', $_); my$strand = $elems[5]; my$tag = $elems[9]; print OUT "$_\t$countRef->{$strand}->{\$tag}\n";
}
close OUT;
close IN;


Output is sent to the file sorted-reads-with-counts.bed, which is the converted BED file with an additional column containing strand-specific counts by tag sequence. You could modify the script to format results, as needed.

0
Entering edit mode

Thank for your answer Alex Reynolds. I haveo nly one stupid question.. The sam2bed.csh file isn't running on my linux command line. Thats because the the following error --> bbms: Command not found. Did you see this before?

0
Entering edit mode

You need to install the bbms script - it's part of the BEDOPS suite, where the sam2bed script comes from. See: http://code.google.com/p/bedops/wiki/sortBed and http://code.google.com/p/bedops/ for downloads. Note: the bbms application is now deprecated in BEDOPS 2; please use sort-bed directly.

0
Entering edit mode

Note: I made a change to the Perl script code which addresses strand-specificity.

4
Entering edit mode
8.3 years ago
Tulip Nandu ▴ 80

The correct code is:

samtools view -Sb bowtie.sam > bowtie.bam
bamToBed -i bowtie.bam > bowtie.bed

2
Entering edit mode
8.8 years ago
KCC ★ 4.0k

Would any aligner give a result like this:

GGGGGGGGG 6          chr13   -           72592620        72592644        1
GGGGGGGGG 6          chr14   -           46332831        46332855        1


identical sequence, but different aligned position ... that seems like that would only happen if the sequence is non-uniquely mapped, and the sequence is assigned at random. In that case, listing this kind of thing multiple times might be deceptive for some applications.

This is started out as a comment, but I will try to answer your question. I can't think of anything that doesn't involve wiring a program or a script or both.

First Approach

My first idea would be to write a script that directly translates your SAM to a BED. In python, I would make a first pass through my SAM and store all of the sequences in a dictionary with the sequence as the key, and the counts of the sequence as the value.

Then I would make a second pass through my SAM file and translate it directly to BED format. I have asked a lot of questions on Biostar about how to get information about where sequences are mapped SAM, so searching on biostar should be enough to put the whole story together.

Second Approach

My second idea is find a program that can translate SAM to BED with the sequence in the name field of the BED. I don't know any that do this, but perhaps there is one out there. Then use a unix trick to sort and count, like "uniq -c"

Third Approach

If you can't find a program that translates directly to BED in the way you want, and you don't want to write on yourself, then ttake your SAM and filter for only the mapped reads using samtools.

samtools view -S -F0x4 input.sam > filtered.sam


Convert your SAM to BED with samtools and bedtools

samtools -b -S filtered.sam > filtered.bam
bamToBed -i filtered.bam > filtered.bed


I think both the filtered.bed file and the filtered.sam file should have the same number of lines and information for each tag in the same order. This is a guess and could be wrong. Anyway, from here you should be able to merge the information in these files using awk or something like that to bed a bed file with the sequence in the name field.

Finally you can use uniq -c as mentioned above.