Question

Identify Domains In Dna Sequences

0

Entering edit mode

11.2 years ago

Dimitris Polychronopoulos • 0

Dear all,

I have the following output from a HMM.

>hg19_dna range=chr7:113770949-115775846 5'pad=0 3'pad=0 strand=+ repeatMasking=none
?V UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU

?V UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU

?V UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU

?V UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU

?V UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU

and also with Ns. For example:

?V
NNNNNNNNNNNNNNNNNNNNNNNN

These are all stored in a file. I would like to identify and take those regions with Us and those with Ns and make a BED file out of those according to the coordinates that are specified above. Also to generalize for more symbols, not only two (Us and Ns) but for now two is enough. Anyone that could help???

Thank you in advance

perl python script • 2.2k views

ADD COMMENT • link updated 11.2 years ago by Alex Reynolds 35k • written 11.2 years ago by Dimitris Polychronopoulos • 0

score 0 · Answer 1 · 2013-03-04

You could convert your FASTA input (assuming it is FASTA) with the following Perl script:

#!/usr/bin/env perl

use strict;
use warnings;

while (<>) {
    if ($_ =~ /^>/) {
        chomp;
        my $ln = $_;
        $ln =~ s/^>//s;
        my ($id, $rangeField, $fivePad, $threePad, $strandField, $maskFlag) = split(/[\s]/, $ln);
        my ($rangeKey, $chr, $start, $stop) = split(/[=|:|-]/, $rangeField);
        my ($strandKey, $strand) = split(/[=]/, $strandField);
        my $score = 0;
        print STDOUT join("\t", ($chr, $start, $stop, $id, $score, $strand))."\n";
    }
}

To use it, for example:

$ hmm_fasta2bed.pl < my_hmm.fa
chr7    113770949       115775846       hg19_dna        0       +
...

To save the result to a sorted BED file, for example:

$ hmm_fasta2bed.pl < my_hmm.fa | sort-bed - > my_hmm.bed