Question: Identify Domains In Dna Sequences
gravatar for Dimitris Polychronopoulos
7.1 years ago by
United Kingdom/London/Imperial College London

Dear all,

I have the following output from a HMM.

>hg19_dna range=chr7:113770949-115775846 5'pad=0 3'pad=0 strand=+ repeatMasking=none





and also with Ns. For example:


These are all stored in a file. I would like to identify and take those regions with Us and those with Ns and make a BED file out of those according to the coordinates that are specified above. Also to generalize for more symbols, not only two (Us and Ns) but for now two is enough. Anyone that could help???

Thank you in advance

perl python script • 1.6k views
ADD COMMENTlink modified 7.1 years ago by Alex Reynolds29k • written 7.1 years ago by Dimitris Polychronopoulos0
gravatar for Alex Reynolds
7.1 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

You could convert your FASTA input (assuming it is FASTA) with the following Perl script:

#!/usr/bin/env perl

use strict;
use warnings;

while (<>) {
    if ($_ =~ /^>/) {
        my $ln = $_;
        $ln =~ s/^>//s;
        my ($id, $rangeField, $fivePad, $threePad, $strandField, $maskFlag) = split(/[\s]/, $ln);
        my ($rangeKey, $chr, $start, $stop) = split(/[=|:|-]/, $rangeField);
        my ($strandKey, $strand) = split(/[=]/, $strandField);
        my $score = 0;
        print STDOUT join("\t", ($chr, $start, $stop, $id, $score, $strand))."\n";

To use it, for example:

$ < my_hmm.fa
chr7    113770949       115775846       hg19_dna        0       +

To save the result to a sorted BED file, for example:

$ < my_hmm.fa | sort-bed - > my_hmm.bed
ADD COMMENTlink modified 7.1 years ago • written 7.1 years ago by Alex Reynolds29k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1691 users visited in the last hour