Non-Redundant List Of GRCh37 Coordinates Covered By Segmental Duplications
12.5 years ago
Dgmacarthur ▴ 310

Hi all,

Hopefully an easy one: I'm looking to get a file containing the coordinates of every base in the human build 37 genome that is covered by a segmental duplication (e.g. a BED file).

I've downloaded the full set of seg dups from http://humanparalogy.gs.washington.edu/build37/build37.htm but these appear to contain a redundant set of all pairwise locations of segmental duplications. I could write some code to merge these, but has anyone already generated a non-redundant file that simply tells me whether a given GRCh37 base is in fact spanned by a seg dup?


Please note that different databases may give you vastly different results. The first question to ask is "which is the most accurate" instead of "which is the most convenient". Merging overlapping regions in a BED is extremely easy. You can use bedtools, or just one line of awk.
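
For instance, a minimal awk sketch of such a merge, assuming a three-column, tab-delimited file (segdups.bed is just a placeholder name):

sort -k1,1 -k2,2n segdups.bed | awk 'BEGIN{OFS="\t"}
  {
  if($1==c && $2<=e) { if($3>e) e=$3; next }   # record overlaps the current block: extend it
  if(c!="") print c,s,e                        # emit the finished block
  c=$1; s=$2; e=$3                             # start a new block
  }
  END { if(c!="") print c,s,e }' > segdups_merged.bed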

12.5 years ago

Is this what you are looking for: http://genome.ucsc.edu/cgi-bin/hgTrackUi?g=genomicSuperDups

If you want BED format, use the Table Browser for the genomicSuperDups track: select BED from the "output format" dropdown menu, click "get output", and then click "get BED" on the next page.
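
If you would rather stay on the command line, the same track can be pulled from the UCSC public MySQL server and collapsed with bedtools in one pipeline (a sketch, assuming bedtools is installed; host, database and table names are the ones used in the C++ answer below, and the output file name is a placeholder):

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -N -A -D hg19 \
  -e 'select chrom,chromStart,chromEnd from genomicSuperDups' |
  sort -k1,1 -k2,2n |
  bedtools merge -i stdin > segdups_hg19_merged.bed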


Just a note that if you do this from Galaxy, you can then merge the overlapping BED records and get the unique regions covered by at least one segmental duplication.

12.5 years ago

The following C++ reads tab-delimited (chrom, chromStart, chromEnd) rows on stdin and writes a true/false (0/1) fixedStep WIG file:

#include <iostream>
#include <vector>
#include <string>
#include <cstdlib>

using namespace std;

/* dump the 0/1 coverage of one chromosome as a fixedStep WIG block */
static void wig(string& chrom, vector<bool>& bits)
    {
    size_t i=0;
    if(chrom.empty()) return;
    /* skip the leading uncovered bases */
    while(i< bits.size() && bits[i]==false) i++;
    if(i==bits.size()) return; /* nothing covered on this chromosome */
    /* chromStart is 0-based but WIG fixedStep positions are 1-based, hence i+1 */
    cout << "fixedStep chrom="<< chrom <<" start="<<(i+1)<<" step=1 span=1" << endl;

    while(i< bits.size())
        {
        cout << (int)bits[i++]<< endl;
        }
    }

int main(int argc,char** argv)
    {
    vector<bool> bits; /* one flag per base of the current chromosome */
    string chrom;
    string line;
    /* read tab-delimited chrom/chromStart/chromEnd rows, assumed grouped by chromosome */
    while(getline(cin,line,'\n'))
        {
        if(line.empty() || line[0]=='#') continue; /* skip blank and comment lines */
        string::size_type n1=line.find('\t',0);
        if(n1==0 || n1==string::npos) continue;
        string::size_type n2=line.find('\t',n1+1);
        if(n2==string::npos) continue;
        string s=line.substr(0,n1);
        if(s.compare(chrom)!=0)
            {
            /* chromosome changed: emit the previous one and reset the bitmap */
            wig(chrom,bits);
            bits.clear();
            chrom=s;
            }
        /* chromStart: the substring keeps the trailing tab checked just below */
        s=line.substr(n1+1,n2-n1);
        char* p2;
        long chromStart=strtol(s.c_str(),&p2,10);

        if(chromStart< 0 || *p2!='\t')
            {
            cerr << "bad start in " << s << endl;   
            continue;
            }
        /* chromEnd is everything after the second tab */
        s=line.substr(n2+1);

        long chromEnd=strtol(s.c_str(),&p2,10);
        if(chromEnd< chromStart || (*p2!=0 && *p2!='\t'))
            {
            cerr << "bad end in " << s << endl; 
            continue;
            }

        /* grow the per-base bitmap if this record extends past it */
        if(bits.size()<=(size_t)chromEnd)
            {
            bits.resize(chromEnd,false);
            }
        /* mark every base of the half-open interval [chromStart,chromEnd) */
        while(chromStart<chromEnd)
            {
            bits[(size_t)chromStart]=true;
            ++chromStart;
            }
        }
    /* flush the last chromosome */
    wig(chrom,bits);
    return 0;
    }

test:

$ g++ jeter.cpp
$ mysql  --user=genome -N --host=genome-mysql.cse.ucsc.edu -A   -D hg19 -e "select chrom,chromStart,chromEnd from genomicSuperDups where chromEnd <100000 " |\
./a.out  |\
grep chrom -A 2 -B 2

fixedStep chrom=chr11_gl000202_random start=1 step=1 span=1
1
1
--
1
1
fixedStep chrom=chr17_gl000203_random start=9 step=1 span=1
1
1
--
1
1
fixedStep chrom=chr17_gl000205_random start=1 step=1 span=1
1
1
--
1
1
fixedStep chrom=chr1_gl000192_random start=1271 step=1 span=1
1