Question: chrom.sizes computed locally
ypriverol wrote:

Hi all:

In order to convert BED files to bigBed with the UCSC tool bedToBigBed, a chrom.sizes file is needed. How can these sizes be computed without querying the UCSC and Ensembl APIs?

ensembl ucsc bed bigbed
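For reference, bedToBigBed takes the chrom.sizes file as its second argument; a minimal sketch with placeholder file names (input.bed, hg38.chrom.sizes and output.bb are assumptions, not files from this thread):

# chrom.sizes is a two-column, tab-separated file: <sequence name> <length in bases>
# bedToBigBed expects the BED input sorted by chromosome, then by start position
sort -k1,1 -k2,2n input.bed > input.sorted.bed
bedToBigBed input.sorted.bed hg38.chrom.sizes output.bb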

"How can this number be computed without querying the UCSC"

What do you mean by "querying"? HTTP? MySQL? This is a small file; what's wrong with having a local copy? Or do you have a local copy of the FASTA sequences?

— Pierre Lindenbaum

I would like to do it based on the FASTA files they provide on the FTP site.

— ypriverol

The problem is that none of the APIs say which files these numbers are computed from.

— ypriverol

Do you know how to do it from the FASTA files Ensembl provides here: https://m.ensembl.org/info/data/ftp/index.html ?

— ypriverol
~$ curl -s "ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.chromosome.Y.fa.gz" | gunzip -c | grep -v '^>' | tr -d '\n' | wc -c
57227415
— Pierre Lindenbaum
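The same counting idea generalizes to a multi-sequence FASTA in one pass; a hedged sketch (genome.fa.gz is a placeholder for whichever Ensembl FASTA you downloaded):

# print "<sequence name><TAB><length>" for every record of a gzipped multi-FASTA
gunzip -c genome.fa.gz | awk '
    /^>/ { if (name != "") print name "\t" len; name = substr($1, 2); len = 0; next }
    { len += length($0) }
    END { if (name != "") print name "\t" len }
' > chrom.sizes
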
$ curl -s "ftp://ftp.ensembl.org/pub/current_mysql/homo_sapiens_core_90_38/seq_region.txt.gz" | gunzip -c| awk '($3=="4")'  | grep -v CHR | cut -f 2,4 | sort -k2,2n
MT  16569
21  46709983
22  50818468
Y   57227415
19  58617616
20  64444167
18  80373285
17  83257441
16  90338345
15  101991189
14  107043718
13  114364328
12  133275309
10  133797422
11  135086622
9   138394717
8   145138636
X   156040895
7   159345973
6   170805979
5   181538259
4   190214555
3   198295559
2   242193529
1   248956422
— Pierre Lindenbaum
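A hedged variation that looks up the chromosome coord_system_id instead of hard-coding the "4" above, assuming the standard Ensembl core-schema column order (coord_system: coord_system_id, species_id, name, version, rank, attrib; seq_region: seq_region_id, name, coord_system_id, length):

# resolve the default "chromosome" coordinate system, then list its seq_region lengths
BASE="ftp://ftp.ensembl.org/pub/current_mysql/homo_sapiens_core_90_38"
CS=$(curl -s "$BASE/coord_system.txt.gz" | gunzip -c \
    | awk -F '\t' '$3 == "chromosome" && $6 ~ /default_version/ { print $1; exit }')
curl -s "$BASE/seq_region.txt.gz" | gunzip -c \
    | awk -F '\t' -v cs="$CS" '$3 == cs' | grep -v CHR | cut -f 2,4 | sort -k2,2n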

Pierre Lindenbaum wrote:
~$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -N -e 'select chrom,size from chromInfo' > out.txt

$ cat out.txt
chr1    249250621
chr2    243199373
chr3    198022430
chr4    191154276
chr5    180915260
(....)

or just

 curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz" | gunzip -c | cut -f 1,2 > out.txt

These numbers are pre-computed from the FASTA genome files, e.g. for chr1:

$ curl -s "http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chr1.fa.gz" | gunzip -c | grep -v '^>' | tr -d '\n' | wc -c
249250621
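The same per-chromosome count can be repeated for every primary chromosome to rebuild the whole file locally; a hedged sketch for hg19 (assumes the per-chromosome chr*.fa.gz files under that directory, and skips random/alt contigs):

# count bases per chromosome and write a chrom.sizes-style file
for c in {1..22} X Y M; do
    n=$(curl -s "http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chr${c}.fa.gz" | gunzip -c | grep -v '^>' | tr -d '\n' | wc -c)
    printf "chr%s\t%d\n" "$c" "$n"
done > chrom.sizes
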
genecats.ucsc wrote:

The chrom.sizes file is computed in the following way for all assemblies at UCSC:

faToTwoBit organism.fa organism.2bit
twoBitInfo organism.2bit stdout | sort -k2rn > organism.chrom.sizes

If you know the URL to a 2bit file we've already made, twoBitInfo accepts a URL like so:

twoBitInfo -udcDir=. http://genome-test.cse.ucsc.edu/~hiram/hubs/Plants/ricCom1/ricCom1.2bit stdout | sort -k2nr > ricCom1.chrom.sizes

If you want the chrom.sizes file for a particular assembly, you can download it from a URL like the following: http://hgdownload.cse.ucsc.edu/goldenPath/$db/bigZips/$db.chrom.sizes

where $db is the assembly name like hg38, mm10, anoCar2, panTro5, etc.
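For example, with $db set to hg38 that pattern gives a direct download (a sketch; swap in your assembly name):

# fetch the pre-computed chrom.sizes file for hg38
curl -O "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes"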

You can find the faToTwoBit and twoBitInfo programs in our list of publicly available utilities in the directory appropriate to your operating system:

http://hgdownload.soe.ucsc.edu/admin/exe/

If you have further questions about the UCSC Genome Browser or our utilities or data, feel free to send an email to one of the mailing lists below:

  • genome@soe.ucsc.edu for general questions (public list)
  • genome-www@soe.ucsc.edu for questions concerning private data (private list)
  • genome-mirror@soe.ucsc.edu for questions concerning the setup and running of your own UCSC Genome Browser installation

ChrisL from the UCSC Genome Browser

cpad0112 wrote:

Try faCount from the UCSC Kent utils. Usage and output look like this:

$ faCount hg38.fa

#seq    len A   C   G   T   N   cpg
chr1    248956422   67070277    48055043    48111528    67244164    18475410    2375159
chr10   133797422   38875926    27639505    27719976    39027555    534460  1388978
chr11   135086622   39286730    27903257    27981801    39361954    552880  1333114
chr11_KI270721v1_random 100316  18375   31042   31012   19887   0   3394
.
.
.
total   3209286105  898285419   623727342   626335137   900967885   159970322   30979743

For your purpose, the first two columns would suffice.
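A hedged follow-up to keep just those two columns in chrom.sizes format (drops the "#seq" header and the trailing "total" summary row shown above):

# sequence name and length only
faCount hg38.fa | grep -v '^#' | grep -v '^total' | cut -f 1,2 > hg38.chrom.sizes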

kashifalikhan007 wrote:

Try samtools

samtools faidx genome.fa

cut -f1,2 genome.fa.fai > genome.size
Matt Shirley wrote:
$ pip install pyfaidx
$ faidx -i chromsizes input.fa > output.chromsizes