Question: chrom.sizes computed locally
ypriverol wrote:

Hi all:

To convert BED files to bigBed with the UCSC tool bedToBigBed, a chrom.sizes file is needed. How can these sizes be computed without querying the UCSC or Ensembl APIs?

ensembl ucsc bed bigbed • 224 views
modified 6 days ago by cpad0112 (1.9k) • written 8 days ago by ypriverol (0)

How can this number be computed without querying the UCSC

What do you mean by "querying"? HTTP? MySQL? This is a small file; what is wrong with keeping a local copy? Or do you have a local copy of the FASTA sequences?

written 8 days ago by Pierre Lindenbaum (98k)

I would like to compute it from the FASTA files they provide on the FTP site.

modified 8 days ago • written 8 days ago by ypriverol (0)

The problem is that none of the APIs say which files these numbers are computed from.

written 8 days ago by ypriverol (0)

Do you know how to do it from the FASTA files Ensembl provides here: https://m.ensembl.org/info/data/ftp/index.html

written 8 days ago by ypriverol (0)
~$ curl -s "ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.chromosome.Y.fa.gz" | gunzip -c | grep -v '^>' | tr -d '\n' | wc -c
57227415
written 8 days ago by Pierre Lindenbaum (98k)
$ curl -s "ftp://ftp.ensembl.org/pub/current_mysql/homo_sapiens_core_90_38/seq_region.txt.gz" | gunzip -c| awk '($3=="4")'  | grep -v CHR | cut -f 2,4 | sort -k2,2n
MT  16569
21  46709983
22  50818468
Y   57227415
19  58617616
20  64444167
18  80373285
17  83257441
16  90338345
15  101991189
14  107043718
13  114364328
12  133275309
10  133797422
11  135086622
9   138394717
8   145138636
X   156040895
7   159345973
6   170805979
5   181538259
4   190214555
3   198295559
2   242193529
1   248956422
written 8 days ago by Pierre Lindenbaum (98k)

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

written 8 days ago by genomax (33k)

France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum wrote:
~$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -N -e 'select chrom,size from chromInfo' > out.txt

$ cat out.txt
chr1    249250621
chr2    243199373
chr3    198022430
chr4    191154276
chr5    180915260
(....)

or just

 curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz" | gunzip -c | cut -f 1,2 > out.txt

These numbers are precomputed from the FASTA genome files, e.g. for chr1:

$ curl -s "http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chr1.fa.gz" | gunzip -c | grep -v '^>' | tr -d '\n' | wc -c
249250621
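
The same per-sequence count can be sketched for a local FASTA file that holds several records, so that one pass yields a complete chrom.sizes table. This is only an illustrative sketch and not part of the original answer: the function name fasta_chrom_sizes and the file name genome.fa are hypothetical, and it assumes the sequence name is the first word after '>'.

```shell
# Sketch: print "name<TAB>length" for every record of a FASTA file.
# Reads the files given as arguments, or standard input if none are given.
fasta_chrom_sizes() {
    awk '
        /^>/ {
            # A new header: report the previous record, then start counting again.
            if (name != "") printf "%s\t%d\n", name, len
            name = substr($1, 2)   # drop the leading ">" and keep the first word
            len = 0
            next
        }
        {
            sub(/\r$/, "")         # tolerate Windows line endings
            len += length($0)      # sequence lines only; newlines are not counted
        }
        END { if (name != "") printf "%s\t%d\n", name, len }
    ' "$@"
}

# Hypothetical usage (genome.fa is an assumed local file):
#   fasta_chrom_sizes genome.fa > genome.chrom.sizes
#   gunzip -c genome.fa.gz | fasta_chrom_sizes > genome.chrom.sizes
```

Like the grep/tr/wc pipeline, this counts every character on the sequence lines, including N and soft-masked lowercase bases.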
modified 8 days ago • written 8 days ago by Pierre Lindenbaum (98k)

genecats.ucsc wrote:

The chrom.sizes file is computed in the following way for all assemblies at UCSC:

faToTwoBit organism.fa organism.2bit
twoBitInfo organism.2bit stdout | sort -k2rn > organism.chrom.sizes

If you know the URL to a 2bit file we've already made, twoBitInfo accepts a URL like so:

twoBitInfo -udcDir=. http://genome-test.cse.ucsc.edu/~hiram/hubs/Plants/ricCom1/ricCom1.2bit stdout | sort -k2nr > ricCom1.chrom.sizes

If you want the chrom.sizes file for a particular assembly, you can download it from a URL like the following: http://hgdownload.cse.ucsc.edu/goldenPath/$db/bigZips/$db.chrom.sizes

where $db is the assembly name like hg38, mm10, anoCar2, panTro5, etc.
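
As a small sketch of how that URL pattern expands, the helper below builds the address for a given assembly name. The function name chrom_sizes_url is hypothetical, and the sketch assumes the path layout $db/bigZips/$db.chrom.sizes under goldenPath.

```shell
# Sketch: build the download URL of the precomputed chrom.sizes file
# for an assembly name such as hg38 or mm10.
# Assumes the layout goldenPath/<db>/bigZips/<db>.chrom.sizes.
chrom_sizes_url() {
    db="$1"
    printf 'http://hgdownload.cse.ucsc.edu/goldenPath/%s/bigZips/%s.chrom.sizes\n' "$db" "$db"
}

# Hypothetical usage (needs network access):
#   curl -s "$(chrom_sizes_url hg38)" > hg38.chrom.sizes
```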

You can find the faToTwoBit and twoBitInfo programs in our list of publicly available utilities in the directory appropriate to your operating system:

http://hgdownload.soe.ucsc.edu/admin/exe/

If you have further questions about the UCSC Genome Browser or our utilities or data, feel free to send an email to one of the mailing lists below:

  • genome@soe.ucsc.edu for general questions (public list)
  • genome-www@soe.ucsc.edu for questions concerning private data (private list)
  • genome-mirror@soe.ucsc.edu for questions concerning the setup and running of your own UCSC Genome Browser installation

ChrisL from the UCSC Genome Browser

written 8 days ago by genecats.ucsc (340)

cpad0112 wrote:

Try faCount from the UCSC kent utils. Usage and output look like this:

$ faCount hg38.fa

#seq    len A   C   G   T   N   cpg
chr1    248956422   67070277    48055043    48111528    67244164    18475410    2375159
chr10   133797422   38875926    27639505    27719976    39027555    534460  1388978
chr11   135086622   39286730    27903257    27981801    39361954    552880  1333114
chr11_KI270721v1_random 100316  18375   31042   31012   19887   0   3394
.
.
.
total   3209286105  898285419   623727342   626335137   900967885   159970322   30979743

For your purpose, the first two columns suffice.
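
Keeping only those two columns also means dropping the "#seq" header line and the final "total" line. A minimal sketch, assuming faCount output in the layout shown above; the function name facount_to_chrom_sizes is hypothetical:

```shell
# Sketch: turn faCount output into a two-column chrom.sizes table.
# Skips the "#seq ..." header, the "total" summary row, and blank or dotted filler lines.
facount_to_chrom_sizes() {
    awk 'BEGIN { OFS = "\t" }
         !/^#/ && $1 != "total" && NF >= 2 { print $1, $2 }' "$@"
}

# Hypothetical usage (hg38.fa is an assumed local file):
#   faCount hg38.fa | facount_to_chrom_sizes > hg38.chrom.sizes
```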

modified 6 days ago • written 6 days ago by cpad0112 (1.9k)

Cologne
kashifalikhan007 wrote:

Try samtools

samtools faidx genome.fa

cut -f1,2 genome.fa.fai > genome.size
written 7 days ago by kashifalikhan007 (40)

Cambridge, MA
Matt Shirley wrote:
$ pip install pyfaidx
$ faidx -i chromsizes input.fa > output.chromsizes
modified 7 days ago • written 7 days ago by Matt Shirley (7.8k)