gc content file for human genome large and difficult to parse into separate chromosome files
2
0
Entering edit mode
6.6 years ago

I have just downloaded the following 'gc content' documenting file: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/hg19.gc5Base.txt.gz

The file contains data for all chromosomes in hg19. *it is not in bed format which makes it difficult to parse in terms of separating into chromosome files *

>head hg19.gc5Base.txt
variableStep chrom=chr1 span=5
10001   40
10006   40
10011   40
10016   60
10021   60
10026   60
10031   40
10036   40
10041   40

Has anyone come accross a way to separate this file into separate chromosome files with a lower resolution- e.g. its currently at a 5 bp resolution but I want a 500000 bp resolution.

My current way of doing this (extremely inefficient) is (done in R):

data=read.table('hg19.gc5Base.txt', sep='\t', header = F, fill=T)


 head(data)
                                   V1 V2
1  variableStep chrom=chr1 span=5 NA
2                           10001 40
3                           10006 40
4                           10011 40
5                           10016 60
6                           10021 60
7                           10026 60
8                           10031 40
9                           10036 40
library(zoo) ## to smooth this data
idx=grep(data[,1], pattern='variable') ### find each position where new chromosome starts
for(i in c(1:length(idx))){
       if(i==23){i='X'}
       if(i==24){i='Y'}
       smoothed=rollapply(data[c(idx[i:i+1]),2], width=2, function(x) mean(x, na.rm=T))
       write.table(smoothed,paste0('chr',i), sep='\t', row.names = F, col.names = F, quote=F)
  }

This is taking an extremely long time... does anyone know of a better more efficient way of doing this?

Assembly • 3.4k views
ADD COMMENT
2
Entering edit mode

The file contains data for all chromosomes in hg19. *it is not in bed format which makes it difficult to parse in terms of separating into chromosome files *

it's a wig file. see http://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/wig2bed.html for converting to bed

ADD REPLY
3
Entering edit mode
6.6 years ago
igor 13k

You can use bedtools nuc to get GC content at any resolution. The input BED file that defines the resolution can be generated with bedtools makewindows.

Check this previous discussion: Calculating Gc Content For All Ccds

ADD COMMENT
1
Entering edit mode
6.6 years ago

Use BEDOPS tools to convert, split and manipulate genomic interval datasets:

Your first file is a WIG file, which you can convert with BEDOPS wig2bed:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg19/hg19.gc5Base.txt.gz | gunzip -c | wig2bed - > hg19.gc5Base.bed

Now that you have GC signal as a sorted BED file, you can quickly split it by chromosome with BEDOPS bedextract:

$ for chr in `bedextract --list-chr hg19.gc5Base.bed`; do echo $chr; bedextract ${chr} hg19.gc5Base.bed > hg19.gc5Base.${chr}.bed; done

If you want the average content over a 500kb window, make 500kb windows over the reference genome, pipe the windows to BEDOPS bedmap and map against the per-chromosome GC signal file. Any mapped GC signal is averaged with the --mean operator:

$ fetchChromSizes hg19 | awk -v OFS="\t" '($1!~/_/){print $1,"0",$2}' | sort-bed - > hg19.bed
$ bedops --chop 500000 hg19.bed > hg19.500kb.bed
$ for chr in `bedextract --list-chr hg19.gc5Base.bed`; do echo $chr; bedmap --chrom ${chr} --echo --mean --delim '\t' hg19.500kb.bed hg19.gc5Base.${chr}.bed > hg19.500kb.meanGC.${chr}.bed; done

This should run very fast. The bedextract and bedmap tools do a binary search to delimit sorted BED files by chromosome.

This problem then becomes trivially parallelizable, though this should run very fast in serial as written.

The fetchChromSizes tool is available from the UCSC Kent utilities suite.

ADD COMMENT

Login before adding your answer.

Traffic: 1959 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6