Question

Hg19 Gc Model

0

Entering edit mode

10.3 years ago

Astoulpho ▴ 10

Does anyone know where to find (or how to create) a gc-content file that serves as input to PennCNV? The software has a gc model, but it is for hg18 and its format is different fom gc content file found in UCSC genomes.

cnv hg19 gc • 4.9k views

ADD COMMENT • link updated 10.0 years ago by Bingley ▴ 10 • written 10.3 years ago by Astoulpho ▴ 10

Ram · Answer 1 · 2014-01-13

The following bash script should do the job:

IFS="
"

echo -e "Name\tChr\tPosition\tGC"
curl -s "http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/snp138.txt.gz" | gunzip -c |\
cut -f 2,3,5|\
while read L
do
    C=`echo $L | cut -f 1`
    P=`echo $L | cut -f 2`
    R=`echo $L | cut -f 3`
    S=`echo $L |awk '{beg=int($2)-500; if(beg<1) beg=1; printf("%s:%d,%d",$1,beg,int($2)+500);}'`

    echo -e -n "${R}\t${C}\t${P}\t" | sed 's/chr//'
    xmllint --xpath "/DASDNA/SEQUENCE/DNA/text()" "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=$S" |\
    tr -d "\n" | awk '{L1=1.0*length($0);c=$0;gsub(/[^GCgc]/,"",c);L2=1.0*length(c); printf("%f\n",100.0*(L2/L1));}'
done

explain: IFS is the internal field sepator (using the newline). Download and g-unzip the dbSNP138: cut the chrom,position,name.

extends the position +/ 500pb , use the UCSC das server to fetch the sequence. remove everything that is not GC to get a ratio length(all)/length(GC_only.)

Note: I'm away from the lab, I've used a DAS server to get the GC% but you'd better use a local resource like samtools faidx

$ bash  shell.sh
Name    Chr    Position    GC
rs144773400    1    10144    36.263736
rs201752861    1    10176    38.961039
rs201694901    1    10179    39.260739
rs143255646    1    10228    43.456543
rs200462216    1    10228    43.456543
rs200279319    1    10230    43.656344
rs145599635    1    10233    43.956044
rs148908337    1    10247    44.955045
rs199706086    1    10249    45.154845
rs140194106    1    10254    45.654346