Obtaining All CDS Sequences (i.e. All Spliced Exon Variants) From UCSC
7.7 years ago
Max ▴ 140

To estimate dN/dS for various genes, I need the entire coding sequence of each one. I have been working with the list of CDS exon sequences provided by the UCSC Table Browser for the human reference genome. The problem is that when I concatenate them into a single sequence for PAML, HyPhy, etc., I have to deal with the fact that each exon may start in a different reading frame.
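To make the concatenation step concrete, here is a minimal shell sketch. It assumes the exon sequences have already been trimmed to their coding portions, are given one per line in genomic order, and are written on the forward strand; the function name `join_cds` is made up for illustration and is not part of any UCSC tool. Under those assumptions, joining the pieces (and reverse-complementing for minus-strand genes) should give one sequence that reads in a single frame, even though individual exons begin in different phases.

```shell
# Join CDS exon sequences (one per line on stdin, genomic order, forward
# strand) into a single coding sequence printed on one line.
# Hypothetical helper: pass "-" as the first argument for minus-strand genes.
join_cds() (
  strand=${1:-+}
  # Concatenate all lines and normalise to upper case
  cds=$(tr -d '\n' | tr 'acgtn' 'ACGTN')
  if [ "$strand" = "-" ]; then
    # Reverse-complement so the result reads 5'->3' in transcript orientation
    cds=$(printf '%s\n' "$cds" | rev | tr 'ACGT' 'TGCA')
  fi
  printf '%s\n' "$cds"
)
```

For example, `printf 'ATGGC\nCAAAT\nAA\n' | join_cds +` joins three pieces that individually break codons into one 12-base sequence. Checking that the length of the result is a multiple of 3 is a cheap sanity test before passing it to PAML or HyPhy.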

Therefore, I need to know whether there is an efficient way to extract the entire CDS with the exons already concatenated, so that the whole sequence sits in a single reading frame (starting at codon position 0, with a length that is a multiple of 3). I don't see a CDS option as such listed, although the tables provide cdsStart/cdsEnd coordinates. In other words, I need a complete CDS for every alternative splicing of exons, so that each CDS can be "read" from start to end in a single frame.
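Since the tables do give CDS start/end coordinates, one way to picture the extraction is to clip each exon to that interval. The sketch below is only an illustration: it assumes tab-separated input with the first ten knownGene columns (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds), and the function name `cds_blocks` is hypothetical. It prints the coding part of each exon as a BED line (0-based, half-open), which could then be passed to whatever sequence-extraction tool is at hand.

```shell
# Clip every exon of a knownGene-style record to [cdsStart, cdsEnd) and print
# the coding blocks as BED lines: chrom, start, end, name, score, strand.
# Hypothetical helper; reads tab-separated records on stdin.
cds_blocks() {
  awk -F '\t' 'BEGIN { OFS = "\t" }
  ($6 + 0) < ($7 + 0) {              # skip non-coding records (cdsStart == cdsEnd)
    n = split($9, s, ",")
    split($10, e, ",")
    for (i = 1; i <= n; i++) {
      if (s[i] == "") continue       # the trailing comma yields an empty element
      a = s[i] + 0; b = e[i] + 0
      if (a < $6 + 0) a = $6 + 0     # clip exon start to cdsStart
      if (b > $7 + 0) b = $7 + 0     # clip exon end to cdsEnd
      if (a < b) print $2, a, b, $1, 0, $3
    }
  }'
}
```

A made-up record with exons 100-200, 300-400 and 500-600 and a CDS of 150-551 gives blocks of 50, 100 and 51 bases, 201 in total. That is a multiple of 3, which is what an intact CDS should give once the blocks are joined in transcript order.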

I seem to remember that UCSC could return the complete CDS for each alternative splicing, as well as the plain list of exons, but I don't see this option listed. The closest I've been able to find is restricting the returned exons to those that appear in the coding sequences.

cds ucsc • 2.6k views
7.7 years ago

I'm not sure whether UCSC will allow you to curl all the mRNAs, but the following script seems to work:

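# List every transcript ID in the hg19 knownGene table (column 1, tab-separated)
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" |\
gunzip -c |\
cut -f 1 |\
while read -r F
do
# Fetch the mRNA sequence page for this transcript and strip the <PRE>/<TT> tags
curl -s "http://genome.ucsc.edu/cgi-bin/hgGene?hgg_do_getMrnaSeq=1&hgg_gene=${F}&db=hg19" |\
sed 's%<[/]*\(PRE\|TT\)>%%g'
done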


Is the mRNA on UCSC the primary transcript, or the CDS after intron splicing?