Question

Interpreting Genomic Coordinates

1

Entering edit mode

12.3 years ago

Mal0 ▴ 40

Hi, I have a very basic question.

I want to map genomic coordinates to sequences, but I am not sure how to interpret the coordinates for the minus strand. As an example, using the UCSC Genome Browser I get

\>hg18_refGene_NM_000474 range=chr7:19121616-19122860 5'pad=0 3'pad=0 strand=- repeatMasking=none
CAGGCGGAGCCCCCCACCCCCTCAGCAGGGCCGGAGACCTAGATGTCATT
...

Now I downloaded chr7.fa of hg18. Since every line contains 50 bases, starting from line 2, I supposed to find the sequence on line 382434.

sed -n 382434p chr7.fa 
GACCAAACTCTAAGGTTCTCtaaattttttatatttatttattGCAGAAA

This does not match. I also could not find the reverse complement of the first sequence in chr7.fa using grep. Could anyone tell me where I go wrong?

Thanks.

ucsc sequence coordinates • 3.5k views

ADD COMMENT • link updated 12.3 years ago by Pierre Lindenbaum 161k • written 12.3 years ago by Mal0 ▴ 40

score 4 · Answer 1 · 2011-12-27

4

Entering edit mode

12.3 years ago

Pierre Lindenbaum 161k

The following command line shows that your sequence is the reverse complement of the fragment 19121616-19122860 of chr7.

$ curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr7.fa.gz" |\
gunzip -c | tail -n +2 | tr -d "\n" | cut -c 19121616-19122860 |\
tr "ATGCatgc" "01230123" | tr "0123" "TACG" |\
rev

CAGGCGGAGCCCCCCACCCCCTCAGCAGGGCCGGAGACCTAGGTAAGGACCGCGCCGCTGCACCCCTTCGCCTCTCAGGTGGCAGA
CGGCAGGCCGGCCAGGCCGCGGTTCCCAGTCCACCTCGATTTCCTCCCCTCTCCCACTCTCCGCTCAGCCTTCCCACCTCACTTGG
CACCGTTGCCTCGCGCCCCCAGCGTCCCCGGAAGGCCGGTCTGACCCCGCTAGGGAGAGCAGTCTCCAGGGGGATGCGCCCTGGTG
(...)GAGAA

ADD COMMENT • link 12.3 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

+1 for rev, a forgotten goodie.

ADD REPLY • link 12.3 years ago by Aaronquinlan 12k

0

Entering edit mode

thanks a lot, i see my mistake now.

ADD REPLY • link 12.3 years ago by Mal0 ▴ 40

score 3 · Answer 2 · 2011-12-27

if you count lines, then you have to have in mind 2 things: the first one is that the first line on a fasta file contains the sequence code, and the other one is that each line has an offset of 2x50 bases respect to the line counter. generally:

linefirstbaseposition = ( (linenumber - 2) x 50 ) + 1

in your example, the bases contained in line 382434 would start from position 19121601, which is exactly what you would get from UCSC if you query for chr7:19,121,601-19,121,650.

regarding the reverse strand, the .fa sequences that you can bulk download from UCSC are always forward stranded. if you need the complementary sequence I would suggest to build the same sed query plus a translation code to change As to Ts, Ts to As, Cs to Gs and Gs to Cs, being case sensitive if you want to maintain the original base case.