Interpreting Genomic Coordinates
2
1
Entering edit mode
12.3 years ago
Mal0 ▴ 40

Hi, I have a very basic question.

I want to map genomic coordinates to sequences, but I am not sure how to interpret the coordinates for the minus strand. As an example, using the UCSC Genome Browser I get

\>hg18_refGene_NM_000474 range=chr7:19121616-19122860 5'pad=0 3'pad=0 strand=- repeatMasking=none
CAGGCGGAGCCCCCCACCCCCTCAGCAGGGCCGGAGACCTAGATGTCATT
...

Now I downloaded chr7.fa of hg18. Since every line contains 50 bases, starting from line 2, I supposed to find the sequence on line 382434.

sed -n 382434p chr7.fa 
GACCAAACTCTAAGGTTCTCtaaattttttatatttatttattGCAGAAA

This does not match. I also could not find the reverse complement of the first sequence in chr7.fa using grep. Could anyone tell me where I go wrong?

Thanks.

ucsc sequence coordinates • 3.5k views
ADD COMMENT
4
Entering edit mode
12.3 years ago

The following command line shows that your sequence is the reverse complement of the fragment 19121616-19122860 of chr7.

$ curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr7.fa.gz" |\
gunzip -c | tail -n +2 | tr -d "\n" | cut -c 19121616-19122860 |\
tr "ATGCatgc" "01230123" | tr "0123" "TACG" |\
rev

CAGGCGGAGCCCCCCACCCCCTCAGCAGGGCCGGAGACCTAGGTAAGGACCGCGCCGCTGCACCCCTTCGCCTCTCAGGTGGCAGA
CGGCAGGCCGGCCAGGCCGCGGTTCCCAGTCCACCTCGATTTCCTCCCCTCTCCCACTCTCCGCTCAGCCTTCCCACCTCACTTGG
CACCGTTGCCTCGCGCCCCCAGCGTCCCCGGAAGGCCGGTCTGACCCCGCTAGGGAGAGCAGTCTCCAGGGGGATGCGCCCTGGTG
(...)GAGAA
ADD COMMENT
1
Entering edit mode

+1 for rev, a forgotten goodie.

ADD REPLY
0
Entering edit mode

thanks a lot, i see my mistake now.

ADD REPLY
3
Entering edit mode
12.3 years ago

if you count lines, then you have to have in mind 2 things: the first one is that the first line on a fasta file contains the sequence code, and the other one is that each line has an offset of 2x50 bases respect to the line counter. generally:

linefirstbaseposition = ( (linenumber - 2) x 50 ) + 1

in your example, the bases contained in line 382434 would start from position 19121601, which is exactly what you would get from UCSC if you query for chr7:19,121,601-19,121,650.

regarding the reverse strand, the .fa sequences that you can bulk download from UCSC are always forward stranded. if you need the complementary sequence I would suggest to build the same sed query plus a translation code to change As to Ts, Ts to As, Cs to Gs and Gs to Cs, being case sensitive if you want to maintain the original base case.

ADD COMMENT
0
Entering edit mode

This a better way to do it. I wrote it in a little complicated manner to get the correct sequence and then $seq =~ tr/ACGT/TGCA/;$seq = reverse($seq); thanks

ADD REPLY

Login before adding your answer.

Traffic: 1577 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6