Interpreting Genomic Coordinates
2
1
Entering edit mode
9.3 years ago
Mal0 ▴ 40

Hi, I have a very basic question.

I want to map genomic coordinates to sequences, but I am not sure how to interpret the coordinates for the minus strand. As an example, using the UCSC Genome Browser I get

\>hg18_refGene_NM_000474 range=chr7:19121616-19122860 5'pad=0 3'pad=0 strand=- repeatMasking=none
CAGGCGGAGCCCCCCACCCCCTCAGCAGGGCCGGAGACCTAGATGTCATT
...


Now I downloaded chr7.fa of hg18. Since every line contains 50 bases, starting from line 2, I supposed to find the sequence on line 382434.

sed -n 382434p chr7.fa
GACCAAACTCTAAGGTTCTCtaaattttttatatttatttattGCAGAAA


This does not match. I also could not find the reverse complement of the first sequence in chr7.fa using grep. Could anyone tell me where I go wrong?

Thanks.

ucsc sequence coordinates • 2.8k views
4
Entering edit mode
9.3 years ago

The following command line shows that your sequence is the reverse complement of the fragment 19121616-19122860 of chr7.

$curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr7.fa.gz" |\ gunzip -c | tail -n +2 | tr -d "\n" | cut -c 19121616-19122860 |\ tr "ATGCatgc" "01230123" | tr "0123" "TACG" |\ rev CAGGCGGAGCCCCCCACCCCCTCAGCAGGGCCGGAGACCTAGGTAAGGACCGCGCCGCTGCACCCCTTCGCCTCTCAGGTGGCAGA CGGCAGGCCGGCCAGGCCGCGGTTCCCAGTCCACCTCGATTTCCTCCCCTCTCCCACTCTCCGCTCAGCCTTCCCACCTCACTTGG CACCGTTGCCTCGCGCCCCCAGCGTCCCCGGAAGGCCGGTCTGACCCCGCTAGGGAGAGCAGTCTCCAGGGGGATGCGCCCTGGTG (...)GAGAA  ADD COMMENT 1 Entering edit mode +1 for rev, a forgotten goodie. ADD REPLY 0 Entering edit mode thanks a lot, i see my mistake now. ADD REPLY 3 Entering edit mode 9.3 years ago if you count lines, then you have to have in mind 2 things: the first one is that the first line on a fasta file contains the sequence code, and the other one is that each line has an offset of 2x50 bases respect to the line counter. generally: linefirstbaseposition = ( (linenumber - 2) x 50 ) + 1  in your example, the bases contained in line 382434 would start from position 19121601, which is exactly what you would get from UCSC if you query for chr7:19,121,601-19,121,650. regarding the reverse strand, the .fa sequences that you can bulk download from UCSC are always forward stranded. if you need the complementary sequence I would suggest to build the same sed query plus a translation code to change As to Ts, Ts to As, Cs to Gs and Gs to Cs, being case sensitive if you want to maintain the original base case. ADD COMMENT 0 Entering edit mode This a better way to do it. I wrote it in a little complicated manner to get the correct sequence and then$seq =~ tr/ACGT/TGCA/;$seq = reverse($seq); thanks