IUPAC Bases in the hg19 1000G Reference?
11.2 years ago
Gabriel R. ★ 2.9k

If you download the reference from 1000g :

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz

Why is there a double "RR" in chr 3 ?

zcat human_g1k_v37.fasta.gz.1 | grep -n R

1:>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
4154180:>2 dna:chromosome chromosome:GRCh37:2:1:243199373:1
8207504:>3 dna:chromosome chromosome:GRCh37:3:1:198022430:1
9221351:CCRRGCTTGGTTCTAACAATGAATTTAATAAGAATTGTATTTAATCAATGTTTAAATATA

Any more surprises in those files ?
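Note that `grep -n R` also matches the header lines, because `GRCh37` contains an "R". A minimal sketch for looking for other ambiguity codes, assuming a standard FASTA layout where headers start with `>`; the function name `scan_iupac` is made up for illustration:

```shell
# Print "<line number>:<sequence line>" for every FASTA sequence line that
# contains a character other than A, C, G, T or N (upper or lower case).
# Header lines (starting with ">") are skipped but still counted, so the
# reported line numbers agree with those from `grep -n`.
scan_iupac() {
  awk '!/^>/ && /[^ACGTNacgtn]/ { print NR ":" $0 }' "$@"
}

# Usage (file name as in the question):
# zcat human_g1k_v37.fasta.gz | scan_iupac
```

With no file argument the function reads standard input, so it can sit at the end of the same `zcat` pipeline.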

1000genomes human reference • 3.7k views

'R' is an IUPAC ambiguity code: it means A or G.
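For reference, the full set of IUPAC nucleotide ambiguity codes can be written as a small lookup. This is only an illustrative sketch; the function name `iupac_expand` is hypothetical:

```shell
# Map an IUPAC nucleotide code to the bases it can stand for.
iupac_expand() {
  case "$1" in
    A|C|G|T) echo "$1" ;;
    R) echo "A G" ;;      # puRine
    Y) echo "C T" ;;      # pYrimidine
    S) echo "C G" ;;      # Strong pairing
    W) echo "A T" ;;      # Weak pairing
    K) echo "G T" ;;      # Keto
    M) echo "A C" ;;      # aMino
    B) echo "C G T" ;;    # not A
    D) echo "A G T" ;;    # not C
    H) echo "A C T" ;;    # not G
    V) echo "A C G" ;;    # not T
    N) echo "A C G T" ;;  # any base
    *) echo "unknown code: $1" >&2; return 1 ;;
  esac
}

iupac_expand R
```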


Is it possible that this is a diploid genome and ambiguity codes are being used to represent sites where the paternal and maternal chromosomes differ? I haven't looked too much at the 1000g data.

11.2 years ago
deanna.church ★ 1.1k

Remember, the human genome was assembled by first assembling clones. The clones are then assembled to make the chromosome sequences. This ambiguity must be in the clone used for that part of the genome. In some cases, the group that assembled the clones annotated regions that were difficult or ambiguous. The GRC has attempted to map these up to the top level assembly sequences. http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/


yes ... but why would that position be unique in the entire genome ?


Hi Samuel, I'm not sure I understand your question. Can you elaborate?


There must be tons of positions like you described in the assembly right ? Why leave a single one out ? What is so special about that position and unspecial about the others ?


Oh, I see. Well, there is nothing really special about this location. This is really just a by-product of the Human Genome Project (HGP). The clones used to assemble the human genome were sequenced in many labs around the world. Each lab had its own standards and protocols for clone sequencing. Over the course of the project, there were efforts to standardize things, but as with any large project this took time, and the adoption of standards varied between labs. The Genome Reference Consortium (GRC) has taken over the human reference assembly (http://genomereference.org), and one thing they have tried to do is 'productize' the assembly; they published this paper: http://www.ncbi.nlm.nih.gov/pubmed/21750661. If you think these issues should be addressed in the assembly, use the 'Report an Issue' tab on the GRC site to communicate with the group. And: these have been fixed; patches have been released and the bases will be corrected in GRCh38: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?ID=HG-1091

11.2 years ago

The two 'R' bases overlap the following SNPs:

rs62249333 chr3:60830764-60830764
rs62249332 chr3:60830763-60830763
rs71616828 chr3:60830763-60830764 (Observed: AA/RR    )

Here, the sequencing center chose to put the variation in the reference genome.
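The line number reported by `grep -n` in the question can be converted to a chromosome coordinate to check that the two 'R' bases really sit at these SNP positions. A sketch, assuming the FASTA wraps at 60 bases per line (the hit line shown in the question is 60 characters long) and that every earlier line of chromosome 3 is full width:

```shell
# Values taken from the grep -n output in the question.
header_line=8207504   # line of the ">3 ..." header
hit_line=9221351      # line containing "CCRRGC..."
width=60              # assumption: 60 bases per sequence line
seq_line="CCRRGCTTGGTTCTAACAATGAATTTAATAAGAATTGTATTTAATCAATGTTTAAATATA"

# 1-based coordinate of the first base on the hit line.
line_start=$(( (hit_line - header_line - 1) * width + 1 ))

# 1-based offset of the first R within that line.
prefix=${seq_line%%R*}
offset=$(( ${#prefix} + 1 ))

# Coordinates of the two adjacent R bases.
first_r=$(( line_start + offset - 1 ))
echo "3:${first_r}-$(( first_r + 1 ))"
# prints 3:60830763-60830764
```

The hit line starts at position 60830761, so the 'RR' at offsets 3 and 4 falls on 60830763 and 60830764, matching rs62249332 and rs62249333.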


Pierre, thanks for the post, although I feel it's more a comment than an answer per se. There are tons of variations in the human genome, so why would this variation be worthy of having its own base in the reference? In my mind, either you change all the bases pertaining to a variation (given some cutoff) or you leave everything as is.


Yes, I see your point now


it's not even coding, it's in an intron :-)
