Acquiring protein sequences of each chromosome of Human, Chimp and Gorilla sorted by their appearance on genome
6.0 years ago
Soheil ▴ 70

Hi everyone,

I need to download protein sequences of each chromosome of Human, Chimp and Gorilla and I need them to be sorted according to their appearance on the genome.

I checked a few well-known databases such as Ensembl and NCBI. Neither of them provides what I want. I wonder if anyone has a suggestion.


PS: NCBI provides proteomes in a single FASTA file. Unfortunately, the chromosome number and the location of the protein on the chromosome are not specified in the header of each protein. For example, for gorilla it is something like this:

gi|426327303|ref|XP_004024460.1| PREDICTED: WAS protein family homolog 2-like [Gorilla gorilla gorilla]

Ensembl provides more information in the header. It is something like:

ENSGGOP00000020402.1 pep:known_by_projection chromosome:gorGor3.1:19:44522199:44523021:1 gene:ENSGGOG00000027893.1 transcript:ENSGGOT00000028158.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:INAFM1 description:InaF-motif containing 1 [Source:HGNC Symbol;Acc:HGNC:27406]

At first glance, this seemed to be the answer. The part ":19:44522199:44523021" indicates that this protein belongs to chromosome 19, starting at 44522199 and ending at 44523021. So I wrote a little script to extract all proteins that belong to a given chromosome and then sort them by start position. However, I noticed something that ruined the solution.
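For reference, the extract-and-sort step can be sketched roughly as follows in Python. This is a minimal sketch, not the original script: the names `parse_header` and `proteins_by_chromosome` are made up for illustration, and it assumes every header carries a `chromosome:<assembly>:<name>:<start>:<end>:<strand>` field like the gorilla example above.

```python
import re
from collections import defaultdict

# Location field of an Ensembl peptide header, e.g.
# "chromosome:gorGor3.1:19:44522199:44523021:1"
# (assumption: the field always has this six-part layout)
LOCATION = re.compile(r"\bchromosome:([^:\s]+):([^:\s]+):(\d+):(\d+):(-?1)\b")


def parse_header(header):
    """Return (protein_id, chromosome, start, end, strand), or None if no location field."""
    match = LOCATION.search(header)
    if match is None:
        return None
    _assembly, chrom, start, end, strand = match.groups()
    protein_id = header.lstrip(">").split()[0]
    return protein_id, chrom, int(start), int(end), int(strand)


def proteins_by_chromosome(headers):
    """Group protein IDs by chromosome name, ordered by start position."""
    groups = defaultdict(list)
    for header in headers:
        parsed = parse_header(header)
        if parsed is None:
            continue  # header without a chromosome location; skipped here
        protein_id, chrom, start, _end, _strand = parsed
        groups[chrom].append((start, protein_id))
    return {chrom: [pid for _, pid in sorted(items)] for chrom, items in groups.items()}
```

The headers would come from the lines starting with `>` in the downloaded FASTA file. Note that the chromosome name is kept as a string, so any unexpected name in that field ends up as its own group.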

For some human proteins the header is something like:

ENSP00000487931.1 pep:known chromosome:GRCh38:CHR_HSCHR19_2_CTG3_1:34365020:34377596:1 gene:ENSG00000282019.1 transcript:ENST00000632809.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:GPI description:glucose-6-phosphate isomerase [Source:HGNC Symbol;Acc:HGNC:4458]

As you can see, it says this protein belongs to chromosome "CHR_HSCHR19_2_CTG3_1", which is not clear to me. This makes it very difficult to extract the proteins for each chromosome. What does it mean? Homo sapiens chromosome 19, contig 3? What do _2_ and _1 stand for?

protein sequence chromosome • 1.8k views
6.0 years ago
Denise CS ★ 5.2k

The name "CHR_HSCHR19_2_CTG3_1:34365020:34377596:1", rather than "19:34365020-34377596", tells us that the protein is located on one of the alternative sequences of the human genome, also known as haplotypes and patches (check this tutorial on YouTube). Haplotypes are determined by the Genome Reference Consortium (GRC). If you are interested in the primary assembly only, just ignore everything containing 'weird' names such as HSCHR, etc. So, do not worry too much about what _2_ and _1 mean (although you are right that CHR_HSCHR19_2_CTG3 is the contig name according to the GRC). You will find many cases like that, as Ensembl gives you the annotation on all known alternative sequences (check What haplotypes and assembly patches can I see for human?) in addition to the primary assembly. @DevonRyan has pointed us to his rather useful ChromosomeMappings between the Ensembl chromosome names (in the primary and alternative assemblies) and the UCSC names. This can help you disentangle the nomenclature issue. You can also view the comparison between the two regions (on the primary assembly and on haplotype CHR_HSCHR19_2_CTG3) in the Ensembl Browser.
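As a rough illustration of the "ignore the weird names" advice, here is a small Python sketch that drops headers located on alternative sequences. The function names are hypothetical, and the rule for spotting an alternative sequence (name starts with `CHR_` or contains `HSCHR`) is an assumption based only on the example header in the question; other naming patterns may exist in the file, so check the set of names you actually get before relying on it.

```python
def region_name(header):
    """Return the sequence-region name from an Ensembl peptide header, or None."""
    for field in header.split():
        parts = field.split(":")
        # Expected layout: chromosome:<assembly>:<name>:<start>:<end>:<strand>
        if parts[0] == "chromosome" and len(parts) >= 6:
            return parts[2]
    return None


def is_alternate(name):
    """Heuristic (assumption): alternative sequences look like CHR_HSCHR19_2_CTG3_1."""
    return name.startswith("CHR_") or "HSCHR" in name


def primary_only(headers):
    """Keep only headers whose location is not on an alternative sequence."""
    kept = []
    for header in headers:
        name = region_name(header)
        if name is not None and not is_alternate(name):
            kept.append(header)
    return kept
```

Filtering like this before grouping by chromosome leaves only names such as 19 or X, at the cost of discarding proteins annotated solely on haplotypes or patches.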

