I need to download the protein sequences for each chromosome of human, chimp and gorilla, and I need them sorted by their position on the genome.
I checked a few well-known databases such as Ensembl and NCBI. Neither of them provides exactly what I want. I wonder if anyone has a suggestion.
PS: NCBI provides each proteome as a single FASTA file. Unfortunately, the chromosome number and the location of the protein on the chromosome are not specified in the header of each protein. For example, for gorilla a header looks like this:
gi|426327303|ref|XP_004024460.1| PREDICTED: WAS protein family homolog 2-like [Gorilla gorilla gorilla]
Ensembl provides more information in the header. It looks like this:
ENSGGOP00000020402.1 pep:known_by_projection chromosome:gorGor3.1:19:44522199:44523021:1 gene:ENSGGOG00000027893.1 transcript:ENSGGOT00000028158.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:INAFM1 description:InaF-motif containing 1 [Source:HGNC Symbol;Acc:HGNC:27406]
At first glance, this seemed to be the answer. The part ":19:44522199:44523021" says that this protein belongs to chromosome 19, starting at position 44522199 and ending at 44523021. So I wrote a little script to extract all proteins that belong to a given chromosome and then sort them by start position. However, I noticed something that breaks this approach.
For some human proteins, the header looks like this:
ENSP00000487931.1 pep:known chromosome:GRCh38:CHR_HSCHR19_2_CTG3_1:34365020:34377596:1 gene:ENSG00000282019.1 transcript:ENST00000632809.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:GPI description:glucose-6-phosphate isomerase [Source:HGNC Symbol;Acc:HGNC:4458]
As you can see, it says this protein belongs to chromosome "CHR_HSCHR19_2_CTG3_1", which is not clear to me. Names like this make it very difficult to extract the proteins for each chromosome. What does it mean? Homo sapiens chromosome 19, contig 3? What do _2_ and _1 stand for?
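For reference, here is a minimal sketch of the extract-and-sort step described above. It assumes that the location is always the third whitespace-separated field of the Ensembl header and that it always has six colon-separated parts, as in the two examples shown. The function names parse_header and group_by_region are made up for illustration. Proteins are grouped under whatever region name appears in the header, so names like "CHR_HSCHR19_2_CTG3_1" end up in their own group rather than under chromosome 19.

```python
from collections import defaultdict


def parse_header(header):
    """Return (protein_id, region, start, end) from an Ensembl peptide header.

    Assumption: the third whitespace-separated field looks like
    "chromosome:gorGor3.1:19:44522199:44523021:1".
    """
    fields = header.lstrip(">").split()
    protein_id = fields[0]
    # coord system, assembly, region name, start, end, strand
    _coord, _assembly, region, start, end, _strand = fields[2].split(":")
    return protein_id, region, int(start), int(end)


def group_by_region(headers):
    """Group proteins by region name and sort each group by start position."""
    groups = defaultdict(list)
    for header in headers:
        protein_id, region, start, end = parse_header(header)
        groups[region].append((start, end, protein_id))
    # Tuples sort by start first, then end, then protein ID.
    return {region: sorted(entries) for region, entries in groups.items()}


if __name__ == "__main__":
    examples = [
        "ENSP00000487931.1 pep:known "
        "chromosome:GRCh38:CHR_HSCHR19_2_CTG3_1:34365020:34377596:1 "
        "gene:ENSG00000282019.1",
        "ENSGGOP00000020402.1 pep:known_by_projection "
        "chromosome:gorGor3.1:19:44522199:44523021:1 "
        "gene:ENSGGOG00000027893.1",
    ]
    for region, entries in group_by_region(examples).items():
        print(region, entries)
```

In a real run the headers would come from the lines of the FASTA file that start with ">", and the sequences would need to be carried along with each entry.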