Question: Getting all Ensembl protein lengths
Natasha Latysheva (United Kingdom) wrote, 3.7 years ago:

Hi all,

I have a very simple but massively frustrating problem. I am interested in just getting all of the lengths (number of amino acid residues) associated with Ensembl proteins.

The BioMart interface to Ensembl has plenty of information, but not protein lengths. The biomaRt package for R looks great, but doesn't have protein lengths. This is madness!

Things I've tried:
- Converting a list of all ENSPs to UniProt IDs and getting the protein lengths from UniProt. This doesn't work well since a single UniProt accession can correspond to several ENSPs, and I'm interested in having all the isoform information.
- Getting the CDS lengths out of BioMart, and noting that they're about 3 times the protein lengths. This isn't precise enough, though.
- Getting ENSP sequences out of BioMart with the aim of counting the amino acids to get the length. This is possible to script so I'm happy to do this if there are no other options.
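
For the CDS route in the second bullet, a minimal sketch of the arithmetic, under the assumption that the CDS length counts coding nucleotides including the stop codon and that the CDS is complete. The function name `protein_length_from_cds` and the input values are made up for illustration. Lengths that are not a multiple of 3 (for example, incomplete CDSs) are returned as NA rather than guessed, which is one reason the sequence-counting route may still be the safer one.

```r
# Hypothetical helper: protein length from CDS length.
# Assumption: cds_length includes the 3 nucleotides of the stop codon.
protein_length_from_cds <- function(cds_length) {
  complete <- cds_length %% 3 == 0          # whole number of codons?
  ifelse(complete, cds_length %/% 3 - 1, NA_integer_)
}

# Made-up CDS lengths: one complete, one not a multiple of 3
protein_length_from_cds(c(1203, 301))
```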

I'm sure others have encountered the same issue. What do you all think?

---- Edit: still haven't found a direct source of protein length info, but here's some code that calculates the lengths from a text file of ENSP accessions:

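# biocLite() has been retired; current Bioconductor releases install via BiocManager
# install.packages("BiocManager")
# BiocManager::install("biomaRt")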
library(biomaRt)

# load ensp table
ensp <- read.table("~/Projects/chimera_project/analyses/pfam_smart_domain_mapping/all_ensp.txt", header=TRUE)
ensp <- ensp$ensembl_protein_id
head(ensp); length(ensp)

# set up mart/dataset
ensembl.human = useMart(biomart="ensembl", dataset = "hsapiens_gene_ensembl")

# see how long it takes to fetch sequences of ensps in list
start_time <- proc.time()
human.prot = getSequence(id=head(ensp, 500), mart=ensembl.human, seqType=c("peptide"), type="ensembl_peptide_id")
proc.time() - start_time

# what have we got here
head(human.prot)
head(human.prot[1])
human.prot[[2]]

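# getSequence already returns a data frame; keep the first two columns as before
result <- human.prot[, 1:2]
# strip a trailing "*" (stop) if present, instead of always subtracting 1,
# so sequences without the "*" are not undercounted
result$length <- nchar(sub("\\*$", "", result$peptide))
head(result)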

# write out
write.table(x=result, sep='\t', file="~/Projects/chimera_project/analyses/pfam_smart_domain_mapping/ensp_sequences_with_length.txt")
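
The timing test above only fetches the first 500 IDs. A rough sketch of how the full list could be processed in batches follows. The helper `fetch_lengths_in_batches` and the stub `toy_fetch` are hypothetical names introduced here, and the batch size of 500 is just the figure from the timing test, not a documented limit. The fetch function is passed in as an argument so the batching logic can be tried without a network connection.

```r
# Hypothetical helper: fetch sequences batch by batch and keep only ID + length.
# fetch_fun must return a data frame with columns "peptide" and "ensembl_peptide_id".
fetch_lengths_in_batches <- function(ids, fetch_fun, batch_size = 500) {
  batches <- split(ids, ceiling(seq_along(ids) / batch_size))
  pieces <- lapply(batches, function(batch) {
    seqs <- fetch_fun(batch)
    data.frame(
      ensembl_peptide_id = seqs$ensembl_peptide_id,
      # strip a trailing "*" (stop) before counting residues
      length = nchar(sub("\\*$", "", seqs$peptide)),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Stub standing in for biomaRt, with made-up IDs and sequences
toy_fetch <- function(batch) {
  data.frame(peptide = paste0(strrep("M", nchar(batch)), "*"),
             ensembl_peptide_id = batch,
             stringsAsFactors = FALSE)
}
fetch_lengths_in_batches(c("A", "BB", "CCC"), toy_fetch, batch_size = 2)
```

With biomaRt, the stub would presumably be replaced by something like `function(batch) getSequence(id = batch, type = "ensembl_peptide_id", seqType = "peptide", mart = ensembl.human)`.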

R protein biomart ensembl • 2.2k views
Nicolas Rosewick (Brussels, Belgium) wrote, 3.7 years ago:

As you say: get the protein sequences in FASTA format, read them into R, and compute the length of each sequence with a mix of sapply and nchar. Something like: protLength <- sapply(protSeq, nchar)
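
A minimal base-R sketch of that idea, assuming a plain FASTA file in which each record starts with a ">" header line and the sequence may wrap over several lines. The function name `fasta_lengths` and the toy records are made up for illustration; a "*" at the end of a sequence is not counted as a residue.

```r
# Hypothetical helper: named vector of sequence lengths from a FASTA file.
fasta_lengths <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]                 # drop empty lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)                   # record index for every line
  ids <- sub("^>(\\S+).*", "\\1", lines[is_header])
  residues <- gsub("[*[:space:]]", "", lines[!is_header])
  lens <- tapply(nchar(residues),
                 factor(record[!is_header], levels = seq_along(ids)),
                 sum)
  lens[is.na(lens)] <- 0                        # headers with no sequence lines
  setNames(as.integer(lens), ids)
}

# Toy FASTA with made-up records
toy <- tempfile(fileext = ".fa")
writeLines(c(">ENSP_TOY_1 made-up record", "MKTAYIAK", "QRQISFVK*",
             ">ENSP_TOY_2 made-up record", "MSHF"), toy)
protLength <- fasta_lengths(toy)
protLength
```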

 


Thanks :) Length info should really be part of these repositories, but you're right that one can just fairly quickly compute it themselves.

(reply written 3.7 years ago by Natasha Latysheva)
Shicheng Guo wrote, 15 days ago:

Step by step:

  1. Download all the protein sequences from UniProt.

  2. Use Perl code (below) to calculate the protein length for each gene. The gene symbol is stored in the GN=Symbol field of each FASTA header.

  3. You can find the protein lengths for all genes in this link, which I finished in 2019.

  4. You can also download them from my cloud disk: https://pan.baidu.com/s/1OyDdaqMuybu3ipnhpIGSWA (password: wg96)

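    use strict;
    use warnings;

    # Print "gene symbol <tab> protein length" for each record of the UniProt FASTA file.
    open(my $fh, "<", "uniprot-proteome_UP000005640.fasta") or die "Cannot open FASTA: $!";
    my ($symbol, $len);
    while (my $line = <$fh>) {
        $line =~ s/\s+$//;                                # strip newline so it is not counted
        if ($line =~ /^>/) {
            print "$symbol\t$len\n" if defined $symbol;   # flush the previous record
            ($symbol) = $line =~ /GN=(\S+)/;              # \S+ keeps hyphenated symbols; undef if no GN=
            $len = 0;
        } else {
            $len += length $line;
        }
    }
    print "$symbol\t$len\n" if defined $symbol;           # flush the last record
    close $fh;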
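
For readers staying in R, a rough equivalent of the Perl approach above, assuming UniProt-style headers that carry the gene symbol in a GN= field. The function name `uniprot_gene_lengths` and the toy records are made up for illustration. Records without a GN= field get NA as their symbol instead of being dropped.

```r
# Hypothetical helper: gene symbol and protein length per FASTA record.
uniprot_gene_lengths <- function(lines) {
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  headers <- lines[is_header]
  has_gn <- grepl("GN=", headers, fixed = TRUE)
  symbol <- ifelse(has_gn, sub(".*GN=(\\S+).*", "\\1", headers), NA_character_)
  record <- cumsum(is_header)[!is_header]       # record index of each sequence line
  len <- tapply(nchar(trimws(lines[!is_header])),
                factor(record, levels = seq_along(headers)),
                sum)
  data.frame(symbol = symbol, length = as.integer(len), stringsAsFactors = FALSE)
}

# Toy input with made-up records; a real run would use
# readLines("uniprot-proteome_UP000005640.fasta") instead
toy_lines <- c(
  ">sp|X00001|TOYA_HUMAN Made-up protein OS=Homo sapiens OX=9606 GN=TOY-A PE=1 SV=1",
  "MKTAYIAKQR", "QISF",
  ">sp|X00002|TOYB_HUMAN Made-up protein OS=Homo sapiens OX=9606 PE=1 SV=1",
  "MSHFG")
uniprot_gene_lengths(toy_lines)
```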
    