Question

Download customized list of human protein sequence

15 months ago

Angelina_G ▴ 10

Hello, I have a list of ~1000 human gene names (currently in CSV format) and I would like to download their protein amino acid sequences in FASTA format, in one single file. I would like all protein isoforms of each gene to be included in my FASTA file.

I know that NCBI has the sequence data but I'm not sure how to download them in one go?

Thank you!

Edit: It seems UniProt could do the job, but it requires protein IDs (PIDs) rather than gene names, and I'm not sure how to retrieve PIDs in bulk by gene name. Currently trying: download the full protein sequence file from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz, then keep the entries with OS=Homo sapiens and a GN= value matching a gene name in my Excel sheet. Is there a faster way?
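A rough sketch of that filtering step, in case it is useful. It assumes the usual UniProt FASTA header layout (`... OS=Homo sapiens OX=9606 GN=TP53 ...`), a plain text file with one gene symbol per line, and an already decompressed FASTA file. The function name `filter_human_by_gene` is made up for this example. Note that uniprot_sprot.fasta holds only the canonical sequence per entry; the isoforms are shipped separately as uniprot_sprot_varsplic.fasta.gz in the same directory, and the same filter should work on that file too.

```shell
# filter_human_by_gene GENES FASTA   (hypothetical helper name)
# GENES: one gene symbol per line. FASTA: uncompressed UniProt FASTA file.
# Prints every human record whose GN= value is in GENES (case-insensitive).
filter_human_by_gene() {
    awk '
        # First file: remember each gene symbol, upper-cased; drop CR from Excel exports.
        NR == FNR { sub(/\r$/, ""); if ($0 != "") want[toupper($0)] = 1; next }
        # Header line: decide whether this record is kept.
        /^>/ {
            keep = 0
            if ($0 ~ / OS=Homo sapiens OX=/ && match($0, / GN=[^ ]+/)) {
                gene = toupper(substr($0, RSTART + 4, RLENGTH - 4))
                keep = (gene in want)
            }
        }
        # Print header and sequence lines of kept records.
        keep
    ' "$1" "$2"
}

# Example use (file names are placeholders):
#   gunzip -c uniprot_sprot.fasta.gz > uniprot_sprot.fasta
#   filter_human_by_gene genes.txt uniprot_sprot.fasta > my_proteins.fasta
```

Records without a GN= field are skipped, so it is worth comparing the number of genes you got back against your input list.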

ncbi sequence protein • 1.2k views

Updated 15 months ago by MirianT_NCBI ▴ 720 • written 15 months ago by Angelina_G ▴ 10

score 0 · Answer 1 · 2023-01-31

15 months ago

barslmn ★ 2.1k

Here are some results from the search:

score 0 · Answer 2 · 2023-01-31

You can upload your list of gene names to UniProt's IDmapping/Batch retrieval service at https://www.uniprot.org/id-mapping

There is an option to map from Gene names to UniProtKB. It is highly recommended to specify the organism information (human in your case) to make your results specific. Once you have your results, you can download them in various formats including FASTA.

Please don't hesitate to contact the UniProt helpdesk if you have any other questions.

score 0 · Answer 3 · 2023-01-31

15 months ago

GenoMax 141k

Since you asked about NCBI: you should be able to do this with the NCBI Datasets tool. You can use the web page I linked, or there is a command-line version available.

BioMart at Ensembl (or its R package, biomaRt) should be able to do this as well.

score 0 · Answer 4 · 2023-01-31

Hi! As GenoMax mentioned, you can use NCBI Datasets.

Here's how to do that starting from a file with a list of human genes (genes.txt), one per line:

tp53
brca1
mc1r
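If your symbols are still sitting in a CSV, something like the sketch below could produce that one-per-line file. It assumes a simple CSV with a header row and no commas inside quoted fields. The helper name `csv_to_gene_list` and the file name genes.csv are placeholders for this example.

```shell
# csv_to_gene_list CSV [COLUMN]   (hypothetical helper name)
# Prints the chosen column (default: 1) without the header row,
# stripped of quotes, CRs and surrounding blanks, with duplicates removed.
csv_to_gene_list() {
    csv=$1
    col=${2:-1}
    tail -n +2 "$csv" \
        | cut -d, -f"$col" \
        | tr -d '\r"' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | awk 'NF && !seen[$0]++'
}

# Example use:
#   csv_to_gene_list genes.csv 1 > genes.txt
```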

You can use the following command to download all proteins as a single file:

datasets download gene symbol --inputfile genes.txt --include protein --taxon human --filename human-proteins.zip

unzip human-proteins.zip -d human-proteins
Archive:  human-proteins.zip
  inflating: human-proteins/README.md  
  inflating: human-proteins/ncbi_dataset/data/protein.faa  
  inflating: human-proteins/ncbi_dataset/data/data_report.jsonl  
  inflating: human-proteins/ncbi_dataset/data/dataset_catalog.json
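A quick sanity check on the unpacked FASTA can be sketched as follows. The first helper just counts header lines. The second one assumes that each header carries a `[GeneID=...]` tag, which is how the protein.faa headers looked when I last checked but is worth verifying on your own file. Both function names are made up for this example.

```shell
# count_fasta_records FASTA   (hypothetical helper name)
# Prints the number of sequences, i.e. the number of ">" header lines.
count_fasta_records() {
    grep -c '^>' "$1"
}

# records_per_gene_id FASTA   (hypothetical helper name)
# Prints "count GeneID" pairs; assumes headers contain "[GeneID=<number>]".
records_per_gene_id() {
    grep '^>' "$1" \
        | sed -n 's/.*\[GeneID=\([0-9][0-9]*\)\].*/\1/p' \
        | sort \
        | uniq -c
}

# Example use:
#   count_fasta_records human-proteins/ncbi_dataset/data/protein.faa
#   records_per_gene_id human-proteins/ncbi_dataset/data/protein.faa
```

A gene with several isoforms should show up with a count greater than one.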

There are a few additional options that might be useful for you:

You can download other files in addition to the protein sequences. Below is a list of the files that can be requested with the --include flag. By default (i.e. without this flag), the rna and protein sequence files are included.
- gene: gene sequence
- rna: transcript
- protein: amino acid sequences
- cds: nucleotide coding sequences
- 5p-utr: 5'-UTR
- 3p-utr: 3'-UTR
- product-report: gene transcript and protein locations and metadata
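For example, to request more than one of these file types at once, the values can be given to --include as a comma-separated list. The wrapper below is only a sketch: the function name `download_gene_package` and the output file name are placeholders, and it is worth checking `datasets download gene symbol --help` for the values your installed version accepts.

```shell
# download_gene_package GENES INCLUDE OUT   (hypothetical wrapper name)
# GENES: file with one gene symbol per line
# INCLUDE: comma-separated file types, e.g. protein,cds
# OUT: name of the zip archive to write
download_gene_package() {
    datasets download gene symbol \
        --inputfile "$1" \
        --taxon human \
        --include "$2" \
        --filename "$3"
}

# Example use:
#   download_gene_package genes.txt protein,cds human-genes.zip
```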
If you want the proteins of each gene in a separate FASTA file, you need to loop over the list of symbols and download one zip archive per gene, like this:

# Read one gene symbol per line from genes.txt and download one archive per gene
while read -r GENE; do
    datasets download gene symbol "$GENE" --taxon human --include protein --filename "$GENE.zip"
done < genes.txt

Collecting 1  records [================================================] 100% 1/1
Downloading: tp53.zip    3.08kB done
Collecting 1  records [================================================] 100% 1/1
Downloading: brca1.zip    29kB done
Collecting 1  records [================================================] 100% 1/1
Downloading: mc1r.zip    2.61kB done

In that case, each data package is named after the gene name/symbol and inside you have all isoforms of each protein. To rename them in a way that might make more sense to you, take a look at this post here.
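One possible way to end up with a FASTA file named after each gene, sketched under the assumption that every archive has the layout shown in the unzip listing above (ncbi_dataset/data/protein.faa). The function names are made up for this example, and `unzip -p` simply writes the archived file to standard output.

```shell
# extract_protein_fasta ZIP OUT   (hypothetical helper name)
# Writes the protein FASTA stored in ZIP to the file OUT.
extract_protein_fasta() {
    unzip -p "$1" ncbi_dataset/data/protein.faa > "$2"
}

# extract_all_packages   (hypothetical helper name)
# For every GENE.zip in the current directory, writes GENE.faa next to it.
extract_all_packages() {
    for zip in *.zip; do
        # Skip the literal pattern when no archive is present.
        [ -e "$zip" ] || continue
        extract_protein_fasta "$zip" "${zip%.zip}.faa"
    done
}

# Example use, run in the directory holding tp53.zip, brca1.zip, mc1r.zip:
#   extract_all_packages
```

After that, tp53.faa, brca1.faa and mc1r.faa each contain all isoforms returned for that gene.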

I hope it helps! Feel free to reach out if you run into any issues :)