Question

How to retrieve protein sequence from gene ID and output a fasta file

3.0 years ago

minifoog ▴ 10

I want to retrieve the protein sequences for the following gene IDs and write them to a FASTA file, each sequence with its identifier.

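from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder; NCBI asks for a contact address

# Search the Gene database for primate TNF genes
handle = Entrez.esearch(db="gene",
                        term="primate[Orgn] AND TNF[Gene Name]",
                        idtype="acc",
                        retmax=50,
                        )
record = Entrez.read(handle)
handle.close()
idlist = record['IdList']  # Gene IDs, returned as strings
print(idlist)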

But I am not sure where to go from here. Any help would be appreciated.

ncbi gene protein biopython entrez • 1.5k views

updated 3.0 years ago by GenoMax 141k • written 3.0 years ago by minifoog ▴ 10

score 3 · Accepted Answer · 2021-04-17

Using EntrezDirect on the command line (output truncated for space):

$ esearch -db gene -query "primate [orgn] AND TNF [gene]" | elink -target protein | efetch -format fasta > tnf.fa
$ more tnf.fa
>sp|Q19LH4.1|TNFA_CALJA RecName: Full=Tumor necrosis factor; AltName: Full=Cachectin; AltName: Full=TNF-alpha; AltName: Full=Tumor necrosis factor ligand superfamily member 2; Short=TNF-a; Contains: RecName: Full=Tumor necrosis factor, membrane form; AltName: Full=N-terminal fragment; Short=NTF; Contains: RecName: Full=Intracellular domain 1; Short=ICD1; Contains: RecName: Full=Intracellular domain 2; Short=ICD2; Contains: RecName: Full=C-domain 1; Contains: RecName: Full=C-domain 2; Contains: RecName: Full=Tumor necrosis factor, soluble form; Flags: Precursor
MSTETMIQDVELAEEALPKTRGPQGSKRRLFLSLFSFLLVAGATALFCLLHFGVIGPQKDELSKDFSLIS
PLALAVRSSSRIPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLVVPSEGLYLVYSQVLFK
GQGCPSNFMLLTHSISRIAVSYQAKVNLLSAIKSPCQRETPQGAKTNPWYEPIYLGGVFQLEKGDRLSAE
INLPDYLDLAESGQVYFGIIGL
>sp|P48094.1|TNFA_MACMU RecName: Full=Tumor necrosis factor; AltName: Full=Cachectin; AltName: Full=TNF-alpha; AltName: Full=Tumor necrosis factor ligand superfamily member 2; Short=TNF-a; Contains: RecName: Full=Tumor necrosis factor, membrane form; AltName: Full=N-terminal fragment; Short=NTF; Contains: RecName: Full=Intracellular domain 1; Short=ICD1; Contains: RecName: Full=Intracellular domain 2; Short=ICD2; Contains: RecName: Full=C-domain 1; Contains: RecName: Full=C-domain 2; Contains: RecName: Full=Tumor necrosis factor, soluble form; Flags: Precursor
MSTESMIRDVELAEEALPRKTAGPQGSRRCWFLSLFSFLLVAGATTLFCLLHFGVIGPQREEFPKDPSLI
SPLAQAVRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELTDNQLVVPSEGLYLIYSQVLF
KGQGCPSNHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSA
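
For anyone who would rather stay in Biopython, as in the question, the same gene -> protein -> FASTA chain can be sketched roughly as below. This is a minimal sketch, not a tested replacement for the pipeline above: the function names `link_ids` and `fetch_protein_fasta`, the output path, and the e-mail address are made up for illustration, and it assumes the ID list is small enough for a single ELink/EFetch request.

```python
def link_ids(linksets):
    """Collect linked UIDs from a parsed ELink result (a list of link sets)."""
    ids = []
    for linkset in linksets:
        # A gene can have several link types (several LinkSetDb entries)
        for linksetdb in linkset.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in linksetdb["Link"])
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(ids))


def fetch_protein_fasta(gene_ids, out_path, email):
    """Gene IDs -> linked protein records -> one multi-FASTA file."""
    # Imported here so link_ids() stays usable without Biopython installed
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address

    # ELink: follow gene -> protein links (the `elink -target protein` step)
    handle = Entrez.elink(dbfrom="gene", db="protein", id=",".join(gene_ids))
    linksets = Entrez.read(handle)
    handle.close()

    protein_ids = link_ids(linksets)
    if not protein_ids:
        return []

    # EFetch: download the linked proteins as FASTA text
    handle = Entrez.efetch(db="protein", id=",".join(protein_ids),
                           rettype="fasta", retmode="text")
    with open(out_path, "w") as out:
        out.write(handle.read())
    handle.close()
    return protein_ids
```

With the `idlist` from the question, a call would look like `fetch_protein_fasta(idlist, "tnf.fa", "you@example.com")`. The set of proteins returned may not match the EntrezDirect output exactly, since it depends on which link types ELink reports.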

To save each sequence in its own file, use:

$ esearch -db gene -query "primate [orgn] AND TNF [gene]" | elink -target protein | efetch -format acc | xargs -n 1 sh -c 'efetch -db protein -id "$0" -format fasta > "$0".fa'
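
If tnf.fa has already been downloaded, the per-sequence files can also be produced offline by splitting it, which avoids one efetch call per record. The sketch below uses only the Python standard library; `split_fasta` and `accession_from_header` are hypothetical helper names, and the way the accession is pulled out of the header line is a heuristic based on the two header shapes I am assuming (`>sp|ACCESSION|NAME ...` as in the output above, or `>ACCESSION description`).

```python
from pathlib import Path


def accession_from_header(header):
    """Heuristic: '>sp|Q19LH4.1|TNFA_CALJA ...' -> 'Q19LH4.1'; '>ACC.1 text' -> 'ACC.1'."""
    token = header[1:].split()[0]   # first word after '>'
    fields = token.split("|")
    return fields[1] if len(fields) > 1 else fields[0]


def split_fasta(fasta_path, out_dir="."):
    """Write each record of a multi-FASTA file to <accession>.fa; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # A new record starts: close the previous file, open the next
                if out is not None:
                    out.close()
                path = out_dir / f"{accession_from_header(line)}.fa"
                out = open(path, "w")
                written.append(path)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Calling `split_fasta("tnf.fa", "tnf_split")` would write files such as tnf_split/Q19LH4.1.fa. Two records with the same accession would overwrite each other, which should not happen in a single efetch result.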