How to retrieve protein sequence from gene ID and output a fasta file
3.0 years ago
minifoog ▴ 10

I want to retrieve the protein sequences for the following gene IDs and write them to a FASTA file, each sequence with its identifier.

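from Bio import Entrez

# NCBI asks for a contact address with every E-utilities request
Entrez.email = "you@example.com"  # placeholder, use your own address

handle = Entrez.esearch(db="gene",
                        term="primate[Orgn] AND TNF[Gene Name]",
                        idtype="acc",
                        retmax='50',
                        )
record = Entrez.read(handle)
handle.close()
idlist = record['IdList']  # UIDs from the gene database
print(idlist)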

But I am not sure where to go from here. Any help would be appreciated.

ncbi gene protein biopython entrez • 1.5k views
3.0 years ago
GenoMax 141k

Using the command-line EntrezDirect tools (output truncated for space):

$ esearch -db gene -query "primate [orgn] AND TNF [gene]" | elink -target protein | efetch -format fasta > tnf.fa
$ more tnf.fa
>sp|Q19LH4.1|TNFA_CALJA RecName: Full=Tumor necrosis factor; AltName: Full=Cachectin; AltName: Full=TNF-alpha; AltName: Full=Tumor necrosis factor ligand superfamily member 2; Short=TNF-a; Contains: RecName: Full=Tumor necrosis factor, membrane form; AltName: Full=N-terminal fragment; Short=NTF; Contains: RecName: Full=Intracellular domain 1; Short=ICD1; Contains: RecName: Full=Intracellular domain 2; Short=ICD2; Contains: RecName: Full=C-domain 1; Contains: RecName: Full=C-domain 2; Contains: RecName: Full=Tumor necrosis factor, soluble form; Flags: Precursor
MSTETMIQDVELAEEALPKTRGPQGSKRRLFLSLFSFLLVAGATALFCLLHFGVIGPQKDELSKDFSLIS
PLALAVRSSSRIPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELRDNQLVVPSEGLYLVYSQVLFK
GQGCPSNFMLLTHSISRIAVSYQAKVNLLSAIKSPCQRETPQGAKTNPWYEPIYLGGVFQLEKGDRLSAE
INLPDYLDLAESGQVYFGIIGL
>sp|P48094.1|TNFA_MACMU RecName: Full=Tumor necrosis factor; AltName: Full=Cachectin; AltName: Full=TNF-alpha; AltName: Full=Tumor necrosis factor ligand superfamily member 2; Short=TNF-a; Contains: RecName: Full=Tumor necrosis factor, membrane form; AltName: Full=N-terminal fragment; Short=NTF; Contains: RecName: Full=Intracellular domain 1; Short=ICD1; Contains: RecName: Full=Intracellular domain 2; Short=ICD2; Contains: RecName: Full=C-domain 1; Contains: RecName: Full=C-domain 2; Contains: RecName: Full=Tumor necrosis factor, soluble form; Flags: Precursor
MSTESMIRDVELAEEALPRKTAGPQGSRRCWFLSLFSFLLVAGATTLFCLLHFGVIGPQREEFPKDPSLI
SPLAQAVRSSSRTPSDKPVAHVVANPQAEGQLQWLNRRANALLANGVELTDNQLVVPSEGLYLIYSQVLF
KGQGCPSNHVLLTHTISRIAVSYQTKVNLLSAIKSPCQRETPEGAEAKPWYEPIYLGGVFQLEKGDRLSA
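
For anyone who would rather stay in Biopython, as in the question, the same esearch, elink, efetch chain can be sketched roughly as below. This is only a sketch under a few assumptions: the function names `linked_protein_ids` and `fetch_linked_proteins` are made up for illustration, the email address is a placeholder, and the set of linked proteins may not match the EntrezDirect output exactly, since elink can return several link types per gene.

```python
def linked_protein_ids(linksets):
    """Collect unique protein UIDs from a parsed elink result, keeping order."""
    seen = {}
    for linkset in linksets:
        # Each LinkSetDb entry is one link type (several may be present).
        for linkdb in linkset.get("LinkSetDb", []):
            for link in linkdb["Link"]:
                seen.setdefault(link["Id"], None)
    return list(seen)


def fetch_linked_proteins(term, out_path, email, retmax=50):
    """Search the gene database, follow gene -> protein links, write FASTA.

    Returns the number of protein UIDs requested from NCBI.
    """
    # Imported here so linked_protein_ids() can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address

    # 1. Gene search, as in the question.
    handle = Entrez.esearch(db="gene", term=term, retmax=retmax)
    gene_ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not gene_ids:
        return 0

    # 2. Follow links from gene records to protein records.
    handle = Entrez.elink(dbfrom="gene", db="protein", id=gene_ids)
    protein_ids = linked_protein_ids(Entrez.read(handle))
    handle.close()
    if not protein_ids:
        return 0

    # 3. Fetch the linked proteins as FASTA text and save them in one file.
    handle = Entrez.efetch(db="protein", id=protein_ids,
                           rettype="fasta", retmode="text")
    fasta_text = handle.read()
    handle.close()
    with open(out_path, "w") as out:
        out.write(fasta_text)
    return len(protein_ids)
```

A call such as `fetch_linked_proteins("primate[Orgn] AND TNF[Gene Name]", "tnf.fa", "you@example.com")` would then write everything to `tnf.fa`. It needs network access, and NCBI rate limits apply.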

If you want to save each sequence in a separate file, use:

$ esearch -db gene -query "primate [orgn] AND TNF [gene]" | elink -target protein | efetch -format acc | xargs -n 1 sh -c 'efetch -db protein -id "$0" -format fasta > "$0".fa'
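
If the combined FASTA has already been downloaded, splitting it locally avoids a second round of requests. The sketch below is one way to do that in plain Python. `split_fasta` is a hypothetical helper name, and unlike the xargs command above it names each file after the first token of the header (with characters such as `|` replaced by `_`) rather than after the bare accession. Records that share an identifier would overwrite each other.

```python
import re
from pathlib import Path


def split_fasta(fasta_text, out_dir):
    """Write each record of a multi-FASTA string to its own .fa file.

    Returns the list of paths written, in input order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    records = {}   # sanitized identifier -> lines of that record
    current = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # ">sp|Q19LH4.1|TNFA_CALJA RecName: ..." -> "sp|Q19LH4.1|TNFA_CALJA"
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else "unnamed"
            # Keep only characters that are safe in file names.
            current = re.sub(r"[^A-Za-z0-9._-]", "_", seq_id)
            records[current] = [line]
        elif current is not None and line.strip():
            records[current].append(line)

    paths = []
    for name, lines in records.items():
        path = out_dir / f"{name}.fa"
        path.write_text("\n".join(lines) + "\n")
        paths.append(path)
    return paths
```

For example, `split_fasta(open("tnf.fa").read(), "tnf_proteins")` would create one file per record under `tnf_proteins/`.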
This works, thanks so much!
