Question: Need help to retrieve sequences
11 months ago, Jason wrote:

I have no experience with bioinformatics.

I have a list of around 200 sequence IDs in file1.txt, as shown below, and I am looking for a script (Python, shell, etc.) to process them:

sp|Q66LE6|2ABD_HUMAN

sp|Q9UKV3|ACINU_HUMAN

I want to retrieve the full sequences for these IDs and save them in a text or FASTA file. The result should look like this:

>sp|P61247.2|RS3A_HUMAN RecName: Full=40S ribosomal protein S3a; AltName: Full=Small ribosomal subunit protein eS1; AltName: Full=v-fos transformation effector protein; Short=Fte-1
MAVGKNKRLTKGGKKGAKKKVVDPFSKKDWYDVKAPAMFNIRNIGKTLVTRTQGTKIASDGLKGRVFEVS
LADLQNDEVAFRKFKLITEDVQGKNCLTNFHGMDLTRDKMCSMVKKWQTMIEAHVDVKTTDGYLLRLFCV
GFTKKRNNQIRKTSYAQHQQVRQIRKKMMEIMTREVQTNDLKEVVNKLIPDSIGKDIEKACQSIYPLHDV
FVRKVKMLKKPKFELGKLMELHGEGSSSGKATGDETGAKVERADGYEPPVQESV


>sp|Q96EB6.2|SIR1_HUMAN RecName: Full=NAD-dependent protein deacetylase sirtuin-1; Short=hSIRT1; AltName: Full=Regulatory protein SIR2 homolog 1; AltName: Full=SIR2-like protein 1; Short=hSIR2; Contains: RecName: Full=SirtT1 75 kDa fragment; Short=75SirT1
MADEAALALQPGGSPSAAGADREAASSPAGEPLRKRPRRDGPGLERSPGEPGGAAPEREVPAAARGCPGA
AAAALWREAEAEAAAAGGEQEAQATAAAGEGDNGPGLQGPSREPPLADNLYDEDDDDEGEEEEEAAAAAI
GYRDNLLFGDEIITNGFHSCESDEEDRASHASSSDWTPRPRIGPYTFVQQHLMIGTDPRTILKDLLPETI

Hello Jason, can you be more precise? Do you want the protein sequences from PDB, NCBI, ENSEMBL, UNIPROT, ...? Several databases have APIs that let you extract data using an ID as the entry point. ;)

Comment written 11 months ago by glihm
11 months ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:
cut -f 2 -d '|' file1.txt | while read A; do wget -q -O - "http://www.uniprot.org/uniprot/${A}.fasta" ; done
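
For readers who prefer Python, the same idea (take the second `|`-separated field of each ID and download its FASTA record) could be sketched roughly as follows. This is only a sketch under assumptions: it uses UniProt's REST endpoint `https://rest.uniprot.org/uniprotkb/<accession>.fasta` rather than the URL in the one-liner above, and the function names and the output file name `sequences.fasta` are hypothetical.

```python
import urllib.request

# Assumption: UniProt's REST endpoint for a single FASTA record.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def accession_from_id(line):
    """Return the second '|' field, e.g. 'Q66LE6' from 'sp|Q66LE6|2ABD_HUMAN'."""
    fields = line.strip().split("|")
    if len(fields) < 2 or not fields[1]:
        return None  # blank or malformed line
    return fields[1]


def read_accessions(path):
    """Collect the accessions from an ID file, skipping lines without one."""
    accessions = []
    with open(path) as handle:
        for line in handle:
            accession = accession_from_id(line)
            if accession:
                accessions.append(accession)
    return accessions


def fetch_fasta(accession, timeout=30):
    """Download one FASTA record as text (requires network access)."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def main(id_file="file1.txt", out_file="sequences.fasta"):
    """Write the FASTA record of every accession in id_file to out_file."""
    with open(out_file, "w") as out:
        for accession in read_accessions(id_file):
            out.write(fetch_fasta(accession))

# Example call (needs file1.txt and a network connection):
# main("file1.txt", "sequences.fasta")
```

With around 200 IDs, one request per accession should be manageable, although adding a short pause between requests would be kinder to the server.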
11 months ago, Kevin Blighe (USA / Europe / Brazil) wrote:

This Python script will extract FASTA protein records for your IDs of the form Q66LE6, Q9UKV3, etc. It assumes that everything is human. You can also search using 2ABD, ACINU, etc., but that returns more hits.

# Requires Biopython, which provides Bio.Entrez
import argparse

from Bio import Entrez

parser = argparse.ArgumentParser(description='Searches for a human protein sequence by any provided ID or accession number.')
parser.add_argument('-f', action='store', dest='SearchTerms', required=True, help='The column number containing the search terms in the provided file (starting at 1).')
parser.add_argument('-e', action='store', dest='EmailAddress', required=True, help='Entrez requires your email address.')
parser.add_argument('InputFile', help='Input file')

arguments = parser.parse_args()

Entrez.email = arguments.EmailAddress

# Convert the 1-based column number to a 0-based index
iSearchTerm_Col = int(arguments.SearchTerms) - 1

with open(arguments.InputFile, 'r') as InputFile:

    for line in InputFile:

        # Skip blank lines, which would otherwise raise an IndexError
        if not line.strip():
            continue

        LookupTerm = line.split()[iSearchTerm_Col]

        # Restrict the search to human (txid9606) RefSeq records
        LookupCommand = 'refseq[FILTER] AND txid9606[Organism] AND {}'.format(LookupTerm)

        handle = Entrez.esearch(db='protein', term=LookupCommand)

        results = Entrez.read(handle)

        handle.close()

        # Look up the FASTA sequence for each protein by the identifier returned by the search
        for gi in results['IdList']:
            handle = Entrez.efetch(db='protein', id=gi, rettype='fasta', retmode='text')

            print(handle.read())

            handle.close()

Execute it as follows: python ProteinSearch.py -f 1 -e myemail@gmail.com proteinsearch.list

proteinsearch.list contains a single list of your IDs:

Q66LE6

Q9UKV3

...

..

>NP_060931.2 serine/threonine-protein phosphatase 2A 55 kDa regulatory subunit B delta isoform isoform a [Homo sapiens]
MAGAGGGGCPAGGNDFQWCFSQVKGAIDEDVAEADIISTVEFNYSGDLLATGDKGGRVVIFQREQENKSR
PHSRGEYNVYSTFQSHEPEFDYLKSLEIEEKINKIRWLPQQNAAHFLLSTNDKTIKLWKISERDKRAEGY
NLKDEDGRLRDPFRITALRVPILKPMDLMVEASPRRIFANAHTYHINSISVNSDHETYLSADDLRINLWH
LEITDRSFNIVDIKPANMEELTEVITAAEFHPHQCNVFVYSSSKGTIRLCDMRSSALCDRHSKFFEEPED
PSSRSFFSEIISSISDVKFSHSGRYMMTRDYLSVKVWDLNMESRPVETHQVHEYLRSKLCSLYENDCIFD
KFECCWNGSDSAIMTGSYNNFFRMFDRDTRRDVTLEASRESSKPRASLKPRKVCTGGKRRKDEISVDSLD
FNKKILHTAWHPVDNVIAVAATNNLYIFQDKIN

>NP_001158286.1 apoptotic chromatin condensation inducer in the nucleus isoform 2 [Homo sapiens]
MWRRKHPRTSGGTRGVLSGNRGVEYGSGRGHLGTFEGRWRKLPKMPEAVGTDPSTSRKMAELEEVTLDGK
PLQALRVTDLKAALEQRGLAKSGQKSALVKRLKGALMLENLQKHSTPHAAFQPNSQIGEEMSQNSFIKQY
LEKQQELLRQRLEREAREAAELEEASAESEDEMIHPEGVASLLPPDFQSSLERPELELSRHSPRKSSSIS
EEKGDSDDEKPRKGERRSSRVRQARAAKLSEGSQPAEEEEDQETPSRNLRVRADRNLKTEEEEEEEEEEE
EDDEEEEGDDEGQKSREAPILKEFKEEGEEIPRVKPEEMMDERPKTRSQEQEVLERGGRFTRSQEEARKS
HLARQQQEKEMKTTSPLEEEEREIKSSQGLKEKSKSPSPPRLTEDRKKASLVALPEQTASEEETPPPLLT
KEASSPPPHPQLHSEEEIEPMEGPAPPVLIQLSPPNTDADTRELLVSQHTVQLVGGLSPLSSPSDTKAES
PAEKVPEESVLPLVQKSTLADYSAQKDLEPESDRSAQPLPLKIEELALAKGITEECLKQPSLEQKEGRRA
SHTLLPSHRLKQSADSSSSRSSSSSSSSSRSRSRSPDSSGSRSHSPLRSKQRDVAQARTHANPRGRPKMG
SRSTSESRSRSRSRSRSASSNSRKSLSPGVSRDSSTSYTETKDPSSGQEVATPPVPQLQVCEPKERTSTS
SSSVQARRLSQPESAEKHVTQRLQPERGSPKKCEAEEAEPPAATQPQTSETQTSHLPESERIHHTVEEKE
EVTMDTSENRPENDVPEPPMPIADQVSNDDRPEGSVEDEEKKESSLPKSFKRKISVVSTKGVPAGNSDTE
GGQPGRKRRWGASTATTQKKPSISITTESLKEAVVDLHADDSRISEDETERNGDDGTHDKGLKICRTVTQ
VVPAEGQENGQREEEEEEKEPEAEPPVPPQVSVEVALPPPAEHEVKKVTLGDTLTRRSISQQKSGVSITI
DDPVRTAQVPSPPRGKISNIVHISNLVRPFTLGQLKELLGRTGTLVEEAFWIDKIKSHCFVTYSTVEEAV
ATRTALHGVKWPQSNPKFLCADYAEQDELDYHRGLLVDRPSETKTEEQGIPRPLHPPPPPPVQPPQHPRA
EQREQERAVREQWAEREREMERRERTRSEREWDRDKVREGPRSRSRSRDRRRKERAKSKEKKSEKKEKAQ
EEPPAKLLDDLFRKTKAAPCIYWLPLTDSQIVQKEAERAERAKEREKRRKEQEEEEQKEREKEAERERNR
QLEREKRREHSRERDRERERERERDRGDRDRDRERDRERGRERDRRDTKRHSRSRSRSTPVRDRGGRR

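Since the script prints its records to standard output, the output can be redirected into a file and then checked to see how many records came back, which is useful because a search term can return several hits or none. A minimal sketch of such a check, assuming plain FASTA text as input; `parse_fasta` is a hypothetical helper, not part of Biopython or of the script above:

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines separate records but carry no data
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence line belonging to the current record
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Small made-up example, only to show the shape of the result
example = ">id1 first protein\nMAGA\nGGGG\n\n>id2 second protein\nMWRR\n"
for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

Comparing the number of records with the number of IDs in the input list gives a quick sanity check before using the sequences further.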