Question: Need help to retrieve sequences
Jason wrote 5 weeks ago:

I have no experience with bioinformatics.

I have a list of around 200 sequence IDs in file1.txt, as shown below:

sp|Q66LE6|2ABD_HUMAN

sp|Q9UKV3|ACINU_HUMAN

I want to retrieve the full sequence for each of these IDs with a script (Python, shell, etc.) and save the results in a text or FASTA file. The result should look like this:

>sp|P61247.2|RS3A_HUMAN RecName: Full=40S ribosomal protein S3a; AltName: Full=Small ribosomal subunit protein eS1; AltName: Full=v-fos transformation effector protein; Short=Fte-1
MAVGKNKRLTKGGKKGAKKKVVDPFSKKDWYDVKAPAMFNIRNIGKTLVTRTQGTKIASDGLKGRVFEVS
LADLQNDEVAFRKFKLITEDVQGKNCLTNFHGMDLTRDKMCSMVKKWQTMIEAHVDVKTTDGYLLRLFCV
GFTKKRNNQIRKTSYAQHQQVRQIRKKMMEIMTREVQTNDLKEVVNKLIPDSIGKDIEKACQSIYPLHDV
FVRKVKMLKKPKFELGKLMELHGEGSSSGKATGDETGAKVERADGYEPPVQESV


>sp|Q96EB6.2|SIR1_HUMAN RecName: Full=NAD-dependent protein deacetylase sirtuin-1; Short=hSIRT1; AltName: Full=Regulatory protein SIR2 homolog 1; AltName: Full=SIR2-like protein 1; Short=hSIR2; Contains: RecName: Full=SirtT1 75 kDa fragment; Short=75SirT1
MADEAALALQPGGSPSAAGADREAASSPAGEPLRKRPRRDGPGLERSPGEPGGAAPEREVPAAARGCPGA
AAAALWREAEAEAAAAGGEQEAQATAAAGEGDNGPGLQGPSREPPLADNLYDEDDDDEGEEEEEAAAAAI
GYRDNLLFGDEIITNGFHSCESDEEDRASHASSSDWTPRPRIGPYTFVQQHLMIGTDPRTILKDLLPETI

Hello Jason, can you give more details? Do you want the protein sequences from PDB, NCBI, ENSEMBL, UNIPROT, ...? Several databases have APIs that let you extract data using an ID as the entry point. ;)

Comment written 5 weeks ago by glihm
Pierre Lindenbaum, France/Nantes/Institut du Thorax - INSERM UMR1087, wrote 5 weeks ago:
cut -f 2 -d '|' file1.txt | while read A; do wget -q -O - "http://www.uniprot.org/uniprot/${A}.fasta" ; done
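
For readers who prefer Python, the same idea can be sketched as follows: take the second `|`-separated field of each line as the accession, request its FASTA record, and write everything to one file. This is only a sketch under assumptions: the URL pattern is copied from the command above, the function names and the output file name `sequences.fasta` are made up for illustration, and blank or malformed lines are skipped.

```python
import urllib.request

# Assumption: same URL pattern as the wget one-liner above.
UNIPROT_URL = "http://www.uniprot.org/uniprot/{}.fasta"


def accession_from_id(line):
    """Return the second '|'-separated field, or None for blank/malformed lines."""
    fields = line.strip().split("|")
    if len(fields) < 2 or not fields[1]:
        return None
    return fields[1]


def fasta_url(accession):
    """Build the FASTA download URL for one accession."""
    return UNIPROT_URL.format(accession)


def fetch_all(id_path, out_path):
    """Download one FASTA record per ID in id_path and write them all to out_path."""
    with open(id_path) as ids, open(out_path, "w") as out:
        for line in ids:
            accession = accession_from_id(line)
            if accession is None:
                continue
            with urllib.request.urlopen(fasta_url(accession)) as response:
                out.write(response.read().decode("utf-8"))


# Example call (needs network access; file names are illustrative):
# fetch_all("file1.txt", "sequences.fasta")
```

With around 200 IDs, one request per ID as done here is simple and should be manageable.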
Kevin Blighe, Republic of Ireland (Éire), wrote 5 weeks ago:

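This Python script will retrieve FASTA protein records for IDs of the form Q66LE6, Q9UKV3, etc. It queries the NCBI protein database through Biopython's Entrez module and restricts the search to human RefSeq records, so it assumes that everything is human. You can also search using 2ABD, ACINU, etc., but that returns more hits.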

import argparse

from Bio import Entrez

parser = argparse.ArgumentParser(description='Searches for a human protein sequence by any provided ID or accession number.')
parser.add_argument('-f', action='store', dest='SearchTerms', type=int, required=True, help='The column number containing the search terms in the provided file (starting at 1).')
parser.add_argument('-e', action='store', dest='EmailAddress', required=True, help='Entrez requires your email address.')
parser.add_argument('InputFile', help='Input file')

arguments = parser.parse_args()

# NCBI asks for a contact email address with Entrez requests
Entrez.email = arguments.EmailAddress

# Convert the 1-based column number to a 0-based index
iSearchTerm_Col = arguments.SearchTerms - 1

with open(arguments.InputFile, 'r') as InputFile:
    for line in InputFile:
        fields = line.split()

        # Skip blank lines and lines that lack the requested column
        if len(fields) <= iSearchTerm_Col:
            continue

        LookupTerm = fields[iSearchTerm_Col]

        # Restrict the search to human (taxonomy ID 9606) RefSeq records
        LookupCommand = 'refseq[FILTER] AND txid9606[Organism] AND {}'.format(LookupTerm)
        handle = Entrez.esearch(db='protein', term=LookupCommand)
        results = Entrez.read(handle)
        handle.close()

        # Look up the FASTA sequence for each protein by the ID returned from the search
        for gi in results['IdList']:
            handle = Entrez.efetch(db='protein', id=gi, rettype='fasta', retmode='text')
            print(handle.read())
            handle.close()

Execute it as follows: python ProteinSearch.py -f 1 -e myemail@gmail.com proteinsearch.list
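
Note that file1.txt in the question holds full identifiers such as sp|Q66LE6|2ABD_HUMAN, whereas the command above is given a file of bare accessions. One possible way to build that file is sketched below; the function name is hypothetical, and it assumes every useful line has the form `db|accession|name`.

```python
def write_accession_list(id_path, list_path):
    """Copy the accession from each 'db|accession|name' line into list_path.

    Blank or malformed lines are skipped. Returns the number of accessions written.
    """
    count = 0
    with open(id_path) as ids, open(list_path, "w") as out:
        for line in ids:
            fields = line.strip().split("|")
            if len(fields) < 2 or not fields[1]:
                continue
            out.write(fields[1] + "\n")
            count += 1
    return count


# Example call (file names taken from this thread):
# write_accession_list("file1.txt", "proteinsearch.list")
```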

proteinsearch.list contains your IDs in a single column, one per line:

Q66LE6

Q9UKV3

...

..

>NP_060931.2 serine/threonine-protein phosphatase 2A 55 kDa regulatory subunit B delta isoform isoform a [Homo sapiens]
MAGAGGGGCPAGGNDFQWCFSQVKGAIDEDVAEADIISTVEFNYSGDLLATGDKGGRVVIFQREQENKSR
PHSRGEYNVYSTFQSHEPEFDYLKSLEIEEKINKIRWLPQQNAAHFLLSTNDKTIKLWKISERDKRAEGY
NLKDEDGRLRDPFRITALRVPILKPMDLMVEASPRRIFANAHTYHINSISVNSDHETYLSADDLRINLWH
LEITDRSFNIVDIKPANMEELTEVITAAEFHPHQCNVFVYSSSKGTIRLCDMRSSALCDRHSKFFEEPED
PSSRSFFSEIISSISDVKFSHSGRYMMTRDYLSVKVWDLNMESRPVETHQVHEYLRSKLCSLYENDCIFD
KFECCWNGSDSAIMTGSYNNFFRMFDRDTRRDVTLEASRESSKPRASLKPRKVCTGGKRRKDEISVDSLD
FNKKILHTAWHPVDNVIAVAATNNLYIFQDKIN

>NP_001158286.1 apoptotic chromatin condensation inducer in the nucleus isoform 2 [Homo sapiens]
MWRRKHPRTSGGTRGVLSGNRGVEYGSGRGHLGTFEGRWRKLPKMPEAVGTDPSTSRKMAELEEVTLDGK
PLQALRVTDLKAALEQRGLAKSGQKSALVKRLKGALMLENLQKHSTPHAAFQPNSQIGEEMSQNSFIKQY
LEKQQELLRQRLEREAREAAELEEASAESEDEMIHPEGVASLLPPDFQSSLERPELELSRHSPRKSSSIS
EEKGDSDDEKPRKGERRSSRVRQARAAKLSEGSQPAEEEEDQETPSRNLRVRADRNLKTEEEEEEEEEEE
EDDEEEEGDDEGQKSREAPILKEFKEEGEEIPRVKPEEMMDERPKTRSQEQEVLERGGRFTRSQEEARKS
HLARQQQEKEMKTTSPLEEEEREIKSSQGLKEKSKSPSPPRLTEDRKKASLVALPEQTASEEETPPPLLT
KEASSPPPHPQLHSEEEIEPMEGPAPPVLIQLSPPNTDADTRELLVSQHTVQLVGGLSPLSSPSDTKAES
PAEKVPEESVLPLVQKSTLADYSAQKDLEPESDRSAQPLPLKIEELALAKGITEECLKQPSLEQKEGRRA
SHTLLPSHRLKQSADSSSSRSSSSSSSSSRSRSRSPDSSGSRSHSPLRSKQRDVAQARTHANPRGRPKMG
SRSTSESRSRSRSRSRSASSNSRKSLSPGVSRDSSTSYTETKDPSSGQEVATPPVPQLQVCEPKERTSTS
SSSVQARRLSQPESAEKHVTQRLQPERGSPKKCEAEEAEPPAATQPQTSETQTSHLPESERIHHTVEEKE
EVTMDTSENRPENDVPEPPMPIADQVSNDDRPEGSVEDEEKKESSLPKSFKRKISVVSTKGVPAGNSDTE
GGQPGRKRRWGASTATTQKKPSISITTESLKEAVVDLHADDSRISEDETERNGDDGTHDKGLKICRTVTQ
VVPAEGQENGQREEEEEEKEPEAEPPVPPQVSVEVALPPPAEHEVKKVTLGDTLTRRSISQQKSGVSITI
DDPVRTAQVPSPPRGKISNIVHISNLVRPFTLGQLKELLGRTGTLVEEAFWIDKIKSHCFVTYSTVEEAV
ATRTALHGVKWPQSNPKFLCADYAEQDELDYHRGLLVDRPSETKTEEQGIPRPLHPPPPPPVQPPQHPRA
EQREQERAVREQWAEREREMERRERTRSEREWDRDKVREGPRSRSRSRDRRRKERAKSKEKKSEKKEKAQ
EEPPAKLLDDLFRKTKAAPCIYWLPLTDSQIVQKEAERAERAKEREKRRKEQEEEEQKEREKEAERERNR
QLEREKRREHSRERDRERERERERDRGDRDRDRERDRERGRERDRRDTKRHSRSRSRSTPVRDRGGRR
