Forum:merge Header by extracting protein sequence
5.1 years ago
solasol • 0

Hi everyone,

This is my first question on Biostars, and I hope I can get some help with this issue.

I have two files:

File A: a FASTA sequence file (protein format)

Example for File A

File B: list of X number of proteins (without their sequences)

Example for File B:

Protein IDs
SEN0002-thrB-missing_gene_synonym_qualifer-CAR31593.1-homoserine kinase-2565:3494 Forward
SEN0003-thrC-missing_gene_synonym_qualifer-CAR31594.1-threonine synthase-3498:4784 Forward
SEN0004-yaaA-missing_gene_synonym_qualifer-CAR31595.1-conserved hypothetical protein-4878:5651 Reverse
SEN0006-talB-missing_gene_synonym_qualifer-CAR31597.1-transaldolase B-7429:8382 Forward
SEN0007-mog-missing_gene_synonym_qualifer-CAR31598.1-molybdopterin biosynthesis Mog protein-8493:9083 Forward
SEN0011-dnaK-missing_gene_synonym_qualifer-CAR31602.1-DnaK protein (heat shock protein 70)-11358:13274 Forward
SEN0012-dnaJ-missing_gene_synonym_qualifer-CAR31603.1-DnaJ protein-13360:14499 Forward
SEN0043-rpsT-missing_gene_synonym_qualifer-CAR31634.1-30S ribosomal protein S20-52034:52297 Reverse
SEN0046-ileS-missing_gene_synonym_qualifer-CAR31637.1-isoleucyl-tRNA synthetase-53609:56443 Forward
SEN0048-slpA-missing_gene_synonym_qualifer-CAR31639.1-probable FkbB-type 16 kD peptidyl-prolyl cis-trans isomerase-57098:57547 Forward
SEN0065-dapB-missing_gene_synonym_qualifer-CAR31655.1-dihydrodipicolinate reductase-73766:74587 Forward
SEN0066-carA-missing_gene_synonym_qualifer-CAR31656.1-carbamoyl-phosphate synthase small chain-75449:76597 Forward
SEN0067-carB-missing_gene_synonym_qualifer-CAR31657.1-carbamoyl-phosphate synthase large chain-76616:79843 Forward
SEN0089-folA-missing_gene_synonym_qualifer-CAR31676.1-dihydrofolate reductase type I-100408:100887 Forward
SEN0094-surA-missing_gene_synonym_qualifer-CAR31681.1-survival protein SurA precursor-104039:105325 Reverse
SEN0113-leuB-missing_gene_synonym_qualifer-CAR31702.1-3-isopropylmalate dehydrogenase-130762:131853 Reverse
SEN0124-murE-missing_gene_synonym_qualifer-CAR31713.1-UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-dia minopim ligase-143165:144652 Forward
SEN0125-murF-missing_gene_synonym_qualifer-CAR31714.1-UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diami nopimelate--D-alan alanyl ligase-144649:146007 Forward


My question is: how can I merge these 2 files to extract the sequence of each protein in file B from file A? (In this case there are only 20 proteins, but I also have cases with 1000 proteins!)

I started a course in Rstudio last week, is there a script to use for this task?

Thank you a lot in advance!

Best!

Solasol

sequence R Forum • 1.2k views

Welcome to Biostars. What have you tried? Text processing is much simpler in Perl, Python, or with Linux command-line tools.


Dear Vari

I tried in R but did not manage to write a script!

I have barely used R, so for me all of this is a black box :)

Cheers


If not R, you can look at this


If you don't get this sorted out today, just reply to this comment and I will post something in Python that you can use to accomplish the task easily.

5.1 years ago
GenoMax 99k

Step 1: Get faSomeRecords utility from Jim Kent at UCSC. (Linux link, OS X or source available).

Step 2: Make the file executable

$ chmod u+x faSomeRecords

Step 3: Run faSomeRecords

$ ./faSomeRecords
faSomeRecords - Extract multiple fa records
usage:
faSomeRecords in.fa listFile out.fa

• in.fa = Your sequence file
• listFile = file with sequence names
• out.fa = file to store the result
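
If a Linux machine is not at hand, the same kind of filtering can be sketched in plain Python. This is not the faSomeRecords source, only a minimal illustration of the idea. It assumes that each line of the list file is identical to a full header line in the FASTA file (everything after the ">"), which may not hold for your files, so check a few headers first. The function names and the file names in the usage note are made up for the example.

```python
def read_ids(list_path):
    """Return the set of non-empty lines found in the ID list file."""
    with open(list_path) as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_fasta(lines, wanted):
    """Yield the lines of every record whose header is in `wanted`.

    Assumption: a header matches when the text after '>' (with
    surrounding whitespace stripped) is exactly one of the wanted IDs.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A new record starts here; decide whether to keep it.
            keep = line[1:].strip() in wanted
        if keep:
            yield line


def extract_records(fasta_path, list_path, out_path):
    """Write the selected records from fasta_path to out_path."""
    wanted = read_ids(list_path)
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        out.writelines(filter_fasta(fasta, wanted))
```

It could be called as extract_records("fileA.fasta", "fileB.txt", "selected.fasta"). A title line such as "Protein IDs" in the list file is harmless because it matches no header. The IDs are held in a set, so a list of 1000 proteins is handled the same way as a list of 20.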

Dear genomax2,

Sorry, this might sound like a naive question, but since I really have no experience with these scripts and software, I am a little confused. I clicked on the link you sent; which program should I open this file with? And where should I run faSomeRecords?


You should save the file linked above (right-click on the link and choose "save as", or use wget to download it directly) to a Linux machine (this file is meant for use with Linux and will not work on Windows). The linked file is an executable program, and you are going to run it as I showed above.

Do you have access to a Linux server/computer? What OS are you using?


If you are going to work with bioinformatics programs, it is highly advisable to familiarize yourself with the command line (Linux). Here is a nice online resource you can use.

For simple tasks like this, you can use a virtual machine running Linux on Windows.