Question about searching for/manipulating FASTA sequences in Biopython
2
0
Entering edit mode
7.1 years ago
jon.sy.tarn ▴ 10

Hi all;

sorry for my lack of knowledge. I'm new to biopython and looking through the manual, there's so much information I don't know where to begin. Basically, my FASTA files look like this:

>Ribosomal_L16___Rinke_et_al___edaecdd15ca8bf2bacd09a738e78218887c932d5f415ce2dd05b8fa9|bin_id:Acidianus_Hospitalis_W1.fixed|source:Rinke_et_al|e_value:3.2e-29|contig:c_000000000001|gene_callers_id:1239|start:1112992|stop:1113442|length:450
ATGCCAAAA…

I'm looking to take a list of protein names and I'm trying to search for, find, and print out the resulting sequence from said name. That is to say, if I were to look up "Ribosomal_L16", it would print out "ATGCCAAAA…".

This seems simple enough, except for that most of the title outside of the protein name (i.e. "Ribosomal_L16") is completely variable. Does anyone have a good place for me to start thinking about how to solve this issue?

thanks so much.

sequence biopython • 1.5k views
ADD COMMENT
1
Entering edit mode

You should be more informative with regard to your fasta input file.
It's unclear if the protein name (Ribosomal_L16 etc) is:

  • Always first in the fasta identifier (python: string.startswith())
  • Always separated by ___ (python: string.split('___')[0])
  • Present once, multiple times or potential absent from your fasta file
  • ...
ADD REPLY
0
Entering edit mode

I've recently learned how to use biopython, and python for that matter. Sometimes I found myself overthinking the problem I had, and thought it was a biology problem. There is a much easier way to approach this by string matching. If the '__' is conserved, you could split the name at that point, and search for that part of the name (it might be sequence.name, i don't remember).

protein_name = str ( sequence.id).split('__')[0]
ADD REPLY
0
Entering edit mode

I'm new to biopython and looking through the manual Does anyone have a good place for me to start thinking about how to solve this issue?

Have you written a snippet of python code you can show?

ADD REPLY
0
Entering edit mode
7.1 years ago
Ram 43k

I'd look at Heng Li's bioawk - it combines awk with built in parsers for biological formats, so you can use regular expressions to sift through FASTA components such as the title.

ADD COMMENT
0
Entering edit mode
7.1 years ago

try seqkit:

seqkit grep -n -r  -i -f protein_names.txt seqs.fasta | seqkit seq -s -w 0
  • seqkit grep -n -r -i -f protein_names.txt means searching FASTA records (grep) of which the full header (-n/--by-name) match any of regular expression (-r/--use-regexp) from files (-f/--pattern-file), cases ignored (-i/--ignore-case)
  • seqkit seq -s -w 0 means only print sequences (-s/--seq), and do not wrap sequences to fixed length (-w/--line-width)
ADD COMMENT
0
Entering edit mode

I'd recommend not giving out close-to-exact commands - users are better off learning tools on their own. Especially here, when the OP says

Does anyone have a good place for me to start thinking about how to solve this issue?

giving a command is unadvisable.

ADD REPLY
0
Entering edit mode

Given this, advices on the specific programming language, library, and programming pattern are the best appropriate solution, so is . :P

ADD REPLY
0
Entering edit mode

thanks, the seqkit tip is definitely helpful! I'll try it out.

ADD REPLY

Login before adding your answer.

Traffic: 1873 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6