Question: Extraction Of Fasta Sequence By Nucleotide Position
2
gravatar for rosarylimyt
5.9 years ago by
rosarylimyt70
rosarylimyt70 wrote:

Hi all,

I'm relatively new in Python and I need help in extraction of FASTA sequences using separate lists of start and end coordinates. Here are my codes below:

import sys
import os
from Bio import SeqIO

infile1 = sys.argv[1] # your original FASTA file to extract sequences from                                                                                                                               
fasta_file = open(infile1)

### To extract your set of start and end coordinates                                                                                                                                                       
start_coordinates = os.system("cat indices_trial | awk '{print $3}'")
end_coordinates = os.system("cat indices_trial | awk '{print $5}'")

results_file = raw_input('Enter a new filename to save your results:')
outfile = open(results_file, 'w')

### Now to tell Python to extract those sequences whose coordinates were not specified in 'start_coordinates' & 'end_coordinates' list                                                                     
for line in SeqIO.parse(fasta_file, "fasta"):
    if line.seq != line.seq["%i-1:%i " % start_coordinates, end_coordinates]:
        print 'Unpredicted gene:' , line.seq , ',' , 'Sequence length =' , len(line.seq)
    else:
        print 'Not found.'

Can some python programmer out there help me?

Rosary

fasta python position nucleotide • 5.3k views
ADD COMMENTlink modified 5.1 years ago by Biostar ♦♦ 20 • written 5.9 years ago by rosarylimyt70

if it were any help, here's the execution and the error message that follows:

1 5

4 10

Enter a new filename to save your results:trial_results Traceback (most recent call last): File "extract_UNpredicted_genes.py", line 17, in <module> if line.seq != line.seq["%i:%i + 1" % start_coordinates, end_coordinates]: TypeError: not enough arguments for format string

ADD REPLYlink written 5.9 years ago by rosarylimyt70

Hi, please show us a few sample lines of infile1.

ADD REPLYlink modified 5.9 years ago • written 5.9 years ago by a.zielezinski8.6k

Hi,

An example of infile1 would be;

Scaffold2 TAAAATGTGAGTTTTTAAATTACTGCGTGCCTCTGGGTAATAATCTCCCCAAAGTAATTC CGATAAATCTTCATAGCTAACCGTTTTATCATGGTTTTTCATGAGCATATAAATAATGCG ACCTTCACTTACGGTGAGATTTACCAGTGTGCCATTTATCCATAGTTCTCTAATTGATGA ACCAAAATGCATTTTACCACAAGTGATGGATAGGTCCTCACCGTTTATTTGCTTTTTAGT TAATAATCGCCTGACTCTGGCTAGTAATTCCATTTGGCGGAACGGTTTAATAATATAATC

followed by scaffold 3,4,5 and so on, they are basically fasta nucleotide sequences.

Thanks!

ADD REPLYlink written 5.9 years ago by rosarylimyt70
3
gravatar for David W
5.9 years ago by
David W4.7k
New Zealand
David W4.7k wrote:

The error message is telling you that the %-style string formatting isn't working. I think if you wrap your coordinates in parentheses you won't get that error anymore.

BUT...

If I can grok what you are trying to do you'll get a bunch of new errors instead!

The % synatx is for creating strings with the arguements inserted. But you want to use the integers themselves to slice your sequence line.seq[start_coordinates:end_coordinates] would do the trick (be aware python uses 0-based indices). Also, if you are just learning python it's porbably a good idea to learn the newer .format()-style string formatting.

It's unusual to all awk to read your indices. You can open a file with open(filename), move through the lines with a for loop , and strip a tab-delimited line into it's elements with line.strip('\t'). Once you have a list of indices you'll need to iterate through the list to get each combination (not, as I think you are trying to do now, use the whole list to do your squence-slices).

Generally, the best way to get to grips with Python and Biopython is (a) read a good tutorial (b) open an interactive prompt and play around a bit. Once you are conformatable with slicing sequences and handling file I/O and the like you'll find building your script much easier.

ADD COMMENTlink modified 5.9 years ago • written 5.9 years ago by David W4.7k

Hey David,

yes I admit this script of mine is indeed clumsy..I'm trying to work on one which you suggested, thanks for the advice!=]

Rosary

ADD REPLYlink written 5.9 years ago by rosarylimyt70
2
gravatar for Jelena Aleksic
5.9 years ago by
Cambridge, UK
Jelena Aleksic900 wrote:

A separate word of caution unrelated to the script itself (and apologies if you know this already - I just thought it was important to mention): some file formats have 0-based coordinates, while others have 1-based ones, and it's pretty random which file format uses which (e.g. BED files are 0-based while GGF files are 1-based). So in order to extract the correct coordinates, it's important to know which format the coordinates from your input file are in. I quote the following nice explalation from the beginning of the SAM file specifications (http://samtools.sourceforge.net/SAM1.pdf):

1-based coordinate system: A coordinate system where the first base of a sequence is one. In this coordinate system, a region is specified by a closed interval. For example, the region between the 3rd and the 7th bases inclusive is [3, 7]. The SAM, GFF and Wiggle formats are using the 1-based coordinate system.

0-based coordinate system: A coordinate system where the first base of a sequence is zero. In this coordinate system, a region is specified by a half-closed-half-open interval. For example, the region between the 3rd and the 7th bases inclusive is [2, 7). The BAM, BED, and PSL formats are using the 0-based coordinate system.

ADD COMMENTlink written 5.9 years ago by Jelena Aleksic900
1

Hey Jelena,

Thanks for the advice! I didn't know of such a thing before =] Really appreciate it.

Rosary

ADD REPLYlink written 5.9 years ago by rosarylimyt70
2
gravatar for Fabio Marroni
5.9 years ago by
Fabio Marroni2.1k
Italy
Fabio Marroni2.1k wrote:

I suggest you use pyfasta. Follow the link, it also has some very simple examples. It's the first python command I used and it was really easy. https://pypi.python.org/pypi/pyfasta/

ADD COMMENTlink written 5.9 years ago by Fabio Marroni2.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 920 users visited in the last hour