Extraction Of Fasta Sequence By Nucleotide Position
3
2
Entering edit mode
11.0 years ago
rosarylimyt ▴ 70

Hi all,

I'm relatively new in Python and I need help in extraction of FASTA sequences using separate lists of start and end coordinates. Here are my codes below:

import sys
import os
from Bio import SeqIO

infile1 = sys.argv[1] # your original FASTA file to extract sequences from                                                                                                                               
fasta_file = open(infile1)

### To extract your set of start and end coordinates                                                                                                                                                       
start_coordinates = os.system("cat indices_trial | awk '{print $3}'")
end_coordinates = os.system("cat indices_trial | awk '{print $5}'")

results_file = raw_input('Enter a new filename to save your results:')
outfile = open(results_file, 'w')

### Now to tell Python to extract those sequences whose coordinates were not specified in 'start_coordinates' & 'end_coordinates' list                                                                     
for line in SeqIO.parse(fasta_file, "fasta"):
    if line.seq != line.seq["%i-1:%i " % start_coordinates, end_coordinates]:
        print 'Unpredicted gene:' , line.seq , ',' , 'Sequence length =' , len(line.seq)
    else:
        print 'Not found.'

Can some python programmer out there help me?

Rosary

fasta python nucleotide position • 8.1k views
ADD COMMENT
0
Entering edit mode

if it were any help, here's the execution and the error message that follows:

1 5

4 10

Enter a new filename to save your results:trial_results Traceback (most recent call last): File "extract_UNpredicted_genes.py", line 17, in <module> if line.seq != line.seq["%i:%i + 1" % start_coordinates, end_coordinates]: TypeError: not enough arguments for format string

ADD REPLY
0
Entering edit mode

Hi, please show us a few sample lines of infile1.

ADD REPLY
0
Entering edit mode

Hi,

An example of infile1 would be;

Scaffold2 TAAAATGTGAGTTTTTAAATTACTGCGTGCCTCTGGGTAATAATCTCCCCAAAGTAATTC CGATAAATCTTCATAGCTAACCGTTTTATCATGGTTTTTCATGAGCATATAAATAATGCG ACCTTCACTTACGGTGAGATTTACCAGTGTGCCATTTATCCATAGTTCTCTAATTGATGA ACCAAAATGCATTTTACCACAAGTGATGGATAGGTCCTCACCGTTTATTTGCTTTTTAGT TAATAATCGCCTGACTCTGGCTAGTAATTCCATTTGGCGGAACGGTTTAATAATATAATC

followed by scaffold 3,4,5 and so on, they are basically fasta nucleotide sequences.

Thanks!

ADD REPLY
3
Entering edit mode
11.0 years ago
David W 4.9k

The error message is telling you that the %-style string formatting isn't working. I think if you wrap your coordinates in parentheses you won't get that error anymore.

BUT...

If I can grok what you are trying to do you'll get a bunch of new errors instead!

The % synatx is for creating strings with the arguements inserted. But you want to use the integers themselves to slice your sequence line.seq[start_coordinates:end_coordinates] would do the trick (be aware python uses 0-based indices). Also, if you are just learning python it's porbably a good idea to learn the newer .format()-style string formatting.

It's unusual to all awk to read your indices. You can open a file with open(filename), move through the lines with a for loop , and strip a tab-delimited line into it's elements with line.strip('\t'). Once you have a list of indices you'll need to iterate through the list to get each combination (not, as I think you are trying to do now, use the whole list to do your squence-slices).

Generally, the best way to get to grips with Python and Biopython is (a) read a good tutorial (b) open an interactive prompt and play around a bit. Once you are conformatable with slicing sequences and handling file I/O and the like you'll find building your script much easier.

ADD COMMENT
0
Entering edit mode

Hey David,

yes I admit this script of mine is indeed clumsy..I'm trying to work on one which you suggested, thanks for the advice!=]

Rosary

ADD REPLY
2
Entering edit mode
11.0 years ago

A separate word of caution unrelated to the script itself (and apologies if you know this already - I just thought it was important to mention): some file formats have 0-based coordinates, while others have 1-based ones, and it's pretty random which file format uses which (e.g. BED files are 0-based while GGF files are 1-based). So in order to extract the correct coordinates, it's important to know which format the coordinates from your input file are in. I quote the following nice explalation from the beginning of the SAM file specifications (http://samtools.sourceforge.net/SAM1.pdf):

1-based coordinate system: A coordinate system where the first base of a sequence is one. In this coordinate system, a region is specified by a closed interval. For example, the region between the 3rd and the 7th bases inclusive is [3, 7]. The SAM, GFF and Wiggle formats are using the 1-based coordinate system.

0-based coordinate system: A coordinate system where the first base of a sequence is zero. In this coordinate system, a region is specified by a half-closed-half-open interval. For example, the region between the 3rd and the 7th bases inclusive is [2, 7). The BAM, BED, and PSL formats are using the 0-based coordinate system.

ADD COMMENT
1
Entering edit mode

Hey Jelena,

Thanks for the advice! I didn't know of such a thing before =] Really appreciate it.

Rosary

ADD REPLY
2
Entering edit mode
11.0 years ago
Fabio Marroni ★ 3.0k

I suggest you use pyfasta. Follow the link, it also has some very simple examples. It's the first python command I used and it was really easy. https://pypi.python.org/pypi/pyfasta/

ADD COMMENT

Login before adding your answer.

Traffic: 2387 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6