biopython: fetching a fasta sequence from coordinates
2
0
Entering edit mode
7.0 years ago

I have information of coordinates some genes in the below given form and I intend to use the genome sequence file to get this coordinate using Biopython. The result should be in form of fasta file i.e >Bra000001 .......

Bra000001       A03   8794511   8797095
Bra000002       A03   8797899   8800379
Bra000003       A03   8812470   8813271
Bra000004       A03   8816858   8817232
Bra000005       A03   8818081   8820123
Bra000006       A03   8820739   8821050
Bra000007       A03   8823497   8824207
Bra000008       A03   8829308   8830496
Bra000009       A03   8832800   8836134

Kindly help!

sequence • 5.7k views
ADD COMMENT
0
Entering edit mode

You should give more information:

  • Code you have tried, you will probably receive more help adding it... people try to help but you should not expect others doing all the work.
  • Is it a circular genome? (the data treatment would be different)
  • In which format is the genome? (FASTA, genbank...)
  • Your expected output is confusing... you could need all the ids one after other or a fasta file like:
    >h
    AGACTTTAGATC
    >j
    GAGGATTXGATG
    
  • Always look for people with similar problems: How To Use Coordinates In Order To Extract Sequences In Fasta File? (here you have your problem solved in the first answer, but not using biopython and from a genome in bed format).
ADD REPLY
3
Entering edit mode
7.0 years ago
A. Domingues ★ 2.6k

Have you looked into pybedtools? Not exactly biopython, but if that is the only task you are trying to achieve it will do it very efficiently. Here is their example:

a = pybedtools.BedTool("""
chr1 1 10
chr1 50 55""", from_string=True)
fasta = pybedtools.example_filename('test.fa')
a = a.sequence(fi=fasta)
print(open(a.seqfn).read())
chr1:1-10
GATGAGTCT
chr1:50-55
CCATC

This can be easily changed to read your bed file and fasta:

import pybedtools
from pybedtools import BedTool
a = BedTool("path_to_coordinates.bed")
fasta = BedTool("path_to_sequence.fasta")
a = a.sequence(fi=fasta)

print(open(a.seqfn).read())

It's untested, but it should work with minor changes.

ADD COMMENT
0
Entering edit mode
7.0 years ago
James Ashmore ★ 3.3k

Try using bedtools getfasta command. If takes as input the FASTA file of sequences and a tab-separated value file of the coordinates you want to retrieve. If you absolutely need to do this in Python, you can always use Python's subproces module to externally call the command and capture the output.

ADD COMMENT
0
Entering edit mode

The pybedtools package already wraps most of bedtools functions, so in most cases there is no need to use subproces.

ADD REPLY

Login before adding your answer.

Traffic: 2141 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6