Question: How to extract upstream and downstream sequence of a set of genes using python
gravatar for t.reza
3 months ago by
t.reza0 wrote:

I want to extract the upstream and downstream sequence (for example 1000 bp up and 1000 bp down) of a set of genes (chromosome number with start and end position). Could you please suggest me a suitable package or module?

Thanks in advance

ADD COMMENTlink modified 3 months ago by A. Domingues2.2k • written 3 months ago by t.reza0

How are you defining their genes? Just by coordinate?

This is reasonably easy to do with biopython if you want to code something. Otherwise, I think seqkit can do slicing/subsetting.

ADD REPLYlink written 3 months ago by Joe16k

I have the locus locations of each gene. For example: chr1:7990339-7990408 I would like to do it with biopython. Could you suggest any built-in module available?

ADD REPLYlink written 3 months ago by t.reza0

There's no such built in module, but the syntax is quite easy.

If you take a look at my code here for example:

That script is specifically for slicing genbank files, as slicing the record preserves the format, but you can apply exactly the same logic to fastas or whatever data you have.

Essentially you need to do something like (pseudocode):

from Bio import SeqIO

for record in SeqIO.parse('infile.ext', 'format'):
    if record.description.startswith('chr1'):  # test which record you're using
        subrecord = record[start - upstream_offset: end + downstream_offset]

This is made more complicated if your file has all the chromosomes in it as you'll need to check all of the records for your chromosome index, then slice out the right bit if its the right sequence, but at its simplest that can be done with one or two loops.

ADD REPLYlink modified 3 months ago • written 3 months ago by Joe16k

I recently did the same work. And the lesson I learned is get the fasta and annotation files from the SAME source (eg, gencode). Do not fully trust the version number... And even though, I found the sequences I extracted from some genes are not completely consistent with the annotation from NCBI genebank files (sometimes they may have couple bp shift, and I don't know why...).

I believe there are some tools could do this extraction work (seqkit, bedtools?), although I coded it by myself.

ADD REPLYlink written 3 months ago by shoujun.gu280
gravatar for A. Domingues
3 months ago by
A. Domingues2.2k
Dresden, Germany
A. Domingues2.2k wrote:

Edit: this answer presumes you will want to do this in python as suggested by the tags.

pybedtools, which is a python implementation of bedtools should do the trick. You will need a combination of:

  1. flank to get the up and downstram location in relation to the inut coordinates,

  2. sequence to retrieve the actual sequences,

Bonus: there is a five_prime, and three_prime function to retrieve the start and end coordinates. It might be useful in case you want to split the up and downstream sequences.

ADD COMMENTlink modified 3 months ago • written 3 months ago by A. Domingues2.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1307 users visited in the last hour