Question

sequence extract with python script

0

Entering edit mode

5.7 years ago

raveendars • 0

Hi I want to know how to run the python script which available here https://github.com/mawad89/sequence-extract

I run this script with fasta and gb file as mentioned with number (1.fasta.fa, 2.Genbank.gb) I get syntax error like

File "Sequence Extract.py" line 274
pm=pm+1
SyntaxError: invalid syntax

Thank you

Ravi

genome sequence software error • 3.0k views

ADD COMMENT • link updated 5.7 years ago by finswimmer 16k • written 5.7 years ago by raveendars • 0

2

Entering edit mode

I would run fast and far from that script, it’s a mess. Looking at the Github page, much of the indentation appears to be messed up, and it uses a lot of outdated python syntax. This task can probably be achieved in under 100 lines with BioPython.

Please just edit your question to describe the task you want to achieve other than “make this script run”, and someone may have an even better solution.

ADD REPLY • link 5.7 years ago by Joe 21k

0

Entering edit mode

Example fasta would have helped addressing the issue @ raveendars

ADD REPLY • link 5.7 years ago by cpad0112 21k

0

Entering edit mode

Hi Thank you kindly look at the article https://doi.org/10.1016/j.compbiolchem.2017.09.003

ADD REPLY • link 5.7 years ago by raveendars • 0

0

Entering edit mode

If there is no specific reason to use this script, then please explain what do you want to achieve at the end and what files you already have for that. Additionally, we want to know your inputs for the same i.e. what efforts did you put in?

ADD REPLY • link 5.7 years ago by lakhujanivijay 5.8k

0

Entering edit mode

Hi Thank you for your reply I would like to use the method mentioned in the following article https://doi.org/10.1016/j.compbiolchem.2017.09.003

In siligo analysis with chloroplast genome to differentiate Triticum species.
I would like to extract the intergenic region from the chloroplast genome for that I have downloaded the fasta and gb file from NCBI. So, I followed the instruction as the author mentioned but got the error while running with windows python 2.7

ADD REPLY • link 5.7 years ago by raveendars • 0

score 3 · Answer 1 · 2018-08-02

Hello again,

based on the task you described in this post here is a workflow that could help.

The following python script takes the genebank file as input and creates these 3 files:

a fasta file with the genome sequence
a bed file with the coordinates of the genes
a dummy bed file for the whole genome, which we use later for getting the intergenic sequences

Run the script like this:

$ python extract_files.py KC912694.gb

extract_files.py looks like this:

import sys
import os
from Bio import SeqIO


def main(input):
    file_name = os.path.basename(input)
    outname = os.path.splitext(file_name)[0]

    SeqIO.convert(input, "genbank", outname+".fasta", "fasta")

    for record in SeqIO.parse(input, "genbank"):
        record_name = record.annotations["accessions"][0]+"."+str(record.annotations["sequence_version"])

        with open(record_name+".bed", "w") as bed, open("genome.bed", "w") as genome:
            for feature in record.features:
                if feature.type == "source":
                    start = feature.location.start.position
                    stop = feature.location.end.position

                    bed_line = "{0}\t{1}\t{2}\n".format(record_name, start, stop)
                    genome.write(bed_line)
                if feature.type == 'gene':
                    start = feature.location.start.position
                    stop = feature.location.end.position
                    try:
                        name = feature.qualifiers['gene'][0]
                    except:
                        name = feature.qualifiers['locus_tag'][0]

                    bed_line = "{0}\t{1}\t{2}\t{3}\n".format(record_name, start, stop, name)
                    bed.write(bed_line)


if __name__ == '__main__':
    main(sys.argv[1])

To get the sequence of the genes we can use bedtools getfasta:

$ bedtools getfasta -fi KC912694.fasta -bed KC912694.1.bed -fo genes.fa

To get the intergenic sequences we have to create a bed file containing the regions

$ bedtools subtract -a genome.bed -b KC912694.1.bed > intergenic.bed

Use bedtools getfasta again to get intergenic sequences:

$ bedtools getfasta -fi KC912694.fasta -bed intergenic.bed -fo intergenic.fa

fin swimmer

score 2 · Answer 2 · 2018-08-02

2

Entering edit mode

5.7 years ago

finswimmer 16k

Hello,

the problem is the try block. If using try python expect also an except.

Add this before pm=pm+1:

except:
    pass

fin swimmer

ADD COMMENT • link 5.7 years ago by finswimmer 16k

0

Entering edit mode

Hi Thank you for the reply but I could not solve the issue as its simply return kindly look at the article https://doi.org/10.1016/j.compbiolchem.2017.09.003 Thank you

ADD REPLY • link 5.7 years ago by raveendars • 0

0

Entering edit mode

Hello raveendars,

Hi Thank you for the reply but I could not solve the issue as its simply return

is there something missing in this sentence? It isn't clear to me what you like to say. Please rephrase.

I have no access to this paper. But what I would suggest to you:

Describe what you like to do and give us examples of your input data and desired output
Don't use the script you've linked here. It's looking very ugly to me.

fin swimmer

ADD REPLY • link 5.7 years ago by finswimmer 16k

0

Entering edit mode

Hello swimmer I would like to do in siligo analysis with chloroplast (cp) genome to differentiate Triticum species. I would like to extract the gene and intergenic position separately into fasta file from the chloroplast genome available in NCBI https://www.ncbi.nlm.nih.gov/nuccore/KC912694 perform sequence alignment (MSA) to find snp to discriminates species

Thank you

ADD REPLY • link 5.7 years ago by raveendars • 0