How do I fix a paradox/casual loop in a program splitting a FASTA file into multiple files?
1
0
Entering edit mode
8.2 years ago

I created a file that takes a GFF3 file and a FASTA genome file and writes each exon from the GFF3 file into a new file and each gene group of exons into a new CDS file. The problem is that the exons write to a separate file but the CDS files have only a header and no CDS sequence.

I suspect that the problem begins at line

   if Gene_Name != previous:

when I try to see whether the parent gene of the exon is the same as the previous exon. If it's not, then the current CDS file is written and closed before opening a new CDS file.

Any suggestions on fixing the if else clause beginning at

   if Gene_Name != previous:

I think it's because I set the value previous inside the scope of the program.

How do I fix this paradox/casual loop in a program splitting a file into multiple files?

Input files:

ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.49_FB2013_01/fasta/dmel-all-chromosome-r5.49.fasta.gz This is the FASTA file

ftp://ftp.ensemblgenomes.org/pub/metazoa/release-32/gff3/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.32.chromosome.2L.gff3.gz This is the gff3 file.


I have a gff3 file and a FASTA genome file

The gff3 file is like this:

20 protein2genome exon 12005 12107 . - . ID=Blah_exon:1;Name=Blah:1;Parent=Blah

20 protein2genome exon 12108 12200 . - . ID=Blah_exon:2;Name=Blah:1;Parent=Blah

20 protein2genome exon 12005 12107 . - . ID=Blah_exon:3;Name=Blah:1;Parent=Blah

20 protein2genome exon 12342 12542 . - . ID=Blah_exon:4;Name=Blah:1;Parent=Blah

20 protein2genome exon 13005 13107 . - . ID=ABC_exon:1;Name=ABC:1;Parent=ABC

For each exon, I need to write a new file named gene_name_exon_number.fa

The header inside the file is this:

gene_name exon_number chromosome_name exon_start exon_end

followed by the nucleotide bases in its chromosome using the start and end numbers

So the first exon file would look like:

Blah 1 20 12005 12107 ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT

Each exon will be a new file.

Also, I need to create a CDS file for each gene containing all the exons named gene_name_cds.fa

The header inside the file is this:

gene_name number_of_exons chromosome_name cds_start cds_end cds_total_length

So the first CDS file would look like:

Blah 4 20 12005 12542 ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT

Note: the nucleotide bases are just all of the nucleotide bases from the exons pasted together.

In the end, there will be several exon files and one CDS file for EACH gene in a GFF3 file containing several hundred genes and thousands of exons.

For the exon, I need this:

gene_name exon_number chromosome_name exon_start exon_end

For the CDS, I need this:

gene_name number_of_exons chromosome_name cds_start cds_end cds_total_length

cds_total_length should be cds_end-cds_start

from collections import defaultdict
from Bio import SeqIO

Exon_file = "exon.gff3"
FASTA_file = "melanogaster.fasta"

handle = open(FASTA_file, "rU")
FASTA_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
handle.close()

Exon_dict = {}
with open(Exon_file) as f:
    for line in f:
        Exon_dict[line.split()[-1].rsplit(";")[0].replace(":","")[3:]] = line.split()

previous = ""
CDS_sequence = ""
CDS_number_of_exons = int(0)
CDS_start = int(0)
CDS_end = int(0)
CDS_file = open (str("testfolder/" + "DUMMY" +"_CDS" + ".fas"),"w+")

for key, value in Exon_dict.iteritems():
    for k,v in FASTA_dict.iteritems():
        chromosome = value[0]
        direction = value[6]
        Gene_Name = value[-1].split()[0].rsplit(';')[0][3:].rsplit("_")[0]
        if chromosome == k:
            start = value[3]
            end = value[4]
            exon_file = open (str("testfolder/" + key + ".fas"),"w+")
            exon_file.write(">" + key + " " + k + " " + start + " " + end + "\n")
            if direction == "+":
                exon_file.write(str(v.seq[int(int(start)-1):(int(end))]))
                CDS_sequence += (str(v.seq[int(int(start)-1):(int(end))]))
            elif direction == "-":
                exon_file.write(str(v.seq[int(int(start)-1):(int(end))].reverse_complement()))
                CDS_sequence += (str(v.seq[int(int(start)-1):(int(end))].reverse_complement()))

        if Gene_Name != previous:
            CDS_file.write(">" + str(Gene_Name) + " " + str(CDS_number_of_exons) + " " + str(chromosome) + " " + str(CDS_start) + " " + str(CDS_end) + " " + str(int(CDS_end)-int(CDS_start)) + "\n")
            CDS_file.write(CDS_sequence)
            CDS_file.close()
            CDS_file = open (str("testfolder/" + Gene_Name +"_CDS" + ".fas"),"w+")
            previous = Gene_Name
            CDS_sequence = ""
            CDS_number_of_exons = int(0)
            CDS_start = int(0)
            CDS_end = int(0)
        else:
            CDS_number_of_exons += 1
            CDS_end += int(end)
            if start > CDS_start:
                pass
            else:
                CDS_start = start
annotation fasta sequence python genome • 2.5k views
ADD COMMENT
1
Entering edit mode
8.2 years ago

I take it you're new to python. Instead of CDS_sequence.__add__(something), you want CDS_sequence += something. As it stands, you never append anything to CDS_sequence.

ADD COMMENT
0
Entering edit mode

BTW, the for k,v in FASTA_dict.iteritems(): loop is completely unnecessary. You made a dictionary, use it as such.

ADD REPLY
0
Entering edit mode

Thanks! += fixed the CDS_sequence. However, I'm still having trouble with the count for CDS_start, CDS_end, and CDS_number_of_exons. Any suggestions? Also, why do you say this?

BTW, the for k,v in FASTA_dict.iteritems(): loop is completely unnecessary. You made a dictionary, use it as such.

ADD REPLY
0
Entering edit mode

It's completely unclear from your code exactly what NMID is getting set to, so put some print("NMID {}, previous {}").format(NMID, previous)) in there.

Regarding the loop, you already know what the chromosome is, so:

for key, value in Exon_dict.iteritems():
    chromosome = value[0]
    direction = value[6]
    v = FASTA_dict[chromosome]

Also, delete all instances of the following:

else:
    pass

Those are just wasting CPU.

ADD REPLY
0
Entering edit mode

Thanks! NMID is the name of a parent gene. I've updated the code to reflect some edits and deleted all instance of.

else:
pass

I'll look at the for loop you suggested. The exon loop isn't the problem since it prints all of the bases. It's these lines:

if Gene_Name != previous:
                CDS_file.write(">" + str(Gene_Name) + " " + str(CDS_number_of_exons) + " " + str(chromosome) + " " + str(CDS_start) + " " + str(CDS_end) + " " + str(int(CDS_end)-int(CDS_start)) + "\n")
                CDS_file.write(CDS_sequence)
                CDS_file.close()
                CDS_file = open (str("testfolder/" + Gene_Name +"_CDS" + ".fas"),"w+")
                previous = Gene_Name
                CDS_sequence = ""
                CDS_number_of_exons = int(0)
                CDS_start = int(0)
                CDS_end = int(0)
            else:
                CDS_number_of_exons += 1
                CDS_end += int(end)
                if start > CDS_start:
                    pass
                else:
                    CDS_start = start
ADD REPLY

Login before adding your answer.

Traffic: 1382 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6