How can I separate a FASTA file into multiple exon and CDS files?
1
2
Entering edit mode
7.6 years ago

NOTE: I already tried using bedtools and bcbio-gff but they were not useful

I have a gff3 file and a FASTA genome file

The gff3 file is like this:

20 protein2genome exon 12005 12107 . - . ID=Blah_exon:1;Name=Blah:1;Parent=Blah

20 protein2genome exon 12108 12200 . - . ID=Blah_exon:2;Name=Blah:1;Parent=Blah

20 protein2genome exon 12005 12107 . - . ID=Blah_exon:3;Name=Blah:1;Parent=Blah

20 protein2genome exon 12342 12542 . - . ID=Blah_exon:4;Name=Blah:1;Parent=Blah

20 protein2genome exon 13005 13107 . - . ID=ABC_exon:1;Name=ABC:1;Parent=ABC

For each exon, I need to write a new file named gene_name_exon_number.fa

The header inside the file is this:

gene_name exon_number chromosome_name exon_start exon_end

followed by the nucleotide bases in its chromosome using the start and end numbers

So the first exon file would look like:

Blah 1 20 12005 12107 ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT

Each exon will be a new file.

Also, I need to create a CDS file for each gene containing all the exons named gene_name_cds.fa

The header inside the file is this:

gene_name number_of_exons chromosome_name cds_start cds_end cds_total_length

So the first CDS file would look like:

Blah 4 20 12005 12542 ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT

Note: the nucleotide bases are just all of the nucleotide bases from the exons pasted together.

In the end, there will be several exon files and one CDS file for EACH gene in a GFF3 file containing several hundred genes and thousands of exons.

However, the file is generating CDS files in the gigabtye range when the entire genome file is only 130mb. So the problem must be for loop order. Can someone help me fix the loop order?

I tried to use several for loops and prepender. My code looks like this so far:

import fileinput

input_file = "dummy.gf3"
search_file = "dummy.fa"

def line_pre_adder(filename, line_to_prepend):
    f = fileinput.input(filename, inplace=1)
    for xline in f:
        if f.isfirstline():
            print line_to_prepend.rstrip('\r\n') + '\n' + xline,
        else:
           print xline,

def exon_extractor():

exon_file = open(input_file)
for exon_line in exon_file:
    previous = ""
    CDS_exon_number = 1
    CDS_start = int(0)
    CDS_end = int(0)
    NMID = exon_line.split()[-1].split()[0].rsplit(';')[0][3:].rsplit("_")[0]
    exon_number = str(exon_line.split()[-1].split()[0].rsplit(';')[0][-1])
    filename = str(exon_line.split()[-1].split()[0].rsplit(';')[0][3:-2]) + "_" + str(exon_line.split()[-1].split()[0].rsplit(';')[0][-1])
    chromosome = exon_line.split()[0]
    start = exon_line.split()[3]
    end = str(int(exon_line.split()[4])+1)
    exon_header = ">" + NMID + " " + exon_number + " " + start + " " + str(int(end)-1)
    exon_file = open (str("exon_folder" + filename + ".fas"),"w+")
    cds_file = open (str("test_folder" + NMID +"_CDS" + ".fas"),"a+")
    exon_file.write(exon_header + '\n')
    desired = int(end) - int(start)
    with open (search_file,'r') as genome_file:         
        for genome_line in genome_file:
            if str(chromosome + " ") in genome_line:
                for genome_line in genome_file:
                    for i in range (int(desired)/60):
                        exon_file.write(genome_line.rstrip())

                        if NMID != previous:
                            cds_file.close()
                            line_pre_adder(str("test_folder" + NMID +"_CDS" + ".fas"), str(NMID + " " + str(CDS_exon_number) + " " + str(CDS_start) + " " + str(CDS_end) + " " + str(CDS_end-CDS_start)))
                            cds_file = open (str("test_folder" + NMID +"_CDS" + ".fas"),"a+")
                            cds_file.write(genome_line.rstrip())
                            previous = NMID
                            CDS_exon_number = exon_number
                            CDS_start += int(start)
                            CDS_end += (int(end)-1)
                        else:
                            cds_file.write(genome_line.rstrip())
                            CDS_exon_number = exon_number
                            CDS_start += int(start)
                            CDS_end += (int(end)-1)

                    if (int(desired) % 60) != 0:
                        exon_file.write(genome_line[0:(int(desired) % 60)+1].rstrip())
                        cds_file.write(genome_line[0:(int(desired) % 60)+1].rstrip())
                        break

                    else:
                        pass
            else:
                pass
python genome fasta annotation sequence • 2.9k views
ADD COMMENT
2
Entering edit mode

I already tried using bedtools and bcbio-gff but they were not useful

Not useful in what way? At first glance bedtools getfasta should do what you need. You can pull out (and split) sequences you need after you get all the intervals out first (many threads about this request on Biostars).

ADD REPLY
0
Entering edit mode

The problem is that I got only one file like this which only contains the exons. I still need another file which contains the CDS for each gene and the headers were wrong for the file. Also, I couldn't split the FASTA file using the "split" command in bedtools.

If I could get two files, one with the exons and one with the CDS, and each file had the correct headers, then bedtools would be useful.

The exon file looked like this and the headers did not include the gene name so they weren't in the format: gene_name exon_number chromosome_name exon_start exon_end

Blah 1 20 12005 12107 
>Chr20:12005-12107
ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT
>Chr12:13005-13107
ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT
ADD REPLY
0
Entering edit mode

Just some remarks, no complete solution:

  • I would suggest writing two scripts: one for the exons and one for the gene. Might take a bit longer but lot's easier. I suppose you need to do this job only once and not daily from now on
  • You don't need an else clause with pass, you can remove that. Just an if clause works fine
ADD REPLY
0
Entering edit mode
CDS_exon_number = 1
CDS_start = int(0)
CDS_end = int(0)
  

I heard this code has such an interesting backstory they are thinking about making a cinematic adaptation.

ADD REPLY
7
Entering edit mode
7.6 years ago
glihm ▴ 660

You are uselessly adding complexity to your code and algorithm.

Re-organizing your code with functions dedicated to ONE thing can make your life easier (and help you to write a readable code) !

Are you limited with the RAM ? Because you are opening your fasta file for EACH exons. I recommend you to charge in memory your fasta file and store for each chromosome it's sequence. Like this, it will be easier to have an access to your exon sequence like this:

exon_seq = fasta_dictionary[ chromosome ][ inf_limit : sup_limit ]

An other thing, you are looping twice on your fasta file:

with open (search_file,'r') as genome_file:         
    for genome_line in genome_file:
        if str(chromosome + " ") in genome_line:
            for genome_line in genome_file:   <=== ???
  

If I understand this part of your code:

Opening the fasta file with the genome:
    For each line of my file:
        If the line contains my chromosome identifier (I recommend you to use REGEX here):
             Newly, for each line of my file...? Why ?

I think your problem in the loop comes form this point.

And other time, try to load your fasta file in memory, and then doing the stuff for each exons completing your CDS at the same time. If you are not limited with the RAM of your machine, try to think differently to make things easier:

0) Load your fasta in memory (1 function)

1) A global dictionary to store the exons of genes.

2) For each exon of your exon file, store the information you need in your global dictionary. (1 function, called when you are looping on your exon file)

3) For each exon processed, append to your CDS variable all you need. (in the same function of the previous point)

4) Finally, for each gene detected, write your results. (1 function)

If these comments were not helpful, feel free to post a commentary and we will check it again !

Best of luck !

ADD COMMENT
0
Entering edit mode

Thanks! This has been very helpful. Loading the FASTA file in memory is a smart idea. I will work on this.

Here are the comments that I made on the looping twice section in case you needed to see those.

    with open (search_file,'r') as genome_file:         
        for genome_line in genome_file:
            # This for loop navigates to the chromosome of a given exon in the genome file
            if str(chromosome + " ") in genome_line:
                # print genome_line.rstrip()
                for genome_line in genome_file:
                    # This for loop prints the nucleotide bases in the given exon
ADD REPLY

Login before adding your answer.

Traffic: 2227 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6