Python Script that outputs axons from gff file to a new file
1
0
Entering edit mode
21 months ago
Michal Nevo ▴ 130

I have a gff file, I want to write a script that outputs axons only to a new file. I wonder what is the most efficient way to do so. I need to iterate over the input file and split the lines into blocks starting with “gene” [one block includes all lines after 'gene' and ends one line before the next 'gene'].
For each block: I need to check if it contains lines with “mRNA”. if not continue to the next block. For every block that starts with “mRNA”, I need to write the lines that contain “exon” to the output file.

Is there a good package in python that can help me? How do I create the blocks?

This is the gff file:

NC_048323.1     Gnomon  gene    25044   78977   .       +       .       ID=gene                                               -LOC117420859;Dbxref=GeneID:117420859;Name=LOC117420859;gbkey=Gene;gene=LOC1174                                               20859;gene_biotype=protein_coding
NC_048323.1     Gnomon  mRNA    25044   78977   .       +       .       ID=rna-                                               XM_034926345.1;Parent=gene-LOC117420859;Dbxref=GeneID:117420859,Genbank:XM_0349                                               26345.1;Name=XM_034926345.1;gbkey=mRNA;gene=LOC117420859;model_evidence=Support                                               ing evidence includes similarity to: 2 Proteins%2C and 93%25 coverage of the an                                               notated genomic feature by RNAseq alignments;product=coiled-coil domain-contain                                               ing protein 171-like;transcript_id=XM_034926345.1
NC_048323.1     Gnomon  exon    25044   25136   .       +       .       ID=exon                                               -XM_034926345.1-1;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM_                                               034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing                                                protein 171-like;transcript_id=XM_034926345.1
NC_048323.1     Gnomon  exon    25929   26031   .       +       .       ID=exon                                               -XM_034926345.1-2;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM_                                               034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing                                                protein 171-like;transcript_id=XM_034926345.1
NC_048323.1     Gnomon  exon    34873   34976   .       +       .       ID=exon                                               -XM_034926345.1-3;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM_                                               034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing                                                protein 171-like;transcript_id=XM_034926345.1
NC_048323.1     Gnomon  exon    37369   37437   .       +       .       ID=exon                                               -XM_034926345.1-4;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM_                                               034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing                                                protein 171-like;transcript_id=XM_034926345.1
NC_048323.1     Gnomon  exon    37932   38132   .       +       .       ID=exon                                               -XM_034926345.1-5;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM_                                               034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing                                                protein 171-like;transcript_id=XM_034926345.1
NC_048323.1     Gnomon  exon    56841   57008   .       +       .       ID=exon                                               -XM_034926345.1-6;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM_                                               034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing                                                protein 171-like;transcript_id=XM_034926345.1
NC_048323.1     Gnomon  exon    62592   62780   .       +       .       ID=exon                                               -XM_034926345.1-7;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM_                                               034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing                                                protein 171-like;transcript_id=XM_034926345.1
NC_048323.1     Gnomon  exon    64485   64992   .       +       .       ID=exon                                               -XM_034926345.1-8;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM_                                               034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing                                                protein 171-like;transcript_id=XM_034926345.1
NC_048323.1     Gnomon  exon    65613   65729   .       +       .       ID=exon                                               -XM_034926345.1-9;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM_                                               034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing                                                protein 171-like;transcript_id=XM_034926345.1
NC_048323.1     Gnomon  exon    72109   72335   .       +       .       ID=exon                                               -XM_034926345.1-10;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM                                               _034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing                                                protein 171-like;transcript_id=XM_034926345.1
NC_048323.1     Gnomon  exon    75873   76055   .       +       .       ID=exon                                               -XM_034926345.1-11;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM                                               _034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing                                                protein 171-like;transcript_id=XM_034926345.1
NC_048323.1     Gnomon  exon    76336   76521   .       +       .       ID=exon                                               -XM_034926345.1-12;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM                                               _034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing                                                protein 171-like;transcript_id=XM_034926345.1
NC_048323.1     Gnomon  exon    78960   78977   .       +       .       ID=exon                                               -XM_034926345.1-13;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XM                                               _034926345.1;gbkey=mRNA;gene=LOC117420859;product=coiled-coil domain-containing                                                protein 171-like;transcript_id=XM_034926345.1
NC_048323.1     Gnomon  CDS     25044   25136   .       +       0       ID=cds-                                               XP_034782236.1;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XP_034                                               782236.1;Name=XP_034782236.1;gbkey=CDS;gene=LOC117420859;product=coiled-coil do                                               main-containing protein 171-like;protein_id=XP_034782236.1
NC_048323.1     Gnomon  CDS     25929   26031   .       +       0       ID=cds-                                               XP_034782236.1;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XP_034                                               782236.1;Name=XP_034782236.1;gbkey=CDS;gene=LOC117420859;product=coiled-coil do                                               main-containing protein 171-like;protein_id=XP_034782236.1
NC_048323.1     Gnomon  CDS     34873   34976   .       +       2       ID=cds-                                               XP_034782236.1;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XP_034                                               782236.1;Name=XP_034782236.1;gbkey=CDS;gene=LOC117420859;product=coiled-coil do                                               main-containing protein 171-like;protein_id=XP_034782236.1
NC_048323.1     Gnomon  CDS     37369   37437   .       +       0       ID=cds-                                               XP_034782236.1;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XP_034                                               782236.1;Name=XP_034782236.1;gbkey=CDS;gene=LOC117420859;product=coiled-coil do                                               main-containing protein 171-like;protein_id=XP_034782236.1
NC_048323.1     Gnomon  CDS     37932   38132   .       +       0       ID=cds-                                               XP_034782236.1;Parent=rna-XM_034926345.1;Dbxref=GeneID:117420859,Genbank:XP_034                                               782236.1;Name=XP_034782236.1;gbkey=CDS;gene=LOC117420859;product=coiled-coil do                                               main-containing protein 171-like;protein_id=XP_034782236.1
NC_048323.1     Gnomon  mRNA    111664  172479  .       -       .       ID=rna-                                               XM_034035428.2;Parent=gene-LOC117421266;Dbxref=GeneID:117421266,Genbank:XM_0340                                               35428.2;Name=XM_034035428.2;gbkey=mRNA;gene=LOC117421266;model_evidence=Support                                               ing evidence includes similarity to: 13 Proteins%2C and 100%25 coverage of the                                                annotated genomic feature by RNAseq alignments%2C including 10 samples with sup                                               port for all annotated introns;product=eukaryotic translation initiation factor                                                2-alpha kinase 3-like%2C transcript variant X1;transcript_id=XM_034035428.2
NC_048323.1     Gnomon  exon    172022  172479  .       -       .       ID=exon                                               -XM_034035428.2-1;Parent=rna-XM_034035428.2;Dbxref=GeneID:117421266,Genbank:XM_                                               034035428.2;gbkey=mRNA;gene=LOC117421266;product=eukaryotic translation initiat                                               ion factor 2-alpha kinase 3-like%2C transcript variant X1;transcript_id=XM_0340                                               35428.2
NC_048323.1     Gnomon  exon    157760  157889  .       -       .       ID=exon                                               -XM_034035428.2-2;Parent=rna-XM_034035428.2;Dbxref=GeneID:117421266,Genbank:XM_                                               034035428.2;gbkey=mRNA;gene=LOC117421266;product=eukaryotic translation initiat                                               ion factor 2-alpha kinase 3-like%2C transcript variant X1;transcript_id=XM_0340                                               35428.2
mRNA python gff script exons • 570 views
ADD COMMENT
0
Entering edit mode
21 months ago
Michal Nevo ▴ 130

I will update that I wrote a script that parses the GFF file and outputs all exons that are a part of mRNA (and not tRNA for example. All exons that appears after mRNAs)

I used Pandas and that's it.

import sys
import pandas as pd


def parse_df(gff_df, out_file):
    """
       Outputs all exons of mRNAs
       :param gff_df: dataFrame, out_file: output file

    """

out = open(out_file, 'a')
gene_indexes = gff_df[(gff_df['region'] == 'gene')]
# print(gene_indexes.to_string())
gene_index_list = []
for i, row in gene_indexes.iterrows():
    gene_index_list.append(i)
gene_index_list.append(len(gff_df)-1)
# print('gene index list:', gene_index_list)
gene_index_list = make_pairs(gene_index_list)
# print('gene index list pairs:', gene_index_list)

''' Create gene blocks ''' 
num_of_blocks = 0
for index_pair in gene_index_list:
    num_of_blocks += 1
    gene_block = gff_df[index_pair[0]: index_pair[1]]
    print('INDEX OF GENE BLOCK: {}-{}'.format(index_pair[0], index_pair[1]) )
    # print('gene_block:')
    # print(gene_block.to_string())
    mRNA_indexs_of_block_sub_df = gene_block[(gene_block['region'] == 'exon') & (gene_block['region'] == 'mRNA').shift()]
    mRNA_indexs_of_block = []
    for i, row in mRNA_indexs_of_block_sub_df.iterrows():
        mRNA_indexs_of_block.append(i)
    print('mRNA indexes of block: ', mRNA_indexs_of_block)

    ''' Skip block if no mRNAs ''' 
    if len(mRNA_indexs_of_block) > 0:
        num_of_mRNA_per_block = 0
        for ind, row in gene_block.iterrows():
            if row['region'] == 'mRNA':
                num_of_mRNA_per_block += 1
                index_mRNA = ind
                num_of_tRNA = 0
                for ind, row in gene_block.loc[index_mRNA+1:].iterrows():
                    if row['region'] == 'tRNA':
                        num_of_tRNA += 1
                    elif row['region'] == 'exon' and num_of_tRNA == 0:
                        # print(gene_block.loc[[ind]].to_string())
                        out.write(gene_block.loc[[ind]].to_string())

        print('Block number: {} , Number of mRNAs for this block: {}'.format(num_of_blocks,num_of_mRNA_per_block) )


def create_df_from_gff(file_path):
    """
    Create a dataframe from the gff file
    :param file_path: file path
    :return: dataFrame
    """
    col = {0: "chrom_name", 1: "RefSeq", 2: "region", 3: "start", 4: "end", 5: "score", 6: "strand", 7: "frame", 8: "attribute"}
    df_gff = pd.read_csv(file_path, sep="\t", header=None, skiprows=9)
    df_gff = df_gff.rename(columns=col)
    return df_gff


def make_pairs(_list):

    new_list = []
    for ind in range(1, len(_list)):
        x = _list[ind-1]
        y = _list[find]
        new_list.append((x, y))
    return new_list
ADD COMMENT

Login before adding your answer.

Traffic: 1575 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6