Question: How to extract intronic regions from a Genbank file?
0
gravatar for Evergreen
11 months ago by
Evergreen0
Evergreen0 wrote:

Hello,

I am a newbie in programming languages. I would like to extrac intronic sequences from Genbank files (from NCBI). The genomes are really big, at least 30 MB.

I try to create a BioPython script which extracts the intronic sequences in FASTA format into a new file. The intron sequences are the complementer of the CDS sequences in a Genebank file.

I found two similar scripts but I hardly understand the logic behind them, I am struggling. https://biopython.org/wiki/Intergenic_regions

https://github.com/dewshr/NCBI-Genbank-file-parser

Perhaps little changes could solve my problem. I am interested in every solution besides Python.

Please help me!

Thank you in advance!

extract intron python genome • 593 views
ADD COMMENTlink modified 10 months ago by Alex Reynolds29k • written 11 months ago by Evergreen0

Previous thread: Extract Intergenic Sequences from GenBank File with Perl

This is referring to intergenic regions but the idea is similar. Substitute introns for those.

ADD REPLYlink modified 11 months ago • written 11 months ago by genomax75k

You may wish to try this one gff2fasta by specifying the feature (--feature; Line 499 in the Perl script).

ADD REPLYlink written 11 months ago by Sishuo Wang180

Also take a look at A: how to get intronic and intergenic sequences based on gff file? to get an idea about the logic of extracting genomic features from annotation files.

ADD REPLYlink written 11 months ago by ATpoint26k
1
gravatar for Alex Reynolds
10 months ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

A set operation approach would require splitting out exons per transcript, which would work but may involve many thousands of small disk operations, which cumulatively may be very slow.

One way to do this would be to first obtain the genes. The following code shows using Gencode annotations (GFF-formatted) but the idea is the same for any annotations in GFF (or GTF, etc.).

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "gene"' \
    | convert2bed -i gff - \
    > genes.bed

Make a second file containing exons:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "exon"' \
    | convert2bed -i gff - \
    > exons.bed

Pull gene IDs from the genes.bed file, preserving the order they are discovered in the genes file:

$ cut -f4 genes.bed > ordered_ids.txt

Use a script to walk through this ID list and the exons file in order, to write a set of introns to standard output:

#!/usr/bin/env perl

use strict;
use warnings;

my $id_fn = "ordered_ids.txt";
my $exons_fn = "exons.bed";
open my $id_fh, "<", $id_fn or die "could not open ID file\n";
open my $exons_fh, "<", $exons_fn or die "could not open exons file\n";
my $result;
my $query_id = <$id_fh>;
while (my $exon_ln = <$exons_fh>) {
    chomp $exon_ln;
    my @exon_elems = split("\t", $exon_ln);
    my $exon_id = $exon_elems[3];
    if ($exon_id =~ /$query_id/) {
        # push another exon into result hash table
        my $exon_chr = $exon_elems[0];
        my $exon_start = $exon_elems[1];
        my $exon_stop = $exon_elems[2];
        $result->{$query_id}->{chr} = $exon_chr;
        push @{$result->{$query_id}->{starts}), $exon_start;
        push @{$result->{$query_id}->{stops}), $exon_stop;
    }
    else {
        # write out introns from gaps between previous exon stop and current exon start 
        if (defined $result->{$query_id}->{chr}) {
            my $result_chr = $result->{$query_id}->{chr};
            my @result_starts = @{$result->{$query_id}->{starts};
            my @result_stops = @{$result->{$query_id}->{stops};
            my $result_exon_count = scalar @result_starts;
            for (my $exon_idx = 1; $exon_idx < $result_exon_count - 1; $exon_idx++) {
                my $result_start = $result_stops[$exon_idx - 1]; # previous stop
                my $result_stop = $result_starts[$exon_idx]; # current start
                print STDOUT "$result_chr\t$result_start\t$result_stop\t$query_id\n";
            }
        }
        $query_id = <$id_fh>;  # grab next ID to query
    }
}
close $id_fh;
close $exons_fh;

I think this approach should be faster, because instead of making lots of tiny files to do set operations on, this steps through both files in order only once, printing out introns as their bounds are found.

While exons.bed is in sort-bed order, it isn't guaranteed that the output from this Perl script is in sort-bed order. So just pipe it in:

$ make_introns_from_exons_and_ids.pl | sort-bed - > introns.bed

If you need sequence data, it's just one more step to convert BED to FASTA. You would then run your BED-formatted regions through scripted calls to samtools faidx, to convert to strand-aware, FASTA-formatted sequence, using your assembly of choice. I outline one such approach in my Stack Exchange answer here: https://bioinformatics.stackexchange.com/a/5374/776

ADD COMMENTlink modified 10 months ago • written 10 months ago by Alex Reynolds29k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1029 users visited in the last hour