Question

Perl script: generate pseudo-CDSs from intergenic region

0

9.4 years ago

biolab ★ 1.4k

Hi everyone

Recently I read a paper (Genetica (2009) 137:159-164) in which the authors generate pseudo-CDSs (in other words, a negative dataset) from intergenic sequences. They used an in-house script to randomly extract sequences from intergenic regions. The resulting pseudo-CDSs have the same number of sequences and a similar length distribution as the genuine CDSs.

I am wondering if someone could share this kind of Perl script with me. I would appreciate your help. Any comments (e.g. on available tools) would also be helpful. Thank you very much!

perl • 4.1k views

updated 18 months ago by Ram 43k • written 9.4 years ago by biolab ★ 1.4k

Ram · Answer 1 · 2014-11-27

3

9.4 years ago

PoGibas 5.1k

I use bedtools shuffle for this. It takes three inputs:

CDS.bed (-i): bedtools shuffle repositions each feature in the input BED file on a random chromosome at a random position, preserving the size and strand of each feature.
Genome (-g): a file of chromosome sizes.
Regions to exclude (-excl): in your case the protein-coding gene coordinates, since you want a negative dataset. Also see my question: Genomic Regions To Exclude Before Shuffling Intervals.

I would also recommend using the -chrom option, which keeps each shuffled feature on its original chromosome.
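
For anyone who would rather stay in plain Perl, as the question asked, the core idea of bedtools shuffle (keep each feature's length, drop it at a random allowed position) can be sketched roughly as below. This is only a minimal illustration, not the in-house script from the paper and not how bedtools is implemented. The sub name sample_pseudo_cds, the toy coordinates, and the BED-style 0-based half-open intervals are all assumptions, and the intergenic intervals are expected to have been computed beforehand. Sampled windows may overlap one another.

```perl
#!/usr/bin/env perl
# Sketch: draw one pseudo-CDS window per genuine CDS length from intergenic intervals.
use strict;
use warnings;

# sample_pseudo_cds(\@intergenic, \@cds_lengths) returns a list of [chrom, start, end].
# Intervals are assumed to be 0-based, half-open (BED style): [chrom, start, end].
sub sample_pseudo_cds {
    my ($intergenic, $cds_lengths) = @_;
    my @picked;
    for my $len (@$cds_lengths) {
        # Candidate regions are those long enough to hold a window of $len bp.
        my @fits = grep { $_->[2] - $_->[1] >= $len } @$intergenic;
        unless (@fits) {
            warn "no intergenic region >= $len bp; skipping\n";
            next;
        }
        # Weight each region by its number of valid start positions so that
        # every possible window of this length is equally likely.
        my $total = 0;
        $total += $_->[2] - $_->[1] - $len + 1 for @fits;
        my $r = int(rand($total));
        for my $reg (@fits) {
            my $n = $reg->[2] - $reg->[1] - $len + 1;
            if ($r < $n) {
                my $start = $reg->[1] + $r;
                push @picked, [$reg->[0], $start, $start + $len];
                last;
            }
            $r -= $n;
        }
    }
    return @picked;
}

# Hypothetical toy data: three intergenic intervals and three CDS lengths.
srand(42);    # fixed seed so repeated runs on the same machine are comparable
my @intergenic  = (['chr1', 100, 400], ['chr1', 1000, 2500], ['chr2', 50, 900]);
my @cds_lengths = (300, 900, 150);
print join("\t", @$_), "\n" for sample_pseudo_cds(\@intergenic, \@cds_lengths);
```

The output is one BED-like line per pseudo-CDS, so the number of sequences and their lengths match the input CDS lengths whenever a long enough intergenic region exists. Strand is not assigned here; it could be copied from the original CDS in the same way bedtools shuffle preserves it.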

updated 2.2 years ago by Ram 43k • written 9.4 years ago by PoGibas 5.1k

0

Hi PoGibas, your answer is really helpful. Thanks a lot!

updated 2.2 years ago by Ram 43k • written 9.4 years ago by biolab ★ 1.4k

Ram · Answer 2 · 2015-02-25

Hi there,

I wrote a script in Perl (below). It's very simple. Copy and paste it into a text file (use gedit or vim to create the file; gedit is the easier of the two) and name it something like gbk2intergenic.pl

#AUTHOR: MAHMOUD AL-BASSAM 02/25/2015 UCSD
#Intergenic sequence extraction script for any bacterial GenBank file
#The script takes tRNA and rRNA genes into account as well as CDSs
#Run the script from the Terminal as follows:
#perl gbk2intergenic.pl <your .gbk file> > <file_name.fasta> (don't type the "<" and ">" around the names)
#THE FIRST AND LAST GENES DON'T HAVE INTERGENIC SEQUENCE!
#Only the first record in the GenBank file is processed.

use strict;
use warnings;
use Bio::SeqIO;

my $file = $ARGV[0] or die "Usage: perl gbk2intergenic.pl <your .gbk file> > <file_name.fasta>\n";
my $in = Bio::SeqIO->new(-file => $file, -format => "GenBank");
my $obj = $in->next_seq();

my $prev;    # previous CDS/rRNA/tRNA feature; undefined until the first one is seen
foreach my $feat ($obj->get_SeqFeatures()) {
    my $pt = $feat->primary_tag();
    next unless ($pt eq "CDS" or $pt eq "rRNA" or $pt eq "tRNA");
    if (defined $prev) {
        my $start = $prev->end() + 1;      # first base after the upstream gene
        my $end   = $feat->start() - 1;    # last base before the downstream gene
        # Same test as the original: overlapping or adjacent genes (and 1-bp gaps) are skipped
        if ($start < $end) {
            my $dir1 = $prev->strand() == 1 ? "for" : "rev";
            my $dir2 = $feat->strand() == 1 ? "for" : "rev";
            # get_tag_values() throws if the tag is absent, so check with has_tag() first;
            # "NA" is a placeholder for features without a locus_tag
            my ($locus1) = $prev->has_tag("locus_tag") ? $prev->get_tag_values("locus_tag") : ("NA");
            my ($locus2) = $feat->has_tag("locus_tag") ? $feat->get_tag_values("locus_tag") : ("NA");
            my $subseq = $obj->subseq($start, $end);
            print ">$dir1$locus1$dir2$locus2\n$subseq\n";
        }
    }
    $prev = $feat;
}

The output should look like this:

>forCLJU_c00180forCLJU_c00190
TATAATAATTATATTTAAACGTAGGGCAT
>forCLJU_c00190forCLJU_c00200
TTATACTTTTAATAATGGCTTAAATATAGTTACTTAAGGAAGTTATTCATTAAAATGGTTACTTCTTTTTTTATTTACTAGGATAGGTATATAAATTTTAACTTGATATAGTAAATAGCCTTGATTTT
AATTGGTATTTGTGTTAATATAACACTGTTAATTT
>forCLJU_c00200revCLJU_c00210
TAATCCATTCCAAACACTGACTTCCAGTGTTTTTTATTTTATCTAAAATCAGATAATTGTCCCCATTTTGTCCCCAAAGTTTTCAATGGGGCATT
>revCLJU_c00210revCLJU_c00220

In the headers, "for" means the gene is on the forward strand and "rev" means it is on the reverse strand.

Good luck!