Insert A Sequence Into A Big Genome Fasta File + Shift The Annotation File
2
2
Entering edit mode
11.0 years ago

Hi,

I want to insert a custom sequence into the human genome at a specific position. My sequence is about ~20kb long. So :

  1. How can I insert this sequence at a specific position (I thought using bioperl)
  2. How can I shift all the features in the annotation file in order that the annotation is correct after insertion. The difficulty is that the insertion can be into a feature (like an intron or an exon...). I thought to write a perl script to do that but if anyone has an other idea ..

Thanks,

N.

sequence gtf • 4.7k views
ADD COMMENT
0
Entering edit mode

why are you doing that?

ADD REPLY
1
Entering edit mode

I guess because it reflects biological reality e.g. a knock-in?

ADD REPLY
0
Entering edit mode

exactly ! For the insertion of the sequence it's ok, I will put my answer below. The problem is the sifting in the gtf file.

ADD REPLY
2
Entering edit mode
11.0 years ago

Here is my solution for the first point : the insertion step. I wrote it in bioperl. It's not super clean but it works :

#!/usr/bin/perl

use warnings;
use strict;
use Bio::SeqIO;
use Bio::Seq;

my $file = $ARGV[0]; # input fasta file (genome file)
my $out = $ARGV[1]; # output fasta file

my $chr="test"; #insertion chromosome
my $pos=10; # position of the insertion
my $seqI = "AAAA"; #sequence of the insertion

my $seq_in  = Bio::SeqIO->new( -format => 'fasta',-file => $file);
my $seq_out = Bio::SeqIO->new( -format => 'fasta',-file => ">".$out);
while( my $seq = $seq_in->next_seq() ) {    
    if($seq->primary_id eq $chr){
        my $length = length($seq->seq);    
        my $upstream=substr($seq->seq, 0, $pos);
        my $downstream=substr($seq->seq, $pos,$length);        
        my $seq_obj = Bio::Seq->new(-seq => $upstream.$seqI.$downstream,-display_id => $seq->primary_id,-alphabet => "dna" );
            $seq_out->write_seq($seq_obj);
    }
    else{
        $seq_out->write_seq($seq);
    }
}
ADD COMMENT
2
Entering edit mode
11.0 years ago
Michael 54k

For the purpose of adjusting the GTF file it should be sufficient to

  • shift all genomic coordinates (start, end) with start,end >= insert_start by the insert size (that means add the insert size)
  • leave all other coordinates untouched
  • if insert length is not a multiple of 3, frame information needs to be adjusted for all regions that have frame information and a start position >= insert_start, if frame > 0. For regions with frame < 0, adjust frame if end >= insert start. This can be calculated as new_frame := (old_frame + (insert_length %% 3)) %% 3
ADD COMMENT

Login before adding your answer.

Traffic: 3332 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6