Question: Help with perl script to subset a fasta file?
gravatar for auninsaw
27 days ago by
auninsaw0 wrote:

Hi - I have a large fasta file of ~500 sequences, each of which has a known to me target position with a SNP of interest. For each entry in the fasta file, I have a separate tab delimited text file that has on each line the fasta sequence name, start position, stop position, and SNP position. The fasta sequences and the positions text file are in the same order. The dummy fasta file is:


...and the dummy positions file is:

AOS-94_S25_L002_R1_001_trimmed_contig_767   5   15  10
AOS-94_S25_L002_R1_001_trimmed_contig_2199  8   19  11
AOS-94_S25_L002_R1_001_trimmed_contig_2585  4   20  18

...where I want to extract positions 5-15 of the first sequence, and the snp is at position 10. Then, extract sequence 2 at bp 8-19, where the SNP is at position 11, etc.

This is the script I wrote:

use strict;
use warnings;
#Read in the complete fasta file:
print "What is the name of the fasta contig file?\n";
my $fasta = <STDIN>;
chomp $fasta;
#Read in file of contig name, start pos, stop pos, SNP pos in tab delimited text:
print "Name of text file with contig name and SNP position info? \n";
my $text = <STDIN>;
chomp $text;
#Output file
print "What are we calling the output? \n";
my $out= <STDIN>;
chomp $out;
local $/ = "\n>"; #Read by fasta record
my $seq1 = (); 
open(FASTA,$fasta) || die "\n Unable to open the file!\n";
open(POS,$text) || die "\n Unable to open the file! \n";
my @fields = <POS>;
    while (my $seq = <FASTA>){
        chomp $seq;
        my @seq = split(/\n/,$seq);
            if($seq[0] =~ /^>/){
                $seq1 = $seq[0];
            }elsif($seq[0] =~ /[^>]/){ #matches any character except the >
                $seq1 = ">".$seq[0];
    for my $pos (@fields){
        chomp $pos;
        my @field = split(/\t/,$pos);
    print OUTFILE "$seq1";
    my $subseq = substr $seq[1], $field[1] -1, $field[2] - $field[1]; 
    print OUTFILE "$subseq\n";
close FASTA;
close POS;
close OUTFILE;

It works in that I get the reduced fasta file I want. However, I want to enter these into batch primer for primer design. On each line after the fasta header, I need to have pos=(whatever tab 4 is, the position of the SNP). If I insert "print OUTFILE "$field[3]\n" after print OUTFILE "$seq[1]" it doesn't work right. I've screwed something up in my loop structure and chomping. Obviously I am struggling with some perl basics. Does anyone have any hints? Thanks for any insight/help.

snp script fasta perl • 183 views
ADD COMMENTlink modified 26 days ago • written 27 days ago by auninsaw0

Have you started working on this script on your own? You might find better advice for scripting on stack exchange.

ADD REPLYlink written 27 days ago by The Last Word140

I've started on it, but its admittedly a mess. I was having some trouble getting it posted up correctly but it looks ok now. I may try Stackexchange, thanks for the suggestion.

ADD REPLYlink written 27 days ago by auninsaw0

see also: Extract User Defined Region From An Fasta File

ADD REPLYlink written 27 days ago by Pierre Lindenbaum103k
gravatar for auninsaw
26 days ago by
auninsaw0 wrote:

Okay, here is a well organized answer I received from stack overflow, that solves my question using Perl and may help others with a similar problem. It is similar to my original in spirit but with correct formatting, comments, and pragma. Thanks for those who offered help.

use strict;
use warnings;

print "Name of the FASTA contig file: ";
chomp( my $fasta_file = <STDIN> );

print "Name file with SNP position info: ";
chomp( my $pos_file = <STDIN> );

print "Name of the output file: ";
chomp( my $out_file = <STDIN> );

open my $out_fh, '>', $out_file or die qq{Unable to open "out_file" for output: $!};

my @pos_records = do {
    open my $pos_, '<' , $pos_file or die qq{Unable to open "$pos_file" for input: $!};
chomp @pos_records; #remove all newlines  

     open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!};

     local $/ = "\n>"; #Reading FASTA format now

     for ( <$fasta_fh> ) {

         chomp; #Remove ">\n" from the end

         my ( $header, $seq) = split /\n/; #separate the two lines

         $header = ">$header" unless $header =~ /^>/; # Replace any chomped >

     for ( @pos_records ) {

                my ($name,$beg,$end,$pos) = split /\t/;
                 my $subseq = substr $seq, $beg-1, $end-$beg;
             my $final_SNP = $end - $pos; 

                 if($header =~ /$name/){

                   print $out_fh "$header\n";
                   print $out_fh "pos=$final_SNP\n";
                   print $out_fh "$subseq\n";
} #local expires here

close $out_fh or die $!;
ADD COMMENTlink modified 26 days ago • written 26 days ago by auninsaw0

This way loops through all the the pos_records once for every fasta entry. That's not a terribly smart way to do this.

ADD REPLYlink written 14 days ago by swbarnes23.0k
gravatar for Pierre Lindenbaum
27 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum103k wrote:

assuming there is two lines per fasta file and the delimiter is the tabulation: use join and pipe into awk

join -t $'\t' -1 1 -2 1  \
      <(cat input.fa | paste - - | cut -c 2- | sort -t $'\t' -k1,1) \
      <(sort -t $'\t' -k1,1 jeter.txt) |\
awk '{printf(">%s %s %s %s\n%s\n",$1,$3,$4,$5,substr($2,int($3),int($4)-int($3)));}'

>AOS-94_S25_L002_R1_001_trimmed_contig_2199 8 19 11
>AOS-94_S25_L002_R1_001_trimmed_contig_2585 4 20 18
>AOS-94_S25_L002_R1_001_trimmed_contig_767 5 15 10
ADD COMMENTlink written 27 days ago by Pierre Lindenbaum103k

Thanks, Pierre. I am not that familiar with awk, but I will try and work through what you posted and figure out how it is working.

ADD REPLYlink written 27 days ago by auninsaw0
gravatar for swbarnes2
27 days ago by
United States
swbarnes23.0k wrote:

Yeah, I don't think your loops are right. You are going through the entire pos file within the fasta loop, and I don't see where you are specificing that you want only the line whose name matches the fasta name..

Personally, I'd arrange it something like this:

open (POS, "pos.txt") or die "Arrg!";
while (<POS)>) {
  my @pos = split "\t";
  $hash{$pos[0]}{'start'} = $pos[1];
  $hash{$pos[0]}{'end'}   = $pos[2];
  $hash{$pos[0]}{'snp'}   = $pos[3];

Then go through your fasta, and you have easy access to the numbers you need for each particular sequence.

Also a clever way to go through fasta files...change the line separator to ">" instead of new line. ( copied this from here:

$/ = ">";
my $junk = <IN>;
while ( my $record = <IN> ) {
  chomp $record;
  my ($defLine, @seqLines) = split /\n/, $record;
  my $sequence = join('',@seqLines);
ADD COMMENTlink modified 27 days ago • written 27 days ago by swbarnes23.0k

Thanks for the suggestion, I'm going to keep working at this and will post up when I get it working.

ADD REPLYlink written 27 days ago by auninsaw0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1397 users visited in the last hour