Question: Help with perl script to subset a fasta file?
0
gravatar for auninsaw
8 months ago by
auninsaw10
auninsaw10 wrote:

Hi - I have a large fasta file of ~500 sequences, each of which has a known to me target position with a SNP of interest. For each entry in the fasta file, I have a separate tab delimited text file that has on each line the fasta sequence name, start position, stop position, and SNP position. The fasta sequences and the positions text file are in the same order. The dummy fasta file is:

>AOS-94_S25_L002_R1_001_trimmed_contig_767
GACACACACTGATTGTTAGTGGTGTACAGACATTGCTTCAAACTGCA
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
TAGGTTTTCTTTCCCATGTCCCCTGAATAACATGGGATTCCCTGTGACTGTGGGGACCCCTGAGAGCCTGGT
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
GATAAGGAGCTCACAGCAACCCACATGAGTTGTCC

...and the dummy positions file is:

AOS-94_S25_L002_R1_001_trimmed_contig_767   5   15  10
AOS-94_S25_L002_R1_001_trimmed_contig_2199  8   19  11
AOS-94_S25_L002_R1_001_trimmed_contig_2585  4   20  18

...where I want to extract positions 5-15 of the first sequence, and the snp is at position 10. Then, extract sequence 2 at bp 8-19, where the SNP is at position 11, etc.

This is the script I wrote:

#!/usr/bin/perl
use strict;
use warnings;
#Read in the complete fasta file:
print "What is the name of the fasta contig file?\n";
my $fasta = <STDIN>;
chomp $fasta;
#Read in file of contig name, start pos, stop pos, SNP pos in tab delimited text:
print "Name of text file with contig name and SNP position info? \n";
my $text = <STDIN>;
chomp $text;
#Output file
print "What are we calling the output? \n";
my $out= <STDIN>;
chomp $out;
local $/ = "\n>"; #Read by fasta record
my $seq1 = (); 
open(FASTA,$fasta) || die "\n Unable to open the file!\n";
open(POS,$text) || die "\n Unable to open the file! \n";
my @fields = <POS>;
    while (my $seq = <FASTA>){
        chomp $seq;
        my @seq = split(/\n/,$seq);
            if($seq[0] =~ /^>/){
                $seq1 = $seq[0];
            }elsif($seq[0] =~ /[^>]/){ #matches any character except the >
                $seq1 = ">".$seq[0];
            }
    for my $pos (@fields){
        chomp $pos;
        my @field = split(/\t/,$pos);
    open(OUTFILE,">>$out");
    print OUTFILE "$seq1";
    my $subseq = substr $seq[1], $field[1] -1, $field[2] - $field[1]; 
    print OUTFILE "$subseq\n";
    }   
}
close FASTA;
close POS;
close OUTFILE;

It works in that I get the reduced fasta file I want. However, I want to enter these into batch primer for primer design. On each line after the fasta header, I need to have pos=(whatever tab 4 is, the position of the SNP). If I insert "print OUTFILE "$field[3]\n" after print OUTFILE "$seq[1]" it doesn't work right. I've screwed something up in my loop structure and chomping. Obviously I am struggling with some perl basics. Does anyone have any hints? Thanks for any insight/help.

snp script fasta perl • 414 views
ADD COMMENTlink modified 8 months ago • written 8 months ago by auninsaw10

Have you started working on this script on your own? You might find better advice for scripting on stack exchange.

ADD REPLYlink written 8 months ago by The Last Word160

I've started on it, but its admittedly a mess. I was having some trouble getting it posted up correctly but it looks ok now. I may try Stackexchange, thanks for the suggestion.

ADD REPLYlink written 8 months ago by auninsaw10

see also: Extract User Defined Region From An Fasta File

ADD REPLYlink written 8 months ago by Pierre Lindenbaum111k
0
gravatar for auninsaw
8 months ago by
auninsaw10
auninsaw10 wrote:

Okay, here is a well organized answer I received from stack overflow, that solves my question using Perl and may help others with a similar problem. It is similar to my original in spirit but with correct formatting, comments, and pragma. Thanks for those who offered help.

#!/usr/bin/perl
use strict;
use warnings;

print "Name of the FASTA contig file: ";
chomp( my $fasta_file = <STDIN> );

print "Name file with SNP position info: ";
chomp( my $pos_file = <STDIN> );

print "Name of the output file: ";
chomp( my $out_file = <STDIN> );

open my $out_fh, '>', $out_file or die qq{Unable to open "out_file" for output: $!};

my @pos_records = do {
    open my $pos_, '<' , $pos_file or die qq{Unable to open "$pos_file" for input: $!};
    <$pos_>;
};
chomp @pos_records; #remove all newlines  

{
     open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!};

     local $/ = "\n>"; #Reading FASTA format now

     for ( <$fasta_fh> ) {

         chomp; #Remove ">\n" from the end

         my ( $header, $seq) = split /\n/; #separate the two lines

         $header = ">$header" unless $header =~ /^>/; # Replace any chomped >


     for ( @pos_records ) {

                my ($name,$beg,$end,$pos) = split /\t/;
                 my $subseq = substr $seq, $beg-1, $end-$beg;
             my $final_SNP = $end - $pos; 

                 if($header =~ /$name/){

                   print $out_fh "$header\n";
                   print $out_fh "pos=$final_SNP\n";
                   print $out_fh "$subseq\n";
      }
    } 
  }
} #local expires here

close $out_fh or die $!;
ADD COMMENTlink modified 8 months ago • written 8 months ago by auninsaw10
1

This way loops through all the the pos_records once for every fasta entry. That's not a terribly smart way to do this.

ADD REPLYlink written 7 months ago by swbarnes23.9k
1
gravatar for Pierre Lindenbaum
8 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum111k wrote:

assuming there is two lines per fasta file and the delimiter is the tabulation: use join and pipe into awk

join -t $'\t' -1 1 -2 1  \
      <(cat input.fa | paste - - | cut -c 2- | sort -t $'\t' -k1,1) \
      <(sort -t $'\t' -k1,1 jeter.txt) |\
awk '{printf(">%s %s %s %s\n%s\n",$1,$3,$4,$5,substr($2,int($3),int($4)-int($3)));}'

>AOS-94_S25_L002_R1_001_trimmed_contig_2199 8 19 11
TCTTTCCCATG
>AOS-94_S25_L002_R1_001_trimmed_contig_2585 4 20 18
AAGGAGCTCACAGCAA
>AOS-94_S25_L002_R1_001_trimmed_contig_767 5 15 10
CACACTGATT
ADD COMMENTlink written 8 months ago by Pierre Lindenbaum111k

Thanks, Pierre. I am not that familiar with awk, but I will try and work through what you posted and figure out how it is working.

ADD REPLYlink written 8 months ago by auninsaw10
1
gravatar for swbarnes2
8 months ago by
swbarnes23.9k
United States
swbarnes23.9k wrote:

Yeah, I don't think your loops are right. You are going through the entire pos file within the fasta loop, and I don't see where you are specificing that you want only the line whose name matches the fasta name..

Personally, I'd arrange it something like this:

open (POS, "pos.txt") or die "Arrg!";
while (<POS)>) {
  chomp;
  my @pos = split "\t";
  $hash{$pos[0]}{'start'} = $pos[1];
  $hash{$pos[0]}{'end'}   = $pos[2];
  $hash{$pos[0]}{'snp'}   = $pos[3];
}

Then go through your fasta, and you have easy access to the numbers you need for each particular sequence.

Also a clever way to go through fasta files...change the line separator to ">" instead of new line. ( copied this from here:http://seqanswers.com/forums/showthread.php?t=24777)

$/ = ">";
my $junk = <IN>;
while ( my $record = <IN> ) {
  chomp $record;
  my ($defLine, @seqLines) = split /\n/, $record;
  my $sequence = join('',@seqLines);
}
ADD COMMENTlink modified 8 months ago • written 8 months ago by swbarnes23.9k

Thanks for the suggestion, I'm going to keep working at this and will post up when I get it working.

ADD REPLYlink written 8 months ago by auninsaw10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1491 users visited in the last hour