Question: Extracting information from a large fasta header formatted file
0
gravatar for User 6777
4.3 years ago by
User 677720
United States
User 677720 wrote:

I have a large fasta header formatted file (input.txt) like:

>NC_23689
#
# XYZ
# Copyright (c)  BLASC
#
# Predicted binding regions
#   No.                Start         End      Length
#   1                      1          25          25
#   2                     39          47           9
#
>68469409
#
# XYZ
# Copyright (c)  BLASC
#
# Predicted binding regions
#   None.
#
# Prediction profile output:
#   Columns:
#   1 - Amino acid number
#   2 - One letter code
#   3 -  probability value
#   4 - output
#
1   M     0.1325        0
2   S     0.1341        0
3   S     0.1384        0
>68464675
#
# XYZ
# Copyright (c)  BLASC
#
# Predicted binding regions
#   No.                Start         End      Length
#   1                     13          24          12
#   2                     31          53          23
#   3                     81          95          15
#   4                    115         164          50
#
...
...

I want to extract each fasta header and its corresponding start-end value(s) (after Predicted binding regions line) in a (output.txt file). For the above (input.txt), the output will be:

NC_23689: 1-25, 39-47
68464675: 13-24, 31-53, 81-95, 115-164

I have tried some regular perl and python scripts ut it looks like not that much straightforward to me. I was wondering whats the best way to get these lines out from the input file for every single header?

Thanks

file parsing python perl • 1.0k views
ADD COMMENTlink modified 4.3 years ago by Jean-Karim Heriche24k • written 4.3 years ago by User 677720
1

Interesting way of describing the file (fasta header formatted). A fasta formatted file is expected to have sequence data in it (protein or DNA). This has neither.

ADD REPLYlink written 4.3 years ago by GenoMax93k
2
gravatar for Jean-Karim Heriche
4.3 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche24k wrote:

As your file is not a FASTA file, you can't use existing FASTA parsers. Here is a naive perl approach. Modify as appropriate.

#!/usr/bin/perl

use strict;
use warnings;

$/ = ">";
my %data;
open my $fh,"<", $ARGV[0] or die "\nERROR: Can't read file $ARGV[0]: $!";
while(my $record =<$fh>) {
  my ($header, @table) = split(/\n/,$record);
  my $read = 0;
  foreach my $i(0..$#table) {
    if ($table[$i]=~/Predicted binding region/i && $table[$i+1]=~/Start\s+End/i) {
      $read = 1;
      next
    }
    if ($read) {
      next if ($table[$i]=~/Start\s+End/i);
      my @cells = split(/\s+/,$table[$i]);
      next unless ($cells[1]);
      push @{$data{$header}},"$cells[2]-$cells[3]";
    }
  }
}
close $fh;
foreach my $header(keys %data) {
  print "$header: ",join(", ",@{$data{$header}}),"\n";
}
ADD COMMENTlink written 4.3 years ago by Jean-Karim Heriche24k

Exactly what I need. Thank you

ADD REPLYlink written 4.3 years ago by User 677720

Go ahead and accept the answer (use the check mark against @Jean-Karim's post.

ADD REPLYlink written 4.3 years ago by GenoMax93k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1175 users visited in the last hour