Question

Appending Identifiers To The Fasta Header: Multiple Files

0

Entering edit mode

12.0 years ago

himanshu17 ▴ 20

Hi,

I have a collection of 200 bacterial genome sequences with each file corresponding to separate bacterial species. The names of the sequence files are the corresponding bacterial species (e.g Nostoc punctiforme.fasta and 199 more). The sequences are downloaded from multiple databases so they have different header formats.

I am working on some perl scripts which can append my identifier ( e.g.NOPU ) to each fasta entry of the corresponding genome (So that each entry of the Nostoc punctiforme.fasta has NOPU appended after >) i am able to append and modify long identifiers through these scripts which asks for the file name and the identifier . In order to automate this process i have created a text file with two column, the first column for Filename and second column for identifier and each line having new entry e.g

**Filename**                **My identifier**
Nostoc punctiforme          NOPU
Prochlorococcus marinus     PRMA

and 198 more entries

the idea behind this is to write a script which asks for a file name, searches the file name as well as identifier from the text file and appends the corresponding identifier to each entry of the fasta file and makes a new output file with the same name as the identifier or a script that asks for a directory of fasta file and picks fasta files one by one, look up the text file for identifier and do the needful.

However i am failing to have any lead in this direction and request people out there to help me.

Thanks

fasta perl • 6.8k views

ADD COMMENT • link updated 12.0 years ago by Neilfws 49k • written 12.0 years ago by himanshu17 ▴ 20

score 4 · Answer 1 · 2012-04-09

This is a common bioinformatics problem which we can generalise as "I want to read a fasta file and do something with it." Since you are using Perl, you should be using Bioperl and in particular, the SeqIO module.

You need to break the problem down into a series of steps. They are:

Open file containing file names and identifiers
Step through each line and extract file name
Open corresponding fasta file and append identifier to header

I'll assume that your fasta files and the file of file names are in the same directory. I'll also assume that your fasta file names are as given, i.e. they contain a space and have no suffix. This is not good practice and I'd suggest renaming the files to something like Nostoc_punctiforme.fa. You don't say whether the fasta files contain protein or nucleotide sequences but a good convention for suffixes is that used by the NCBI: .faa for protein, .ffn for nucleotide and .fna for nucleotide single sequence (e.g. a chromosome).

I'll also assume that (1) the columns of your text file are tab-separated, (2) the file is called ids.txt, (3) you want to write to a new file with the suffix .mod, containing fasta sequences with the modified header and (4) the identifier should be inserted directly after the ">" in the header.

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

# open text file
my $file = "ids.txt";
open IN, $file;
while(<IN>) {
  # remove line endings and split on tab
  chomp;
  my @line = split("\t", $_);
  # check fasta file exists and open
  if(-e $line[0]) {
    my $fa = Bio::SeqIO->new(-file => $line[0], -format => "fasta");
    # create new fasta file for modified output
    my $outseq = Bio::SeqIO->new(-file => ">$line[0].mod", -format => "fasta");
    while(my $seq = $fa->next_seq) {
      my $id = $seq->display_id;
      # modify header
      $id = $line[1].$id;
      # modified sequence object
      my $newseq = Bio::Seq->new(-id => $id, -seq => $seq->seq);
      # write modified sequence to file
      $outseq->write_seq($newseq);
    }
  }
}
close IN;

Result using a small test file named Nostoc_punctiforme:

INPUT Nostoc_punctiforme
>seq1
agagagagagagaga  
>seq2
gatacatagatacag

OUTPUT Nostoc_punctiforme.mod
>NOPUseq1
agagagagagagaga
>NOPUseq2
gatacatagatacag

score 3 · Answer 2 · 2012-04-09

One the solution would be to:

read your sample file line by line
split each line int filename and postfix tag
now read the file line by line
if the line starts with the > symbol append the postfix tag

In python would look like so:

for sample in file('samples.txt'):
    fname, postfix = sample.split()
    output = file( "%s.new" % fname, 'wt')
    for line in file(fname):
        line = line.strip()
        if line.startswith(">"):
            line = line + postfix
        output.write("%s\n" % line)

You can also invert the situation to first preload your samples into a dictionary like so:

store = {}
for sample in file('samples.txt'):
    fname, postfix = sample.split()
    store[fname]=postfix

fname = "one.fasta"
postfix = store[fname]
output = file( "%s.new" % fname, 'wt')
for line in file(fname):
    line = line.strip()
    if line.startswith(">"):
        line = line + postfix
    output.write("%s\n" % line)

score 1 · Answer 3 · 2012-04-09

1

Entering edit mode

12.0 years ago

Rm 8.3k

try this: with "Filename My identifier" file (asuming tab delimited)

cat Filenames.txt | while read LINE; do FNAME="$(echo $LINE | cut -d"\t" -f1 )"; ID="$(echo $LINE | cut -d"\t" -f2)"; awk -v code="ID" '{sub(/^>/, "$code", $0); print}' ${FNAME}.fasta ; done

I haven't tested though

ADD COMMENT • link 12.0 years ago by Rm 8.3k