Appending Identifiers To The Fasta Header: Multiple Files
3
0
Entering edit mode
12.0 years ago
himanshu17 ▴ 20

Hi,

I have a collection of 200 bacterial genome sequences with each file corresponding to separate bacterial species. The names of the sequence files are the corresponding bacterial species (e.g Nostoc punctiforme.fasta and 199 more). The sequences are downloaded from multiple databases so they have different header formats.

I am working on some perl scripts which can append my identifier ( e.g.NOPU ) to each fasta entry of the corresponding genome (So that each entry of the Nostoc punctiforme.fasta has NOPU appended after >) i am able to append and modify long identifiers through these scripts which asks for the file name and the identifier . In order to automate this process i have created a text file with two column, the first column for Filename and second column for identifier and each line having new entry e.g

**Filename**                **My identifier**
Nostoc punctiforme          NOPU
Prochlorococcus marinus     PRMA

and 198 more entries

the idea behind this is to write a script which asks for a file name, searches the file name as well as identifier from the text file and appends the corresponding identifier to each entry of the fasta file and makes a new output file with the same name as the identifier or a script that asks for a directory of fasta file and picks fasta files one by one, look up the text file for identifier and do the needful.

However i am failing to have any lead in this direction and request people out there to help me.

Thanks

fasta perl • 6.8k views
ADD COMMENT
4
Entering edit mode
12.0 years ago
Neilfws 49k

This is a common bioinformatics problem which we can generalise as "I want to read a fasta file and do something with it." Since you are using Perl, you should be using Bioperl and in particular, the SeqIO module.

You need to break the problem down into a series of steps. They are:

  1. Open file containing file names and identifiers
  2. Step through each line and extract file name
  3. Open corresponding fasta file and append identifier to header

I'll assume that your fasta files and the file of file names are in the same directory. I'll also assume that your fasta file names are as given, i.e. they contain a space and have no suffix. This is not good practice and I'd suggest renaming the files to something like Nostoc_punctiforme.fa. You don't say whether the fasta files contain protein or nucleotide sequences but a good convention for suffixes is that used by the NCBI: .faa for protein, .ffn for nucleotide and .fna for nucleotide single sequence (e.g. a chromosome).

I'll also assume that (1) the columns of your text file are tab-separated, (2) the file is called ids.txt, (3) you want to write to a new file with the suffix .mod, containing fasta sequences with the modified header and (4) the identifier should be inserted directly after the ">" in the header.

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

# open text file
my $file = "ids.txt";
open IN, $file;
while(<IN>) {
  # remove line endings and split on tab
  chomp;
  my @line = split("\t", $_);
  # check fasta file exists and open
  if(-e $line[0]) {
    my $fa = Bio::SeqIO->new(-file => $line[0], -format => "fasta");
    # create new fasta file for modified output
    my $outseq = Bio::SeqIO->new(-file => ">$line[0].mod", -format => "fasta");
    while(my $seq = $fa->next_seq) {
      my $id = $seq->display_id;
      # modify header
      $id = $line[1].$id;
      # modified sequence object
      my $newseq = Bio::Seq->new(-id => $id, -seq => $seq->seq);
      # write modified sequence to file
      $outseq->write_seq($newseq);
    }
  }
}
close IN;

Result using a small test file named Nostoc_punctiforme:

INPUT Nostoc_punctiforme
>seq1
agagagagagagaga  
>seq2
gatacatagatacag

OUTPUT Nostoc_punctiforme.mod
>NOPUseq1
agagagagagagaga
>NOPUseq2
gatacatagatacag
ADD COMMENT
3
Entering edit mode
12.0 years ago

One the solution would be to:

  1. read your sample file line by line
  2. split each line int filename and postfix tag
  3. now read the file line by line
  4. if the line starts with the > symbol append the postfix tag

In python would look like so:

for sample in file('samples.txt'):
    fname, postfix = sample.split()
    output = file( "%s.new" % fname, 'wt')
    for line in file(fname):
        line = line.strip()
        if line.startswith(">"):
            line = line + postfix
        output.write("%s\n" % line)

You can also invert the situation to first preload your samples into a dictionary like so:

store = {}
for sample in file('samples.txt'):
    fname, postfix = sample.split()
    store[fname]=postfix

fname = "one.fasta"
postfix = store[fname]
output = file( "%s.new" % fname, 'wt')
for line in file(fname):
    line = line.strip()
    if line.startswith(">"):
        line = line + postfix
    output.write("%s\n" % line)
ADD COMMENT
0
Entering edit mode

Hi! I tried this using a tab separated file called samples.txt and I'm getting the following error: NameError: name 'file' is not defined

I'm not sure what's wrong?

ADD REPLY
1
Entering edit mode
12.0 years ago
Rm 8.3k

try this: with "Filename My identifier" file (asuming tab delimited)

cat Filenames.txt | while read LINE; do FNAME="$(echo $LINE | cut -d"\t" -f1 )"; ID="$(echo $LINE | cut -d"\t" -f2)"; awk -v code="ID" '{sub(/^>/, "$code", $0); print}' ${FNAME}.fasta ; done

I haven't tested though

ADD COMMENT

Login before adding your answer.

Traffic: 2794 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6