Question: How Do I Extract Sequence Ids From Fasta File?
7.5 years ago by
arronslacey280
United Kingdom
arronslacey280 wrote:

Simple question, although when I search Google all I get is how to extract the actual sequence. Does anyone have a quick way to read in a FASTA file and then extract all the IDs in the same order they appear in the file?

here is a snippet:

>gnl|TC-DB|P0A334 1.A.1.1.1 Voltage-gated potassium channel OS=Streptomyces lividans GN=kcsA PE=1 SV=1

and I only want the 'P0A334' part.

link to fasta file:

https://docs.google.com/file/d/0B0iDswLYaZ0zX1RJdGRrRUxiSEk/edit?usp=sharing

thanks!

fasta id protein • 12k views
modified 7.5 years ago by Neilfws49k • written 7.5 years ago by arronslacey280

If you don't mind keeping the ">", an easy grep would be a start: grep ">" input.fasta > headers.txt

But some more information would be great. What did you find for extracting the sequence? Do you want to write a script to do it? If so, it might help to take a closer look at how the sequence is extracted and adapt that to extract the header instead. Do you want the full header? And, as we are currently discussing that topic in another post: what have you already tried?

written 7.5 years ago by skymningen330

Hi, I have updated the question with a snippet of the FASTA file. Out of that snippet I only want the 'P0A334' part, and then the same for the other sequences.

written 7.5 years ago by arronslacey280

Note that the ">" was not visible in your original question. Lines beginning with that character are formatted as blockquotes at BioStar. You need to indent the line with 4 spaces (done for you) to display it properly.

written 7.5 years ago by Neilfws49k
7.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:
 awk -F '[| ]' '/^>/ { print $3}' < your.fasta
written 7.5 years ago by Pierre Lindenbaum131k

excellent - thanks!

written 7.5 years ago by arronslacey280
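
For readers who prefer Python, the same field split can be sketched roughly as below. This is only an illustration and assumes every header follows the layout shown in the question (two '|'-separated tags, then the ID, then a space and the description). The function name third_field is made up for this example.

```python
import re

def third_field(header):
    """Split a header on '|' or a space and return the third field, like awk's $3."""
    fields = re.split(r"[| ]", header.rstrip("\n"))
    # awk fields are 1-based and Python lists are 0-based, so $3 is fields[2]
    return fields[2]

# Header line taken from the snippet in the question
header = ">gnl|TC-DB|P0A334 1.A.1.1.1 Voltage-gated potassium channel OS=Streptomyces lividans GN=kcsA PE=1 SV=1"
print(third_field(header))  # prints P0A334
```

A header with fewer than three fields would raise an IndexError here, so real input may need a length check first.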
7.5 years ago by
Biojl1.7k
Barcelona
Biojl1.7k wrote:

You can use grep to keep only the lines that start with '>'. If you have a more complex scenario (e.g. the header also contains description info), you'll have to write a regular expression that fits your headers.

UPDATE: The code for your data should be something like:

grep "^>" file.fa | cut -c 12-17 > destination_file.txt

Here 12-17 are the character positions you want. Adjust them to suit your headers.

modified 7.5 years ago • written 7.5 years ago by Biojl1.7k

I would do the same but use cut with specific delimiters (-d) and columns (-f) in case the number of characters varies:

grep "^>" file.fa | cut -d'|' -f3 | cut -d ' ' -f1 > destination_file.txt

modified 7.5 years ago • written 7.5 years ago by Richard Smith400

+1 for a more elegant solution.

written 7.5 years ago by Biojl1.7k
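
To illustrate why the delimiter-based version copes better when the number of characters varies, here is a small Python sketch that mimics both cut commands. The second header is invented for illustration (it is not from the poster's file), and both function names are hypothetical.

```python
def id_by_columns(header, start=12, end=17):
    """Mimic `cut -c 12-17`: characters 12 through 17 (1-based, inclusive)."""
    return header[start - 1:end]

def id_by_delimiters(header):
    """Mimic `cut -d'|' -f3 | cut -d ' ' -f1`: third '|' field, then its first word."""
    return header.split("|")[2].split(" ")[0]

# Header from the question (description shortened)
real = ">gnl|TC-DB|P0A334 1.A.1.1.1 Voltage-gated potassium channel"
# Made-up header with a longer ID, only to show where fixed columns go wrong
made_up = ">gnl|TC-DB|Q9XYZ12345 1.A.1.2.3 Example protein"

for h in (real, made_up):
    # Both functions agree on the first header; the longer ID is truncated by columns
    print(id_by_columns(h), id_by_delimiters(h))
```

The fixed-column function silently truncates the longer ID, while the delimiter-based one returns it whole. Note that this sketch raises an IndexError on a header with fewer than three '|'-separated fields.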
7.5 years ago by
Emily_Ensembl21k
EMBL-EBI
Emily_Ensembl21k wrote:

You could use a Perl parsing script that identifies each line beginning with >, which will be a FASTA header. You can then split the line into an array, pick out the ID by its position in the array (assuming the ID appears in the same position in every header) and print it. For example:

#!/usr/bin/perl
use strict;
use warnings;

# open the input file
open (FASTA, '<fasta.fa') or die $!;

# move through the lines of the input file, one by one
while (<FASTA>) {

    # look for header lines by finding > at the beginning
    if ( /^>/ ) {

        # get the ID from the header
        my @header = split /\s/, $_;
        my @array = split /\|/, @header[0];
        my $id = @array[2];

        # print ID
        print $id, "\n";

    }
}

# close the file
close FASTA;
modified 7.5 years ago • written 7.5 years ago by Emily_Ensembl21k

@Emily_Ensembl - I seem to be getting a list of "n"s when running this, so I can see it is walking through the file, but it doesn't seem to strip out the ID.

Here is the link to the FASTA file: https://docs.google.com/file/d/0B0iDswLYaZ0zX1RJdGRrRUxiSEk/edit?usp=sharing

written 7.5 years ago by arronslacey280

Fixed above. The line "my @array = split /\|/, @header[0];" was missing a backslash so it was splitting on characters rather than |s.

written 7.5 years ago by Emily_Ensembl21k

There are a couple of things in your script that will cause problems. When you want a single element from an array, remember that it is just a regular scalar, so you would write $array[0] and not @array[0]. The syntax you used is for a slice, which is not what you want (because you have warnings enabled, Perl will tell you that). Assigning a slice to a scalar is also risky because it will usually give you a "Use of uninitialized value ..." warning when you try to use the result (the assignment worked here because the slice has a single element, but it is not good form). You can avoid the issue entirely by printing array or hash elements directly, so there is no need for the extra step.

The last thing is to avoid bare filehandles and to use the 3-argument version of open. These are considered "best practices" because they are safer ways to deal with files. Hope that helps.

modified 7.5 years ago • written 7.5 years ago by SES8.4k
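
As a rough comparison, the same steps (find header lines, split on whitespace, then split the first token on '|') can be sketched in Python, where a with block takes care of closing the file. This is an illustration under the assumption that every header has at least three '|'-separated fields. The name extract_ids and the tiny sample file are made up for the example, and the sequence lines are placeholders.

```python
import os
import tempfile

def extract_ids(path):
    """Yield the third '|'-separated field of each FASTA header, in file order."""
    with open(path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                first_token = line.split()[0]    # text before the first whitespace
                yield first_token.split("|")[2]  # ">gnl|TC-DB|P0A334" gives "P0A334"

# Tiny demo file: the first header is from the question, the second is invented,
# and the sequence lines are placeholders rather than real sequences.
sample = (
    ">gnl|TC-DB|P0A334 1.A.1.1.1 Voltage-gated potassium channel\n"
    "ACDEFGHIK\n"
    ">gnl|TC-DB|X00001 9.Z.9.9.9 Made-up record\n"
    "LMNPQRSTV\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(sample)

ids = list(extract_ids(tmp.name))
os.remove(tmp.name)
print(ids)  # prints ['P0A334', 'X00001']
```

Because the function is a generator, IDs come out in the order the headers appear, which matches what the question asked for.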
7.5 years ago by
Anima Mundi2.8k
Italy
Anima Mundi2.8k wrote:

In Python:

# Print every FASTA header line (lines that start with '>')
with open('inputfile') as fasta:
    for line in fasta:
        if line.startswith('>'):
            print(line, end='')
written 7.5 years ago by Anima Mundi2.8k