Extracting multiple sequences by ID from a FASTA file
5
1
Entering edit mode
7.8 years ago
Allan ▴ 10

Following the instructions as shown in Extracting Multiple Fasta Sequences At A Time From A File Containing Many Sequences for retrieving a list of sequences by ID, the only instance this works correctly is for the last ID in the list file.

Please note BioPerl is not an option (HPCC management teams inability to install correctly).

Given a FASTA file Fec2_001.fa:

>M02471:1:000000000-A6VM6:1:1101:18323:1737 1:N:0:6
GTCACCAAGACCGCGCAGACCGGCGAAATGTATTATTCCTTGCCGCAGCGCATGATTCTGCCGGGCTACAACCCGACCACCAAGGCGCACGGTCGCGTGCT
>M02471:1:000000000-A6VM6:1:1101:14489:1760 1:N:0:6
CACATGCCCTTGTATCAGGAGCTGCGTCGTCGCATGGATGTCGGCGAGTTCGGCCGCATGAATCTCGCTCAGCTCAACTTTGGCAGCTATAAGGAGTACGG
>M02471:1:000000000-A6VM6:1:1101:14668:1789 1:N:0:6
AGCTCTTGAGCTCGCTCGTATCGCCCTTGTCCTTCTTCGTCTTCCAGTAATGACCCGTCACGCTTCCGTCTTGAACCGTATAGGTCAGATCTTCGACGCG

And list of ID's that need to be extracted in BLASTids.txt:

M02471:1:000000000-A6VM6:1:1101:18323:1737 1:N:0:6
M02471:1:000000000-A6VM6:1:1101:14668:1789 1:N:0:6

I've tried the following, but each time the only sequence returned is that matching the last ID. I'm assuming the newline characters are cause of this, but my lack perl programming isn't leading me to any solution.

First:

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' BLAST.txt Fec2_001.fa

Second (from biostar link above):

use warnings;
use strict;

my $lookup = shift @ARGV;  # ID to extract

local $/ = "\n>";  # read by FASTA record

while (my $seq = <>) {
    chomp $seq;
    my ($id) = $seq =~ /^>*(\S+)/;  # parse ID as first word in FASTA header
    if ($id eq $lookup) {
        print "$seq\n";
        last;
    }
}

These two are among several other solutions other's have shown to work. Any guidance is greatly appreciated!

FASTA • 8.1k views
ADD COMMENT
5
Entering edit mode
7.8 years ago
cts ★ 1.7k

Use seqtk which has a subseq command for extracting multiple reads using their headers

ADD COMMENT
0
Entering edit mode

Wonderful. Thank you much!

Note that the characters in my id file after the space were removed, otherwise resulting in only the first base to be printed.

ADD REPLY
0
Entering edit mode

In the fasta sequence format the entry identifier is the first token (non-whitespace "word") on the header line. Any information on the header line after that is assumed to be a description, and is not part of the identifier.

Thus to use standard sequence manipulation tools (e.g. EMBOSS, NCBI BLAST+, etc.) with identifiers that contain whitespace, the whitespace will need to be replaced (typically with '_' or '-').

ADD REPLY
0
Entering edit mode
7.8 years ago
Prakki Rama ★ 2.5k

I tried the same in perl.

open FH,"idlist.txt";

  while(<FH>)
    {
    $_ =~ s/^\s+|\s+$//; #remove leading and trailing space
    $Uniqheader{$_}='0'; #Hash
    }

open HF,"fasta.fa";

  while(<HF>)
    {
    $id=$_;
    $seq=<HF>;
    if($id=~/^\>(.+)\s*$/ && (exists($Uniqheader{$1})))
        {
        print "$id$seq";
        }
    }
    close(FH);
    close(HF);
ADD COMMENT
0
Entering edit mode
7.8 years ago

Have you considered gffread from the Tuxedo guys? http://cufflinks.cbcb.umd.edu/gff.html

ADD COMMENT
0
Entering edit mode

It looks like that requires a gff file, as the name implies, but I could be wrong. I'm not sure that will be helpful in this case since the input is just a list of IDs.

ADD REPLY
0
Entering edit mode
7.8 years ago

In addition to using seqtk, see this post: Pyfaidx: Efficient, "Pythonic" Random Access To Fasta Files Using Samtools-Compatible Indexing. You can use pyfaidx either by importing the `Fasta` class and using it like a dictionary in a Python script, or using the CLI script in the same way as seqtk.

ADD COMMENT
0
Entering edit mode
5.0 years ago
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-f", "--fasta", help="Input file (.fasta format)",  required=True)
parser.add_argument("-i", "--id", help="Seq id",  required=True)
args = parser.parse_args()

for record in SeqIO.parse(args.fasta,'fasta'):
    if record.id == args.id:
        print record.seq
        break

python extract_id_fasta.py -f fastafile.fa -i seq_id

ADD COMMENT

Login before adding your answer.

Traffic: 1871 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6