Question: Extracting multiple sequences by ID from a FASTA file
1
gravatar for Allan
4.9 years ago by
Allan10
United States
Allan10 wrote:

Following the instructions as shown in Extracting Multiple Fasta Sequences At A Time From A File Containing Many Sequences for retrieving a list of sequences by ID, the only instance this works correctly is for the last ID in the list file.

Please note BioPerl is not an option (HPCC management teams inability to install correctly). 

 

Given a FASTA file Fec2_001.fa:

>M02471:1:000000000-A6VM6:1:1101:18323:1737 1:N:0:6
GTCACCAAGACCGCGCAGACCGGCGAAATGTATTATTCCTTGCCGCAGCGCATGATTCTGCCGGGCTACAACCCGACCACCAAGGCGCACGGTCGCGTGCT
>M02471:1:000000000-A6VM6:1:1101:14489:1760 1:N:0:6
CACATGCCCTTGTATCAGGAGCTGCGTCGTCGCATGGATGTCGGCGAGTTCGGCCGCATGAATCTCGCTCAGCTCAACTTTGGCAGCTATAAGGAGTACGG
>M02471:1:000000000-A6VM6:1:1101:14668:1789 1:N:0:6
AGCTCTTGAGCTCGCTCGTATCGCCCTTGTCCTTCTTCGTCTTCCAGTAATGACCCGTCACGCTTCCGTCTTGAACCGTATAGGTCAGATCTTCGACGCG

 

And list of ID's that need to be extracted in BLASTids.txt:

M02471:1:000000000-A6VM6:1:1101:18323:1737 1:N:0:6
M02471:1:000000000-A6VM6:1:1101:14668:1789 1:N:0:6

 

I've tried the following, but each time the only sequence returned is that matching the last ID. I'm assuming the newline characters are cause of this, but my lack perl programming isn't leading me to any solution.

First:

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' BLAST.txt Fec2_001.fa

Second (from biostar link above):

use warnings;
use strict;

my $lookup = shift @ARGV;  # ID to extract

local $/ = "\n>";  # read by FASTA record

while (my $seq = <>) {
    chomp $seq;
    my ($id) = $seq =~ /^>*(\S+)/;  # parse ID as first word in FASTA header
    if ($id eq $lookup) {
        print "$seq\n";
        last;
    }
}

These two are among several other solutions other's have shown to work. Any guidance is greatly appreciated!

 

 

 

 

 

 

fasta • 5.7k views
ADD COMMENTlink modified 2.1 years ago by juan.crescente20 • written 4.9 years ago by Allan10
5
gravatar for cts
4.9 years ago by
cts1.6k
Pasadena
cts1.6k wrote:

Use seqtk which has a subseq command for extracting multiple reads using their headers

ADD COMMENTlink written 4.9 years ago by cts1.6k

Wonderful. Thank you much!

Note that the characters in my id file after the space were removed, otherwise resulting in only the first base to be printed. 

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by Allan10

In the fasta sequence format the entry identifier is the first token (non-whitespace "word") on the header line. Any information on the header line after that is assumed to be a description, and is not part of the identifier.

Thus to use standard sequence manipulation tools (e.g. EMBOSS, NCBI BLAST+, etc.) with identifiers that contain whitespace, the whitespace will need to be replaced (typically with '_' or '-').

 

ADD REPLYlink written 4.9 years ago by hpmcwill1.1k
0
gravatar for Prakki Rama
4.9 years ago by
Prakki Rama2.2k
Singapore
Prakki Rama2.2k wrote:

I tried the same in perl.

open FH,"idlist.txt";

  while(<FH>)
    {
    $_ =~ s/^\s+|\s+$//; #remove leading and trailing space
    $Uniqheader{$_}='0'; #Hash
    }

open HF,"fasta.fa";
  
  while(<HF>)
    {
    $id=$_;
    $seq=<HF>;
    if($id=~/^\>(.+)\s*$/ && (exists($Uniqheader{$1})))
        {
        print "$id$seq";
        }
    }
    close(FH);
    close(HF);

 

 

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Prakki Rama2.2k
0
gravatar for andrew.j.skelton73
4.9 years ago by
London
andrew.j.skelton735.5k wrote:

Have you considered gffread from the Tuxedo guys? http://cufflinks.cbcb.umd.edu/gff.html  

ADD COMMENTlink written 4.9 years ago by andrew.j.skelton735.5k

It looks like that requires a gff file, as the name implies, but I could be wrong. I'm not sure that will be helpful in this case since the input is just a list of IDs.

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by SES8.1k
0
gravatar for Matt Shirley
4.9 years ago by
Matt Shirley8.9k
Cambridge, MA
Matt Shirley8.9k wrote:

In addition to using seqtk, see this post: Pyfaidx: Efficient, "Pythonic" Random Access To Fasta Files Using Samtools-Compatible Indexing. You can use pyfaidx either by importing the `Fasta` class and using it like a dictionary in a Python script, or using the CLI script in the same way as seqtk.

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Matt Shirley8.9k
0
gravatar for juan.crescente
2.1 years ago by
juan.crescente20 wrote:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-f", "--fasta", help="Input file (.fasta format)",  required=True)
parser.add_argument("-i", "--id", help="Seq id",  required=True)
args = parser.parse_args()

for record in SeqIO.parse(args.fasta,'fasta'):
    if record.id == args.id:
        print record.seq
        break

python extract_id_fasta.py -f fastafile.fa -i seq_id

ADD COMMENTlink written 2.1 years ago by juan.crescente20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2489 users visited in the last hour