why different outputs with applying various extracting fasta sequences tool, which one correct!?
0
0
Entering edit mode
8.7 years ago
seta ★ 1.9k

Hi everybody,

I faced with a crazy problem during extraction of desired sequence. I want to extract some sequences within a fasta sequence file based on their header name, the list name contain 86563 header names, one per line. I used tools for this end, but the number of output derived from all tools were different and also not the same with the number of list name (86563). In detail, I used the awk command

awk '
  NR==1{printf $0"\t";next}
  {printf /^>/ ? "\n"$0"\t" : $0}
  ' test.fa | \
    awk -F"\t" '
      BEGIN{while((getline k < "IDs.txt")>0)i[k]=1}
      {gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}
      '

that returned 87812 sequence.

with the python script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Extract sequences from a fasta file if their name is in a 'wanted' file.

Wanted file contains one sequence name per line.

Usage:
    %program <input_file> <wanted_file> <output_file>"""

import sys
import re

try:
    from Bio import SeqIO
except:
    print "This program requires the Biopython library"
    sys.exit(0)

try:
    fasta_file = sys.argv[1]  # Input fasta file
    wanted_file = sys.argv[2] # Input wanted file, one gene name per line
    result_file = sys.argv[3] # Output fasta file
except:
    print __doc__
    sys.exit(0)

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')

with open(result_file, "w") as f:
    for seq in fasta_sequences:
        name = seq.id
        if name in wanted and len(seq.seq.tostring()) > 0:
            wanted.remove(name) # Output only the first appearance for a name
            SeqIO.write([seq], f, "fasta")

I got 60432 sequence.

And with the another script:

#!/bin/bash
#filterbyname in=<infile> out=<outfile>

function usage(){
echo "
Written by Brian Bushnell
Last modified May 28, 2015

Description:  Filters reads by name.

Usage:  filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.
Other parameters and their defaults:

include=f           Set to 'true' to include the filtered names rather than excluding them.
substring=f         Allow one name to be a substring of the other, rather than a full match.
                         f: No substring matching.
                         t: Bidirectional substring matching.
                         header: Allow input read headers to be substrings of names in list.
                         name: Allow names in list to be substrings of input read headers.
casesensitive=t     (case) Match case also.
ow=t                (overwrite) Overwrites files that already exist.
app=f               (append) Append to files that already exist.
zl=4                (ziplevel) Set compression level, 1 (low) to 9 (max).
int=f               (interleaved) Determines whether INPUT file is considered interleaved.
names=              A list of strings or files.  The files can have one name per line, or
                    be a standard read file (fasta, fastq, or sam).
minlen=0            Do not output reads shorter than this.


Java Parameters:
-Xmx               This will be passed to Java to set memory usage, overriding the program's automatic memory detection.
                -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  The max is typically 85% of physical memory.

To read from stdin, set 'in=stdin'.  The format should be specified with an extension, like 'in=stdin.fq.gz'
To write to stdout, set 'out=stdout'.  The format should be specified with an extension, like 'out=stdout.fasta'

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
"
}

pushd . > /dev/null
DIR="${BASH_SOURCE[0]}"
while [ -h "$DIR" ]; do
  cd "$(dirname "$DIR")"
  DIR="$(readlink "$(basename "$DIR")")"
done
cd "$(dirname "$DIR")"
DIR="$(pwd)/"
popd > /dev/null

#DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
CP="$DIR""current/"

z="-Xmx800m"
EA="-ea"
set=0

if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
    usage
    exit
fi

calcXmx () {
    source "$DIR""/calcmem.sh"
    parseXmx "$@"
    if [[ $set == 1 ]]; then
        return
    fi
    freeRam 800m 84
    z="-Xmx${RAM}m"
    z2="-Xms${RAM}m"
}
calcXmx "$@"

function filterbyname() {
    #module unload oracle-jdk
    #module unload samtools
    #module load oracle-jdk/1.7_64bit
    #module load pigz
    #module load samtools
    local CMD="java $EA $z -cp $CP driver.FilterReadsByName $@"
    echo $CMD >&2
    eval $CMD
}

filterbyname "$@"

with the above script, I got 89718 sequences.

Could you please help me why there is a such differences and also why none of scripts didn't give me the right number of sequences, which equal with the number of header name? Thanks so much in advance for your help and solving the issue

alignment sequencing RNA-Seq • 3.8k views
ADD COMMENT
0
Entering edit mode

Could you post output of sort -u IDs.txt | wc -l and head IDs.txt and grep -c "^>" your.fasta and grep "^>" your.fasta | sort -u | wc -l?

ADD REPLY
0
Entering edit mode

Thanks. Yes, of course,

The output of sort -u IDs.txt | wc -l was 60497

head IDs.txt

Contig1
Contig2
Contig3
Contig5
Contig6
Contig7
Contig8
Contig9
Contig10
Contig11

the output of grep -c "^>" file.fasta was 91578 and the output of grep "^>" file.fasta | sort -u | wc -l was 64142.

I'm so grateful for your helpful suggestion.

ADD REPLY
0
Entering edit mode

Well, these numbers probably explain the different outputs, e.g. one approach may fetch all seqs that have e.g. ">Contig1" header while another may only fetch the first seq with ">Contig1" header..

ADD REPLY
0
Entering edit mode

You're right, I also have for example contig1100, contig1101; however, the number of sequences that I got using the above scripts were not the same with the header name list! Could you please let me know what's your suggested approach to get the sequences of interest in this situation?

ADD REPLY
0
Entering edit mode

Are all the sequences with identical headers the same in your fasta file? If not.. you're out of luck.

>Contig1
ATAT
>Contig1
ATTA

How could any algorithm know which one of these you want?

If the names don't matter, you could parse your fasta into "header - tab" format:

Contig1 ATAT
Contig1 ATTA
Contig2 AAAA
Contig3 AAAA

Sort by colum 1, and then sort your ID list also and:

join -1 1 -2 1 -t $'\t' -o 2.1,2.2

And then finally parse the output back to fasta format. This would work assuming that if you have e.g. two Contig1 in your id list, there are also exactly two Contig1 headers in your fasta file (and same rule must apply to all other redundant IDs/headers). Probably not the case, but who knows..

ADD REPLY
0
Entering edit mode

Actually, there is not identical header in my fasta file, there may be similar header, for example I have contig1, contig11, contig1100, contig1101, contig1102, which all begin with 1.

ADD REPLY
1
Entering edit mode

If there were no identical headers in your fasta file, then grep -c "^>" file.fasta (how many headers in file.fasta) and grep "^>" file.fasta | sort -u | wc -l (how many unique headers in file.fasta) would have returned the same number. grep "^>" file.fasta | sort | uniq -c | sort -k1,1gr | head would show some of the most redundant headers and their counts..

ADD REPLY
0
Entering edit mode

That's right, but when I visually see contig header in excel file, there is only for example one contig1. may be the all contigs names that begin with 1 (say contig11, contig1101, contig1102,...) considered identical by scripts and your command, is it right in your view?

ADD REPLY
0
Entering edit mode

No. Contig1 and Contig11 would not be considered identical by uniq and neither would contig1101 and contig1102. Just check the last command I posted. Also, I highly recommend that you stop using excel.

ADD REPLY
0
Entering edit mode

Hi my friend. Thanks for all your comments. that's completely right, there is redundant sequences based on the output of grep "^>" file.fasta | sort | uniq -c | sort -k1,1gr | head, which was:

6 >contig10621
6 >contig26703
6 >contig26988
6 >contig37802
6 >contig38362
6 >contig40056
6 >contig5309
5 >contig10647
5 >contig10801
5 >contig10987

please accept my deep apologize, I'm just student and always learning. Actually, I got the fasta file from my friend, this fasta file is combination of different assemblies at various k-mer that subjected to cd-hit-est tool. In each (single) assembly, the header was named "contig", which could cause problem, and then all assemblies merged together. To solve such a problems, the contigs in each assembly should named with a specific name, which isn't the same with contigs name in other assembly, am I right?

Thanks

ADD REPLY
0
Entering edit mode

No problem. I as much as guessed that merging of assemblies was behind this. Your solution is also correct. Good luck.

ADD REPLY
0
Entering edit mode

Thanks so much my friend.

ADD REPLY

Login before adding your answer.

Traffic: 1855 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6