why different outputs with applying various extracting fasta sequences tool, which one correct!?
0
0
Entering edit mode
7.6 years ago
seta ★ 1.8k

Hi everybody,

I faced with a crazy problem during extraction of desired sequence. I want to extract some sequences within a fasta sequence file based on their header name, the list name contain 86563 header names, one per line. I used tools for this end, but the number of output derived from all tools were different and also not the same with the number of list name (86563). In detail, I used the awk command

awk '
NR==1{printf $0"\t";next} {printf /^>/ ? "\n"$0"\t" : $0} ' test.fa | \ awk -F"\t" ' BEGIN{while((getline k < "IDs.txt")>0)i[k]=1} {gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}} ' that returned 87812 sequence. with the python script: #!/usr/bin/env python # -*- coding: utf-8 -*- """Extract sequences from a fasta file if their name is in a 'wanted' file. Wanted file contains one sequence name per line. Usage: %program <input_file> <wanted_file> <output_file>""" import sys import re try: from Bio import SeqIO except: print "This program requires the Biopython library" sys.exit(0) try: fasta_file = sys.argv[1] # Input fasta file wanted_file = sys.argv[2] # Input wanted file, one gene name per line result_file = sys.argv[3] # Output fasta file except: print __doc__ sys.exit(0) wanted = set() with open(wanted_file) as f: for line in f: line = line.strip() if line != "": wanted.add(line) fasta_sequences = SeqIO.parse(open(fasta_file),'fasta') with open(result_file, "w") as f: for seq in fasta_sequences: name = seq.id if name in wanted and len(seq.seq.tostring()) > 0: wanted.remove(name) # Output only the first appearance for a name SeqIO.write([seq], f, "fasta") I got 60432 sequence. And with the another script: #!/bin/bash #filterbyname in=<infile> out=<outfile> function usage(){ echo " Written by Brian Bushnell Last modified May 28, 2015 Description: Filters reads by name. Usage: filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f> in2 and out2 are for paired reads and are optional. If input is paired and there is only one output file, it will be written interleaved. Other parameters and their defaults: include=f Set to 'true' to include the filtered names rather than excluding them. substring=f Allow one name to be a substring of the other, rather than a full match. f: No substring matching. t: Bidirectional substring matching. header: Allow input read headers to be substrings of names in list. name: Allow names in list to be substrings of input read headers. casesensitive=t (case) Match case also. ow=t (overwrite) Overwrites files that already exist. app=f (append) Append to files that already exist. zl=4 (ziplevel) Set compression level, 1 (low) to 9 (max). int=f (interleaved) Determines whether INPUT file is considered interleaved. names= A list of strings or files. The files can have one name per line, or be a standard read file (fasta, fastq, or sam). minlen=0 Do not output reads shorter than this. Java Parameters: -Xmx This will be passed to Java to set memory usage, overriding the program's automatic memory detection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. To read from stdin, set 'in=stdin'. The format should be specified with an extension, like 'in=stdin.fq.gz' To write to stdout, set 'out=stdout'. The format should be specified with an extension, like 'out=stdout.fasta' Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems. " } pushd . > /dev/null DIR="${BASH_SOURCE[0]}"
while [ -h "$DIR" ]; do cd "$(dirname "$DIR")" DIR="$(readlink "$(basename "$DIR")")"
done
cd "$(dirname "$DIR")"
DIR="$(pwd)/" popd > /dev/null #DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
CP="$DIR""current/" z="-Xmx800m" EA="-ea" set=0 if [ -z "$1" ] || [[ $1 == -h ]] || [[$1 == --help ]]; then
usage
exit
fi

calcXmx () {
source "$DIR""/calcmem.sh" parseXmx "$@"
if [[ $set == 1 ]]; then return fi freeRam 800m 84 z="-Xmx${RAM}m"
z2="-Xms${RAM}m" } calcXmx "$@"

function filterbyname() {
local CMD="java $EA$z -cp $CP driver.FilterReadsByName$@"
echo $CMD >&2 eval$CMD
}

filterbyname "@" with the above script, I got 89718 sequences. Could you please help me why there is a such differences and also why none of scripts didn't give me the right number of sequences, which equal with the number of header name? Thanks so much in advance for your help and solving the issue alignment sequencing RNA-Seq • 3.3k views ADD COMMENT 0 Entering edit mode Could you post output of sort -u IDs.txt | wc -l and head IDs.txt and grep -c "^>" your.fasta and grep "^>" your.fasta | sort -u | wc -l? ADD REPLY 0 Entering edit mode Thanks. Yes, of course, The output of sort -u IDs.txt | wc -l was 60497 head IDs.txt Contig1 Contig2 Contig3 Contig5 Contig6 Contig7 Contig8 Contig9 Contig10 Contig11 the output of grep -c "^>" file.fasta was 91578 and the output of grep "^>" file.fasta | sort -u | wc -l was 64142. I'm so grateful for your helpful suggestion. ADD REPLY 0 Entering edit mode Well, these numbers probably explain the different outputs, e.g. one approach may fetch all seqs that have e.g. ">Contig1" header while another may only fetch the first seq with ">Contig1" header.. ADD REPLY 0 Entering edit mode You're right, I also have for example contig1100, contig1101; however, the number of sequences that I got using the above scripts were not the same with the header name list! Could you please let me know what's your suggested approach to get the sequences of interest in this situation? ADD REPLY 0 Entering edit mode Are all the sequences with identical headers the same in your fasta file? If not.. you're out of luck. >Contig1 ATAT >Contig1 ATTA How could any algorithm know which one of these you want? If the names don't matter, you could parse your fasta into "header - tab" format: Contig1 ATAT Contig1 ATTA Contig2 AAAA Contig3 AAAA Sort by colum 1, and then sort your ID list also and: join -1 1 -2 1 -t'\t' -o 2.1,2.2

And then finally parse the output back to fasta format. This would work assuming that if you have e.g. two Contig1 in your id list, there are also exactly two Contig1 headers in your fasta file (and same rule must apply to all other redundant IDs/headers). Probably not the case, but who knows..

0
Entering edit mode

Actually, there is not identical header in my fasta file, there may be similar header, for example I have contig1, contig11, contig1100, contig1101, contig1102, which all begin with 1.

1
Entering edit mode

If there were no identical headers in your fasta file, then grep -c "^>" file.fasta (how many headers in file.fasta) and grep "^>" file.fasta | sort -u | wc -l (how many unique headers in file.fasta) would have returned the same number. grep "^>" file.fasta | sort | uniq -c | sort -k1,1gr | head would show some of the most redundant headers and their counts..

0
Entering edit mode

That's right, but when I visually see contig header in excel file, there is only for example one contig1. may be the all contigs names that begin with 1 (say contig11, contig1101, contig1102,...) considered identical by scripts and your command, is it right in your view?

0
Entering edit mode

No. Contig1 and Contig11 would not be considered identical by uniq and neither would contig1101 and contig1102. Just check the last command I posted. Also, I highly recommend that you stop using excel.

0
Entering edit mode

Hi my friend. Thanks for all your comments. that's completely right, there is redundant sequences based on the output of grep "^>" file.fasta | sort | uniq -c | sort -k1,1gr | head, which was:

6 >contig10621
6 >contig26703
6 >contig26988
6 >contig37802
6 >contig38362
6 >contig40056
6 >contig5309
5 >contig10647
5 >contig10801
5 >contig10987

please accept my deep apologize, I'm just student and always learning. Actually, I got the fasta file from my friend, this fasta file is combination of different assemblies at various k-mer that subjected to cd-hit-est tool. In each (single) assembly, the header was named "contig", which could cause problem, and then all assemblies merged together. To solve such a problems, the contigs in each assembly should named with a specific name, which isn't the same with contigs name in other assembly, am I right?

Thanks

0
Entering edit mode

No problem. I as much as guessed that merging of assemblies was behind this. Your solution is also correct. Good luck.

0
Entering edit mode

Thanks so much my friend.