Question: Useful Bash Commands To Handle Fasta Files
20
gravatar for Anima Mundi
5.8 years ago by
Anima Mundi2.2k
Italy
Anima Mundi2.2k wrote:

Hello,

starting from this question, I realized that the proper usage of bash commands to handle FASTA files* could be, for those (like me) not proficient with the usage of the terminal, a difficult task. Also, I feel it is important to learn how to use them correctly. Could you point me out what are, in your personal experience, the most important commands useful in FASTA lists manipulation? Possibly, I would prioritize commands wich are easy to use and, possibly, versatile.

If you want, I agree to treat the topic as a community wiki.

*Extract IDs, remove certain sequences, edit descriptions, listing only the sequences that start with A and similar.

fasta list command-line bash • 49k views
ADD COMMENTlink modified 5 weeks ago by jrj.healey2.9k • written 5.8 years ago by Anima Mundi2.2k
4

cat, grep, sed, cut, sort, uniq, tr, awk, paste, Redirections (<, >, >>, ...)

ADD REPLYlink written 5.8 years ago by toni2.1k
1

You can certainly do a lot of FASTA file processing (and any other text file) in the shell. However, I'd recommend learning at least one of the Bio* libraries (e.g. Bioperl's Seq::IO) for more versatile solutions.

ADD REPLYlink written 5.8 years ago by Neilfws47k
1

I like very much to have a repository of common and useful bash scripts/commands. I suggest to extend this repository to more topics like VCF, BED manipulations, etc.

ADD REPLYlink written 5.8 years ago by Pascal1.4k

Thanks! 

ADD REPLYlink written 2.2 years ago by hbai5210
33
gravatar for Damian Kao
5.8 years ago by
Damian Kao14k
USA
Damian Kao14k wrote:

My top used bash commands for fasta files:

(1) counting number of sequences in a fasta file:

grep -c "^>" file.fa

(2) add something to end of all header lines:

sed 's/>.*/&WHATEVERYOUWANT/' file.fa > outfile.fa

(3) clean up a fasta file so only first column of the header is outputted:

awk '{print $1}' file.fa > output.fa

Here are some other cool ones: http://chrisduran.eu/bioinformatics/linux-and-osx-commands-for-working-with-fasta-files/

ADD COMMENTlink modified 5.8 years ago by Neilfws47k • written 5.8 years ago by Damian Kao14k
4

First example probably my all-time favourite.

ADD REPLYlink written 5.8 years ago by Neilfws47k
wc -l file.fa

or

wc file.fa

in the first column, you get the line count. divide by 2 if fasta, 4 if fastq.

This probably way faster than the grep approach.

ADD REPLYlink written 11 months ago by Echo70
1

This is good but wont work for the odd fasta file where the sequence is wrapped onto a newline after a certain number of characters as some programs like to return.

ADD REPLYlink written 11 months ago by Sentinel15690
1

This also is interesting: what if I would like to replace the "WHATEVERYOUWANT" with a variable with a different value for each sequence?

ADD REPLYlink written 5.8 years ago by Anima Mundi2.2k

do you know any tools for counting the actual sequence , ie the characters in a single sequence , a cds file, a protein string

ADD REPLYlink written 2.8 years ago by vigneshprbh3710
13
gravatar for Frédéric Mahé
5.8 years ago by
Kaiserslautern, Germany
Frédéric Mahé2.7k wrote:

You will probably get a lot of different answers because there are many ways to parse fasta files with Bash and tools like grep, awk and sed. Here are some suggestions.

To extract ids, just use the following:

grep -o -E "^>\w+" file.fasta | tr -d ">"

A useful step is to linearize your sequences (i.e. remove the sequence wrapping). This is not a perfect solution, as I suspect that a few steps could be avoided, but it works quite fast, even for thousands of sequences.

sed -e 's/\(^>.*$\)/#\1#/' file.fasta | tr -d "\r" | tr -d "\n" | sed -e 's/$/#/' | tr "#" "\n" | sed -e '/^$/d'

Remove duplicated sequences. Pierre Lindenbaum proposed this solution.

sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -t $'\t' -f -k 2,2  | sed -e 's/^/>/' -e 's/\t/\n/'

Once you've done that, it's easier to do things such as "list only the sequences that start with A", with grep for instance.

ADD COMMENTlink written 5.8 years ago by Frédéric Mahé2.7k

Interesting, also this should work to list only the sequences that start with A: awk 'sub(/^A/, ">1\nA")'

ADD REPLYlink written 5.8 years ago by Anima Mundi2.2k

Also useful after linearizing your sequences and losing the original file is the fold command to re-wrap a text file.

ADD REPLYlink written 3.8 years ago by Matt Shirley8.0k
3
gravatar for Martin A Hansen
5.8 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

Have a look at Biopieces www.biopieces.org).

ADD COMMENTlink written 5.8 years ago by Martin A Hansen3.0k
2
gravatar for Chun-Jie Liu
11 months ago by
Chun-Jie Liu230
US, Houston
Chun-Jie Liu230 wrote:

I write some general scripts for handle NGS data. I hope it can help you. https://github.com/chunjie-sam-liu/useful-scripts

ADD COMMENTlink written 11 months ago by Chun-Jie Liu230
1
gravatar for nim.1111.ou
15 months ago by
nim.1111.ou10
Taiwan
nim.1111.ou10 wrote:

ive been working on bioinformatics for over three years, here are lots of frequently used bash commands https://github.com/onceupon/Bash-Oneliner/blob/master/README.md

ADD COMMENTlink modified 10 months ago • written 15 months ago by nim.1111.ou10
1
gravatar for vivekananthrp
11 months ago by
vivekananthrp10 wrote:

Linearizing the complete fasta file can be done using,

while read line;do if [ "${line:0:1}" == ">" ]; then echo -e "\n"$line; else echo $line | tr -d '\n' ; fi; done < input.fasta > output.fasta

Once linearized, say you want pick the sequence for the id 'Q15049' you can use

grep -A1 'Q15049' output.fasta

This will give you the header and sequence of the id.

Hope its useful

ADD COMMENTlink written 11 months ago by vivekananthrp10
1
gravatar for Vijay Lakhujani
5 weeks ago by
Vijay Lakhujani1.3k
India
Vijay Lakhujani1.3k wrote:

Bash one liner to get the A T (or U) G C counts for all the sequences in a multi fasta file

echo -e "seq_id\tA\tU\tG\tC"; while read line; do echo $line | grep ">" | sed 's/>//g'; for i in A U G C;do echo $line | grep -v ">" | grep -o $i | wc -l | grep -v "^0"; done; done < your_fasta_file.fa | paste - - - - -

Example

$echo -e "seq_id\tA\tT\tG\tC"; while read line; do echo $line | grep ">" | sed 's/>//g'; for i in A T G C;do echo $line | grep -v ">" | grep -o $i | wc -l | grep -v "^0"; done; done < adapter.fasta | paste - - - - -

Output

seq_id  A   T   G   C
adapter1    7   8   7   8
adapter2    4   3   3   2
ADD COMMENTlink written 5 weeks ago by Vijay Lakhujani1.3k
1
gravatar for jrj.healey
5 weeks ago by
jrj.healey2.9k
United Kingdom
jrj.healey2.9k wrote:

Couple I stick in my .bashrc for quick and dirty work (they all print to screen so they can be piped):


Return the lengths of all the sequences in a multifasta

falens(){
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' $1
}

*Disclaimer - I think I stole this off the internet somewhere.


Remove duplicated fastas in a multifasta:

dedupe(){
cat $1 | awk '!_[$0]++'
}

Merge a multifasta into a single fasta sequence (remove all but the first header, and deal with newlines):

fastcat(){
cat $1 | sed -e '1!{/^>.*/d;}' | sed  ':a;N;$!ba;s/\n//2g'
}

Split a multifasta in to separate files (with arbitrary names). This was my pure bash answer to a biostars post some time ago (doesn't print to screen, just makes files in current dir).

splitfa(){
i=1;
while read line ; do
  if [ ${line:0:1} == ">" ] ; then
    header="$line"
    echo "$header" >> seq"${i}".fasta
  else
    seq="$line"
    echo "$seq" >> seq"${i}".fasta
    ((i++))
  fi
done < $1
}
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by jrj.healey2.9k
  1. Save all these quick scripts in a file under, say, ~/.myShellScripts/fasta_utils.sh
  2. Add source ~/.myShellScripts/fasta_utils.sh to ~/.${SHELL}rc
  3. Better modular maintenance :-)

I also save everything in ~/.zshrc, but I love how oh-my-zsh does stuff. They complicate it too much though and I have to unset their stuff before I set mine.

ADD REPLYlink written 5 weeks ago by Ram12k

That's quite a nice way to do it, in actual fact I lied slightly, and these functions are in a file called .bash_functions which .bashrc then sources directly.

ADD REPLYlink written 5 weeks ago by jrj.healey2.9k
0
gravatar for shenwei356
11 months ago by
shenwei3563.4k
China
shenwei3563.4k wrote:

Tool: FASTA and FASTQ tools

ADD COMMENTlink written 11 months ago by shenwei3563.4k
0
gravatar for Kevin Blighe
10 weeks ago by
Kevin Blighe7.3k
Republic of Ireland (Éire)
Kevin Blighe7.3k wrote:

Gets the mirror image of a FASTA file (flips the sequence)

#Title:     MirrorFASTA
#Details:   Get the 'mirror' sequences of a FASTA file
#Usage:     MirrorFASTA.sh [INPUT] [OUTPUT]
#Author:     Kevin Blighe
#Date:      9th March 2016

if [ $# -ne 2 ] ;
then
    echo "Illegal number of parameters"
    echo "Usage:   MirrorFASTA.sh [INPUT] [OUTPUT]"
    exit 1
fi

count=0

touch $2

while read line           
do
    ((count+=1))

    if ! (($count%2)) ;
    then
        echo $line | rev >> $2
    else
        echo $line >> $2
    fi
done < $1
ADD COMMENTlink written 10 weeks ago by Kevin Blighe7.3k
1

you mean reverse sequence? try seqkit seq -r

ADD REPLYlink written 10 weeks ago by shenwei3563.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 703 users visited in the last hour