Question: Useful Bash Commands To Handle Fasta Files
20
gravatar for Anima Mundi
5.6 years ago by
Anima Mundi2.1k
Italy
Anima Mundi2.1k wrote:

Hello,

starting from this question, I realized that the proper usage of bash commands to handle FASTA files* could be, for those (like me) not proficient with the usage of the terminal, a difficult task. Also, I feel it is important to learn how to use them correctly. Could you point me out what are, in your personal experience, the most important commands useful in FASTA lists manipulation? Possibly, I would prioritize commands wich are easy to use and, possibly, versatile.

If you want, I agree to treat the topic as a community wiki.

*Extract IDs, remove certain sequences, edit descriptions, listing only the sequences that start with A and similar.

fasta list command-line bash • 46k views
ADD COMMENTlink modified 11 days ago by Kevin Blighe1.1k • written 5.6 years ago by Anima Mundi2.1k
4

cat, grep, sed, cut, sort, uniq, tr, awk, paste, Redirections (<, >, >>, ...)

ADD REPLYlink written 5.6 years ago by toni2.1k
1

You can certainly do a lot of FASTA file processing (and any other text file) in the shell. However, I'd recommend learning at least one of the Bio* libraries (e.g. Bioperl's Seq::IO) for more versatile solutions.

ADD REPLYlink written 5.6 years ago by Neilfws47k
1

I like very much to have a repository of common and useful bash scripts/commands. I suggest to extend this repository to more topics like VCF, BED manipulations, etc.

ADD REPLYlink written 5.6 years ago by Pascal1.4k

Thanks! 

ADD REPLYlink written 2.0 years ago by hbai5210
30
gravatar for Damian Kao
5.6 years ago by
Damian Kao14k
UK
Damian Kao14k wrote:

My top used bash commands for fasta files:

(1) counting number of sequences in a fasta file:

grep -c "^>" file.fa

(2) add something to end of all header lines:

sed 's/>.*/&WHATEVERYOUWANT/' file.fa > outfile.fa

(3) clean up a fasta file so only first column of the header is outputted:

awk '{print $1}' file.fa > output.fa

Here are some other cool ones: http://chrisduran.eu/bioinformatics/linux-and-osx-commands-for-working-with-fasta-files/

ADD COMMENTlink modified 5.6 years ago by Neilfws47k • written 5.6 years ago by Damian Kao14k
4

First example probably my all-time favourite.

ADD REPLYlink written 5.6 years ago by Neilfws47k
wc -l file.fa

or

wc file.fa

in the first column, you get the line count. divide by 2 if fasta, 4 if fastq.

This probably way faster than the grep approach.

ADD REPLYlink written 9 months ago by Echo70

This is good but wont work for the odd fasta file where the sequence is wrapped onto a newline after a certain number of characters as some programs like to return.

ADD REPLYlink written 9 months ago by Sentinel15680
1

This also is interesting: what if I would like to replace the "WHATEVERYOUWANT" with a variable with a different value for each sequence?

ADD REPLYlink written 5.6 years ago by Anima Mundi2.1k

do you know any tools for counting the actual sequence , ie the characters in a single sequence , a cds file, a protein string

ADD REPLYlink written 2.6 years ago by vigneshprbh3710
12
gravatar for Frédéric Mahé
5.6 years ago by
Kaiserslautern, Germany
Frédéric Mahé2.6k wrote:

You will probably get a lot of different answers because there are many ways to parse fasta files with Bash and tools like grep, awk and sed. Here are some suggestions.

To extract ids, just use the following:

grep -o -E "^>\w+" file.fasta | tr -d ">"

A useful step is to linearize your sequences (i.e. remove the sequence wrapping). This is not a perfect solution, as I suspect that a few steps could be avoided, but it works quite fast, even for thousands of sequences.

sed -e 's/\(^>.*$\)/#\1#/' file.fasta | tr -d "\r" | tr -d "\n" | sed -e 's/$/#/' | tr "#" "\n" | sed -e '/^$/d'

Remove duplicated sequences. Pierre Lindenbaum proposed this solution.

sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -t $'\t' -f -k 2,2  | sed -e 's/^/>/' -e 's/\t/\n/'

Once you've done that, it's easier to do things such as "list only the sequences that start with A", with grep for instance.

ADD COMMENTlink written 5.6 years ago by Frédéric Mahé2.6k

Interesting, also this should work to list only the sequences that start with A: awk 'sub(/^A/, ">1\nA")'

ADD REPLYlink written 5.6 years ago by Anima Mundi2.1k

Also useful after linearizing your sequences and losing the original file is the fold command to re-wrap a text file.

ADD REPLYlink written 3.6 years ago by Matt Shirley7.8k
3
gravatar for Martin A Hansen
5.6 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

Have a look at Biopieces www.biopieces.org).

ADD COMMENTlink written 5.6 years ago by Martin A Hansen3.0k
2
gravatar for Chun-Jie Liu
9 months ago by
Chun-Jie Liu230
US, Houston
Chun-Jie Liu230 wrote:

I write some general scripts for handle NGS data. I hope it can help you. https://github.com/chunjie-sam-liu/useful-scripts

ADD COMMENTlink written 9 months ago by Chun-Jie Liu230
1
gravatar for nim.1111.ou
13 months ago by
nim.1111.ou10
Taiwan
nim.1111.ou10 wrote:

ive been working on bioinformatics for over three years, here are lots of frequently used bash commands https://github.com/onceupon/Bash-Oneliner/blob/master/README.md

ADD COMMENTlink modified 8 months ago • written 13 months ago by nim.1111.ou10
1
gravatar for vivekananthrp
9 months ago by
vivekananthrp10 wrote:

Linearizing the complete fasta file can be done using,

while read line;do if [ "${line:0:1}" == ">" ]; then echo -e "\n"$line; else echo $line | tr -d '\n' ; fi; done < input.fasta > output.fasta

Once linearized, say you want pick the sequence for the id 'Q15049' you can use

grep -A1 'Q15049' output.fasta

This will give you the header and sequence of the id.

Hope its useful

ADD COMMENTlink written 9 months ago by vivekananthrp10
0
gravatar for shenwei356
9 months ago by
shenwei3563.3k
China
shenwei3563.3k wrote:

Tool: FASTA and FASTQ tools

ADD COMMENTlink written 9 months ago by shenwei3563.3k
0
gravatar for Kevin Blighe
11 days ago by
Kevin Blighe1.1k
Republic of Ireland
Kevin Blighe1.1k wrote:

Gets the mirror image of a FASTA file (flips the sequence)

#Title:     MirrorFASTA
#Details:   Get the 'mirror' sequences of a FASTA file
#Usage:     MirrorFASTA.sh [INPUT] [OUTPUT]
#Author:     Kevin Blighe
#Date:      9th March 2016

if [ $# -ne 2 ] ;
then
    echo "Illegal number of parameters"
    echo "Usage:   MirrorFASTA.sh [INPUT] [OUTPUT]"
    exit 1
fi

count=0

touch $2

while read line           
do
    ((count+=1))

    if ! (($count%2)) ;
    then
        echo $line | rev >> $2
    else
        echo $line >> $2
    fi
done < $1
ADD COMMENTlink written 11 days ago by Kevin Blighe1.1k

you mean reverse sequence? try seqkit seq -r

ADD REPLYlink written 11 days ago by shenwei3563.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1388 users visited in the last hour