Output

Question

Useful Bash Commands To Handle Fasta Files

49

Entering edit mode

12.6 years ago

Anima Mundi ★ 2.9k

Hello,

Starting from this question, I realized that the proper usage of bash commands to handle FASTA files* could be, for those (like me) not proficient with the usage of the terminal, a difficult task. Also, I feel it is important to learn how to use them correctly. Could you point me out what are, in your personal experience, the most important commands useful in FASTA lists manipulation? Possibly, I would prioritize commands wich are easy to use and, possibly, versatile.

If you want, I agree to treat the topic as a community wiki.

*Extract IDs, remove certain sequences, edit descriptions, listing only the sequences that start with A and similar.

command-line fasta bash • 142k views

ADD COMMENT • link updated 11 days ago by Ram 44k • written 12.6 years ago by Anima Mundi ★ 2.9k

4

Entering edit mode

cat, grep, sed, cut, sort, uniq, tr, awk, paste, Redirections (<, >, >>, ...)

ADD REPLY • link 12.6 years ago by toni ★ 2.2k

1

Entering edit mode

You can certainly do a lot of FASTA file processing (and any other text file) in the shell. However, I'd recommend learning at least one of the Bio* libraries (e.g. Bioperl's Seq::IO) for more versatile solutions.

ADD REPLY • link 12.6 years ago by Neilfws 49k

1

Entering edit mode

I like very much to have a repository of common and useful bash scripts/commands. I suggest to extend this repository to more topics like VCF, BED manipulations, etc.

ADD REPLY • link 12.6 years ago by Pascal ★ 1.5k

0

Entering edit mode

Thanks!

ADD REPLY • link updated 2.1 years ago by Ram 44k • written 9.0 years ago by hbai521 • 0

Ram · Answer 1 · 2012-02-22

44

Entering edit mode

12.6 years ago

Damian Kao 16k

My top used bash commands for fasta files:

(1) counting number of sequences in a fasta file:

grep -c "^>" file.fa

(2) add something to end of all header lines:

sed 's/>.*/&WHATEVERYOUWANT/' file.fa > outfile.fa

(3) clean up a fasta file so only first column of the header is outputted:

awk '{print $1}' file.fa > output.fa

Here are some other cool ones: http://chrisduran.eu/bioinformatics/linux-and-osx-commands-for-working-with-fasta-files/

ADD COMMENT • link updated 6.1 years ago by Ram 44k • written 12.6 years ago by Damian Kao 16k

5

Entering edit mode

First example probably my all-time favourite.

ADD REPLY • link 12.6 years ago by Neilfws 49k

0

Entering edit mode

wc -l file.fa

or

wc file.fa

in the first column, you get the line count. divide by 2 if fasta, 4 if fastq.

This probably way faster than the grep approach.

ADD REPLY • link 7.8 years ago by Echo ▴ 70

5

Entering edit mode

This is good but wont work for the odd fasta file where the sequence is wrapped onto a newline after a certain number of characters as some programs like to return.

ADD REPLY • link 7.8 years ago by Sentinel156 ▴ 190

1

Entering edit mode

This also is interesting: what if I would like to replace the "WHATEVERYOUWANT" with a variable with a different value for each sequence?

ADD REPLY • link 12.6 years ago by Anima Mundi ★ 2.9k

0

Entering edit mode

do you know any tools for counting the actual sequence , ie the characters in a single sequence , a cds file, a protein string

ADD REPLY • link 9.7 years ago by vigneshprbh37 ▴ 30

0

Entering edit mode

yes, seqmagick, see my answer below

ADD REPLY • link 3.4 years ago by jena ▴ 320

Ram · Answer 2 · 2012-02-22

24

Entering edit mode

12.6 years ago

Frédéric Mahé ★ 3.2k

You will probably get a lot of different answers because there are many ways to parse fasta files with Bash and tools like grep, awk and sed. Here are some suggestions.

To extract ids, just use the following:

grep -o -E "^>\w+" file.fasta | tr -d ">"

A useful step is to linearize your sequences (i.e. remove the sequence wrapping). This is not a perfect solution, as I suspect that a few steps could be avoided, but it works quite fast, even for thousands of sequences.

sed -e 's/\(^>.*$\)/#\1#/' file.fasta | tr -d "\r" | tr -d "\n" | sed -e 's/$/#/' | tr "#" "\n" | sed -e '/^$/d'

Remove duplicated sequences. Pierre Lindenbaum proposed this solution.

sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -t $'\t' -f -k 2,2  | sed -e 's/^/>/' -e 's/\t/\n/'

Once you've done that, it's easier to do things such as "list only the sequences that start with A", with grep for instance.

ADD COMMENT • link updated 6.1 years ago by Ram 44k • written 12.6 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

Interesting, also this should work to list only the sequences that start with A: awk 'sub(/^A/, ">1\nA")'

ADD REPLY • link 12.6 years ago by Anima Mundi ★ 2.9k

0

Entering edit mode

Also useful after linearizing your sequences and losing the original file is the fold command to re-wrap a text file.

ADD REPLY • link 10.7 years ago by Matt Shirley 10k

score 4 · Answer 3 · 2017-10-14

4

Entering edit mode

7.0 years ago

lakhujanivijay 5.9k

Bash one liner to get the A T (or U) G C counts for all the sequences in a multi fasta file

echo -e "seq_id\tA\tU\tG\tC"; while read line; do echo $line | grep ">" | sed 's/>//g'; for i in A U G C;do echo $line | grep -v ">" | grep -o $i | wc -l | grep -v "^0"; done; done < your_fasta_file.fa | paste - - - - -

Example

$echo -e "seq_id\tA\tT\tG\tC"; while read line; do echo $line | grep ">" | sed 's/>//g'; for i in A T G C;do echo $line | grep -v ">" | grep -o $i | wc -l | grep -v "^0"; done; done < adapter.fasta | paste - - - - -

Output

seq_id  A   T   G   C
adapter1    7   8   7   8
adapter2    4   3   3   2

ADD COMMENT • link 7.0 years ago by lakhujanivijay 5.9k

0

Entering edit mode

This answer will show wrong AUGC count number when one of your sequence ( in your multiple sequence FASTA file) is not with all four characters, AUGC.

For example,

> seq_1
AUGC
> seq_2
AU
> seq_3
G

Then the command line may not be working.

ADD REPLY • link 3.7 years ago by SheelS ▴ 40

score 3 · Answer 4 · 2012-02-23

3

Entering edit mode

12.6 years ago

Martin A Hansen 3.0k

Have a look at Biopieces www.biopieces.org).

ADD COMMENT • link 12.6 years ago by Martin A Hansen 3.0k

score 3 · Answer 5 · 2016-12-13

3

Entering edit mode

7.8 years ago

Chun-Jie Liu ▴ 280

I write some general scripts for handle NGS data. I hope it can help you. https://github.com/chunjie-sam-liu/useful-scripts

ADD COMMENT • link 7.8 years ago by Chun-Jie Liu ▴ 280

score 2 · Answer 6 · 2016-08-08

2

Entering edit mode

8.2 years ago

nim.1111.ou ▴ 180

ive been working on bioinformatics for over three years, here are lots of frequently used bash commands https://github.com/onceupon/Bash-Oneliner/blob/master/README.md

ADD COMMENT • link 7.8 years ago by nim.1111.ou ▴ 180

Ram · Answer 7 · 2016-12-13

2

Entering edit mode

7.8 years ago

vivekananthrp ▴ 20

Linearizing the complete fasta file can be done using,

while read line;do if [ "${line:0:1}" == ">" ]; then echo -e "\n"$line; else echo $line | tr -d '\n' ; fi; done < input.fasta > output.fasta

Once linearized, say you want pick the sequence for the id Q15049 you can use

grep -A1 'Q15049' output.fasta

This will give you the header and sequence of the id.

Hope its useful

ADD COMMENT • link updated 6.1 years ago by Ram 44k • written 7.8 years ago by vivekananthrp ▴ 20

0

Entering edit mode

just two comments.

First, you can remove the initial empty lines piping sed '/^--$/d' so the whole code ends up like

while read line; do
  if [ "${line:0:1}" == ">" ]; then 
    echo -e "\n"$line;
  else 
    echo $line | tr -d '\n' ; 
  fi; 
done < input.fasta | sed '/^--$/d' > output.fasta

And second, the previous code works great but the following awk version is much faster

awk '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}}' input.fasta > output.fasta

ADD REPLY • link 4.4 years ago by alfonsosaera ▴ 50

score 1 · Answer 8 · 2016-12-13

1

Entering edit mode

7.8 years ago

shenwei356 8.6k

Tool: FASTA and FASTQ tools

ADD COMMENT • link 7.8 years ago by shenwei356 8.6k

score 1 · Answer 9 · 2017-10-14

1

Entering edit mode

7.0 years ago

Joe 21k

Couple I stick in my .bashrc for quick and dirty work (they all print to screen so they can be piped):

Return the lengths of all the sequences in a multifasta

falens(){
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' $1
}

*Disclaimer - I think I stole this off the internet somewhere.

Remove duplicated fastas in a multifasta:

dedupe(){
cat $1 | awk '!_[$0]++'
}

Merge a multifasta into a single fasta sequence (remove all but the first header, and deal with newlines):

fastcat(){
cat $1 | sed -e '1!{/^>.*/d;}' | sed  ':a;N;$!ba;s/\n//2g'
}

Split a multifasta in to separate files (with arbitrary names). This was my pure bash answer to a biostars post some time ago (doesn't print to screen, just makes files in current dir).

splitfa(){
i=1;
while read line ; do
  if [ ${line:0:1} == ">" ] ; then
    header="$line"
    echo "$header" >> seq"${i}".fasta
  else
    seq="$line"
    echo "$seq" >> seq"${i}".fasta
    ((i++))
  fi
done < $1
}

ADD COMMENT • link 7.0 years ago by Joe 21k

0

Entering edit mode

Save all these quick scripts in a file under, say, ~/.myShellScripts/fasta_utils.sh
Add source ~/.myShellScripts/fasta_utils.sh to ~/.${SHELL}rc
Better modular maintenance :-)

I also save everything in ~/.zshrc, but I love how oh-my-zsh does stuff. They complicate it too much though and I have to unset their stuff before I set mine.

ADD REPLY • link 7.0 years ago by Ram 44k

0

Entering edit mode

That's quite a nice way to do it, in actual fact I lied slightly, and these functions are in a file called .bash_functions which .bashrc then sources directly.

ADD REPLY • link 7.0 years ago by Joe 21k

0

Entering edit mode

Joe you just add the dir path to .bashrc or you did something else? Thanks

ADD REPLY • link 3.7 years ago by schlogl ▴ 160

1

Entering edit mode

Yeah, it uses Ram's approach and sources the file from within bashrc essentially

ADD REPLY • link 3.7 years ago by Joe 21k

0

Entering edit mode

Hello there. I learned a lot with this code. Thank you! I noticed that the created fasta files only have the first line of the sequence, though. Here goes a small update of splitfa, if I may:

splitfa(){
i=0;
while read line ; do
  if [ ${line:0:1} == ">" ] ; then
    ((i++))
    header="$line"
    echo "$header" >> seq"${i}".fasta
  else
    seq="$line"
    echo "$seq" >> seq"${i}".fasta
  fi
done < $1

ADD REPLY • link 11 days ago by Murucututu • 0

0

Entering edit mode

Good call - the increment should happen outside of the if.

ADD REPLY • link 11 days ago by Ram 44k

score 0 · Answer 10 · 2021-05-25

I think a lot of folks here would benefit from knowing about the seqmagick tool.

Seqmagick is a kickass little utility built in the spirit of imagemagick to expose the file format conversion in Biopython in a convenient way. Instead of having a big mess of scripts, there is one that takes arguments

It can be easily installed with pip or conda (using the bioconda channel), even without root permissions.

score 0 · Answer 11 · 2023-07-26

0

Entering edit mode

14 months ago

Charles-Alexandre Roy ▴ 50

Add dummy sequence IDs for a set of sequences:

awk '{print ">Seq_"NR"\n"$1}' sequences.txt > sequences.fasta

A simple way to add dummy sequence IDs if you only have the sequences themselves (1 per line) and are working with a tool that requires data in FASTA format.

ADD COMMENT • link 14 months ago by Charles-Alexandre Roy ▴ 50