Question

FASTA file editor -- what common utilities would it cover?

1

9.4 years ago

Nancy Ouyang ▴ 170

(Let me know if I'm incorrect and there exists a widely useful FASTA file visual editor)

What tools do you use for working with FASTA files, and what might you want to see in a FASTA file graphical editor?

I feel like there are some simple tasks when working with FASTA files that we should have a GUI editor for by this point, something like "highlight x1 to x2 on this header in this file" or "tell me how long the sequences in this file are". Possibly this GUI should be built on top of a single package of common command-line tools.

For instance, tasks I've done recently are

view/extract a subsequence: samtools faidx chrN chrN:x1-x2
count how many records are in a fasta file: esl-seqstat foo.fa (part of HMMER)
look at the header lines for a fasta file: less foo.fa | grep \>
see if a sequence exactly matches another one, where match(S) = S, reverse(S), complement(S), reverse_complement(S): I wrote a custom Biopython script, python exactsearch.py file brca1-hg19.fa brca1-hg38.fa
search for a sequence in a FASTA file (hampered by the newlines and the reverse/complement issue): well, I just use the exactsearch.py script, which is a little cumbersome
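
For readers wondering what the last two tasks involve, here is a minimal sketch in plain Python. It is not the exactsearch.py script mentioned above (that script is not shown in this thread); the function names `orientations`, `read_fasta_text` and `find_query` are made up for illustration, and the complement table assumes plain A/C/G/T with no IUPAC ambiguity codes. Wrapped sequence lines are joined before searching, which sidesteps the newline problem.

```python
# Complement table for unambiguous DNA bases.
# Assumption: no IUPAC ambiguity codes; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def orientations(seq):
    """Return the four orientations of seq, keyed by a descriptive name."""
    comp = seq.translate(_COMPLEMENT)
    return {
        "forward": seq,
        "reverse": seq[::-1],
        "complement": comp,
        "reverse_complement": comp[::-1],
    }


def read_fasta_text(text):
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def find_query(query, records):
    """List (header, orientation, 0-based offset) for every exact hit."""
    hits = []
    variants = orientations(query.upper())
    for header, seq in records.items():
        upper = seq.upper()
        for name, variant in variants.items():
            start = upper.find(variant)
            while start != -1:
                hits.append((header, name, start))
                start = upper.find(variant, start + 1)
    return hits


if __name__ == "__main__":
    # The sequence is wrapped over two lines on purpose.
    demo = read_fasta_text(">chr1\nACGTT\nGCA\n")
    for hit in find_query("TTGC", demo):
        print(hit)
```

A palindromic query will be reported under more than one orientation name, and this version loads every record into memory, so it only suits small files.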

What else do you commonly find yourself doing with FASTA files that a graphical editor might speed up?

If command-line tools are sufficient for you, let me know too. I might just be grumpy about having the same manipulating-FASTA-files annoyances after not doing anything with them for a few years.

editor FASTA • 4.7k views

updated 2.2 years ago by Ram 43k • written 9.4 years ago by Nancy Ouyang ▴ 170

Ram · Answer 1 · 2014-12-05

1

9.4 years ago

SES 8.6k

Most editors hold the whole file in a buffer so your changes are not lost, which makes visually inspecting NGS data impractical. You could search for a pattern or edit files in a visual editor, but on files of that size it would be very slow and use too much memory. It is probably best to do this work in a streaming fashion at the command line. I use awk/perl to print the specific lines causing issues and sed/awk/whatever to edit files, and I suspect other bioinformaticians do the same.
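
To illustrate what "streaming" means here, a rough Python sketch follows (awk or perl would do the same job). It reads a FASTA file line by line and keeps only one record in memory at a time. The names `iter_fasta` and `sequence_lengths` are hypothetical, and the sketch assumes a single record fits in memory, which may not hold for very large chromosomes.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) pairs one record at a time from an open text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


def sequence_lengths(handle):
    """Yield (header, length) for each record without keeping earlier records."""
    for header, seq in iter_fasta(handle):
        yield header, len(seq)


if __name__ == "__main__":
    # io.StringIO stands in for open("foo.fa") so the sketch runs on its own.
    demo = io.StringIO(">seq1 first\nACGT\nAC\n>seq2\nGGG\n")
    for header, length in sequence_lengths(demo):
        print(header, length)
```

Counting records is then just a matter of counting what the generator yields, for example `sum(1 for _ in iter_fasta(handle))`.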

For extracting sequences or matching tasks, I prefer using standard tools written in a compiled language because that is faster and it is easier to remember the commands.

edit: I don't mean to sound discouraging because this type of thing could be helpful to some biologists, though a number of tools do exist for smaller data sets. I would suggest looking at Geneious, MEGA, BioEdit, UGENE (others?) to make sure there is no duplication of efforts.

updated 2.2 years ago by Ram 43k • written 9.4 years ago by SES 8.6k

0

Ah, thanks! Those are interesting points, and thanks for the names of those tools -- they seem quite nifty. I was thinking in the context of exploring public datasets for a short workshop I want to teach in January for friends, where I don't want to get bogged down in covering bash file manipulation, so I'll definitely check out those you mentioned that are free to use. I'm very new to the field in its current state, so I appreciate your patience!

9.4 years ago by Nancy Ouyang ▴ 170

0

No worries, you had very good ideas; it's just that people beat you to it! Geneious is not free, but there is a free trial. MEGA only runs on Windows (and so does BioEdit, if I recall correctly). I'll try to remember other tools as well.

9.4 years ago by SES 8.6k