6.6 years ago by
University of Nebraska
I assume you are doing this for amplicon (metagenomics) or downstream phylogenetic analysis? You can use the ARB platform to convert the files to FASTA format for your database or matrix construction. The .arb file format is binary, so you could try to convert it or figure out how to parse it if you need the data in this format. I think it's easiest to download the data from their FTP in FASTA format: they provide both their own format (ARB) and FASTA, and the data is the same.
UPDATE (based on additional comment):
RE: completely wrong phylogenetic placement
I'm unclear why you are having problems placing your sequences onto the reference database using phylogenetic methods. Trimming may help "refine" your phylogenetic placement, but I would first look at your alignment and make sure you are comparing homologous regions. For example, if you are looking at the 16S rRNA (SSU) sequence, are you certain you are using the correct region? If you cannot infer homology in your data matrix, then no amount of trimming or editing is going to help you. Once you have an aligned data matrix and you can see that your sequences are homologous, then it may help to trim the data matrix.
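To illustrate what "trimming the data matrix" can look like once the alignment is in hand, here is a minimal Python sketch. It assumes the aligned sequences are already loaded as equal-length strings, treats both `-` and `.` as gap characters (an assumption about your alignment), and uses an arbitrary 50% gap threshold. The function names are hypothetical. It is only a mechanical column filter and does nothing to establish homology, which you still have to check by eye.

```python
from typing import List


def gap_fraction_per_column(aligned: List[str], gap_chars: str = "-.") -> List[float]:
    """Return the fraction of gap characters in each alignment column."""
    lengths = {len(seq) for seq in aligned}
    if len(lengths) != 1:
        # Unequal lengths (or no sequences) mean this is not an alignment matrix.
        raise ValueError("expected a non-empty set of equal-length aligned sequences")
    n_seqs = len(aligned)
    n_cols = lengths.pop()
    return [
        sum(seq[col] in gap_chars for seq in aligned) / n_seqs
        for col in range(n_cols)
    ]


def trim_gappy_columns(aligned: List[str], max_gap: float = 0.5) -> List[str]:
    """Drop every column whose gap fraction is greater than max_gap."""
    fractions = gap_fraction_per_column(aligned)
    keep = [col for col, frac in enumerate(fractions) if frac <= max_gap]
    return ["".join(seq[col] for col in keep) for seq in aligned]
```

For example, `trim_gappy_columns(["--ACGT-A", "-TACGTTA", "--ACG-TA"])` drops the first two columns, which are mostly gaps, and keeps the rest.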
RE: first problem is the conversion of a fasta file (with the truncated reference sequences) to a file with arb extension. A second problem is that the arb program needs a tree file in arb format next to the bare sequences in arb format...
I'm a little confused why you feel you have to use the ARB platform. There are very well developed methods for working with sequence files in text format. As I mentioned previously, I think it would be in your best interest to use the FASTA files from the SILVA database instead of the ARB platform and file format. Yes, ARB is set up to use the binary database files in its own platform, but using the FASTA files, aligning with a commonly used program (I typically use MUSCLE), and then using phylogenetic methods or an amplicon sequencing method that is phylogenetically based (I like TopiaryExplorer) will get you a lot farther than using a self-contained system such as ARB. See "Bacterial Phylogeny", which briefly describes my typical phylogenetic workflow.
RE: Since we are talking about tens of thousands of sequences, manually changing each line in the text file is not an option.
There are easier ways of doing things than "manually changing each line". That is why we are here: to help you learn how to trim thousands of FASTA records in seconds instead of spending months trimming them by hand, and possibly making errors along the way (because you'll get sick and tired of editing all those sequences and lose concentration). See "Trim The Fasta Title".
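As a rough idea of how little code this takes, here is a minimal sketch in plain Python (standard library only). It reads a FASTA file, optionally shortens each title to its first whitespace-delimited word, and cuts every sequence to the same coordinate window. The function names and the `start`/`end` coordinates are hypothetical; which positions to keep depends on your region of interest, and a fixed window only makes sense if the sequences are already aligned to each other.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str]  # (title, sequence)


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield (title, sequence) pairs from an open FASTA file."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def trim_records(records: Iterable[Record], start: int, end: int,
                 short_title: bool = True) -> Iterator[Record]:
    """Slice each sequence to [start, end) (0-based) and optionally shorten titles."""
    for title, seq in records:
        if short_title:
            words = title.split()
            title = words[0] if words else title
        yield title, seq[start:end]


def write_fasta(records: Iterable[Record], handle: TextIO, width: int = 60) -> None:
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for title, seq in records:
        handle.write(f">{title}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


if __name__ == "__main__":
    # Tiny in-memory demonstration; swap io.StringIO for open("your_file.fasta").
    demo = io.StringIO(">seq1 Bacteria;Firmicutes\nACGTAC\nGTAA\n>seq2 desc\nTTTTGGGGCC\n")
    out = io.StringIO()
    write_fasta(trim_records(read_fasta(demo), 2, 8), out)
    print(out.getvalue(), end="")
```

Because the records are streamed one at a time, the same approach should handle tens of thousands of sequences without loading the whole file into memory.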
modified 6.6 years ago
6.6 years ago by
Josh Herr ♦ 5.7k