TUTORIAL
ARTEMIS version 1.51
Bohdan, D. R., Bujnicki, J. M., & Baulin, E. F. (2024). ARTEMIS: a method for topology-independent superposition of RNA 3D structures and structure-based sequence alignment. Nucleic Acids Research, gkae758. DOI: 10.1093/nar/gkae758
INTRODUCTION
Installation
ARTEMIS is written in Python3 and implemented as a command-line application. To install ARTEMIS, download the repository from GitHub and install the dependencies:
git clone https://github.com/david-bogdan-r/ARTEMIS.git
cd ARTEMIS
pip install -r requirements.txt
ARTEMIS requires five libraries to be installed: numpy, pandas, scipy, matplotlib, requests. It was tested with two different Python3 environments:
1) Ubuntu 20.04: python==3.8, numpy==1.22.3, pandas==1.4.1, scipy==1.8.0, requests==2.31.0
2) MacOS Sonoma 14.0: python==3.12, numpy==1.26.3, pandas==2.1.4, scipy==1.11.4, requests==2.31.0
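Before running ARTEMIS, it can help to confirm that the five libraries listed above are importable in the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_dependencies` is hypothetical and not part of ARTEMIS.

```python
from importlib.util import find_spec

# The five libraries named in this tutorial.
REQUIRED = ("numpy", "pandas", "scipy", "matplotlib", "requests")


def missing_dependencies(names=REQUIRED):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only shows that the packages can be found; it does not check that their versions match the tested environments.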
Usage
python3 artemis.py r=FILENAME q=FILENAME [OPTIONS]
ARTEMIS performs pairwise alignment of nucleic acid-containing 3D structures in PDB/mmCIF format. For speed, it's advised to specify the larger structure as the reference (r). Both input structure parameters (r and q) may specify a file, a folder, a mask, or a four-character PDB ID.
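When ARTEMIS is called from a script, the `key=value` options and `-flag` switches can be assembled programmatically. The sketch below assumes that `artemis.py` is in the current directory; the helper names `build_artemis_command` and `run_artemis` are hypothetical wrappers, not part of ARTEMIS.

```python
import subprocess
import sys


def build_artemis_command(r, q, script="artemis.py", flags=(), **options):
    """Assemble the argument list for: python3 artemis.py r=... q=... [OPTIONS]."""
    cmd = [sys.executable, script, f"r={r}", f"q={q}"]
    # key=value options, e.g. rres="/C" becomes "rres=/C"
    cmd += [f"{key}={value}" for key, value in options.items()]
    # switches, e.g. "superonly" becomes "-superonly"
    cmd += [f"-{flag}" for flag in flags]
    return cmd


def run_artemis(*args, **kwargs):
    """Run ARTEMIS and return its standard output as text."""
    cmd = build_artemis_command(*args, **kwargs)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Builds the argument list only; nothing is executed here.
cmd = build_artemis_command("1ivs", "4p5j", rres="/C", saveto=".")
```

Passing the arguments as a list avoids shell quoting issues with values such as `qresneg=":G :A"`.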
Quick Start
python3 artemis.py r=1ivs rres=/C q=4p5j saveto=.
In this example, ARTEMIS aligns the RNA structure from PDB entry 4p5j (q=4p5j) to RNA chain C (rres=/C) of PDB entry 1ivs (r=1ivs) and saves the superimposed query structure to the current folder (saveto=.). When the query or the reference structure is specified by a four-character PDB ID, ARTEMIS automatically downloads the entry from RCSB PDB. By default, ARTEMIS always reports the sequentially-ordered alignment first; the topology-independent alignment is reported second only if its TM-score is at least 10% better than that of the sequentially-ordered one. Here, the topology-independent alignment is substantially better, so ARTEMIS reports both alignments. The output of this example should look like this:
********************************************************************
* ARTEMIS (Version 1.51) *
* using ARTEM to Infer Sequence alignment *
* Reference: 10.1093/nar/gkae758 *
* Please email comments and suggestions to dav.bog.rom@gmail.com *
********************************************************************
Name of structure r: 1ivs:C
Name of structure q: 4p5j:A (to be superimposed onto structure r)
Length of structure r: 75 residues
Length of structure q: 83 residues
Aligned length= 58, RMSD= 3.47, Seq_ID=n_identical/n_aligned= 0.534
TM-score= 0.42135 (normalized by length of structure r: L=75, d0=2.68)
TM-score= 0.40612 (normalized by length of structure q: L=83, d0=2.95)
(":" denotes residue pairs of d < 5.0 Angstrom, "." denotes other aligned residues)
GGGCGGCUAGCUCAGCGG---AAGAGCGCUCGCC----UCACACGCGAGAGGUCGUAGGUUCAAGUCCUACGCC---------------G-CCC--ACCA
.::::::.:. .:::::::::::. .::::::::::. ::::::::::::::::::: : ::.
--------AGCUCGCCAGUUAGCGAGGUCUGUCUCGAC----ACGACAGAUAAU-CGGGUGCAACUCCCGCCCCUCUUCCGAGGGUCAUCGGAACC----
_______________________________________________________________________________
Alignment with permutations:
Aligned length= 63
TM-score= 0.46811 (normalized by length of structure r)
TM-score= 0.45025 (normalized by length of structure q)
RMSD= 3.19
Seq_ID=n_identical/n_aligned= 0.460
Distance table:
1ivs dist 4p5j
1.C.G.901. 5.50 1.A.U.65.
...
1.C.C.971. 7.60 1.A.A.81.
#Total CPU time is 2.86 seconds
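The TM-score lines in this report can be extracted with a regular expression, which also makes it possible to check the 10% rule against the numbers shown. This is a sketch only: the sample text is copied from the output above, the function names are hypothetical, and "at least 10% better" is read here as a ratio of at least 1.10, which is an assumption about how ARTEMIS applies the rule.

```python
import re

# Lines copied from the example output above.
SAMPLE = """\
Aligned length= 58, RMSD= 3.47, Seq_ID=n_identical/n_aligned= 0.534
TM-score= 0.42135 (normalized by length of structure r: L=75, d0=2.68)
TM-score= 0.40612 (normalized by length of structure q: L=83, d0=2.95)
Alignment with permutations:
Aligned length= 63
TM-score= 0.46811 (normalized by length of structure r)
TM-score= 0.45025 (normalized by length of structure q)
"""

TM_RE = re.compile(r"TM-score=\s*([0-9.]+) \(normalized by length of structure ([rq])")


def tm_scores(text):
    """Return [(structure, score), ...] in the order the lines appear."""
    return [(which, float(score)) for score, which in TM_RE.findall(text)]


def is_ten_percent_better(topology_independent, sequential, margin=0.10):
    # Assumed reading of the rule: ti >= (1 + margin) * seq.
    return topology_independent >= (1.0 + margin) * sequential


scores = tm_scores(SAMPLE)
(_, seq_r), (_, seq_q), (_, ti_r), (_, ti_q) = scores
print(is_ten_percent_better(ti_r, seq_r), is_ten_percent_better(ti_q, seq_q))
```

Under this reading, both the reference-normalized and the query-normalized scores of the example clear the 10% margin.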
Additionally, ARTEMIS should save six files to the working directory (saveto=.):
- 4p5j_to_1ivs.pdb - the query structure in PDB format superimposed to the reference according to the sequentially-ordered alignment;
- 4p5j_to_1ivs.png - 2D plot of the sequentially-ordered alignment;
- 4p5j_to_1ivs.tsv - list of the matched residues according to the sequentially-ordered alignment;
- 4p5j_to_1ivs_ti.pdb - the query structure in PDB format superimposed to the reference according to the topology-independent alignment;
- 4p5j_to_1ivs_ti.png - 2D plot of the topology-independent alignment;
- 4p5j_to_1ivs_ti.tsv - list of the matched residues according to the topology-independent alignment.
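A script that post-processes these files may want to compute their expected paths. The sketch below infers the naming pattern `<query>_to_<reference>[_ti].<ext>` from this single example, so the pattern itself is an assumption, and `expected_outputs` is a hypothetical helper rather than an ARTEMIS function.

```python
from pathlib import Path


def expected_outputs(query, reference, saveto=".", with_ti=True):
    """List the output paths suggested by the Quick Start example.

    Assumes PDB-formatted coordinates, as in the example above.
    """
    stem = f"{query}_to_{reference}"
    stems = [stem, f"{stem}_ti"] if with_ti else [stem]
    return [Path(saveto) / f"{s}.{ext}" for s in stems for ext in ("pdb", "png", "tsv")]


names = [path.name for path in expected_outputs("4p5j", "1ivs")]
```

Setting `with_ti=False` covers the case where only the sequentially-ordered alignment is reported.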
USAGE EXAMPLES
1.Residue specifications
ARTEMIS allows users to specify which residues should be considered as part of input structures, which residues should be ignored, and which residues should be saved as the superimposed query structure and in what format:
python artemis.py r=6ugg rresneg=#1/B q=1ivs qres=/C saveres=/A saveformat=pdb saveto=result
In this example, ARTEMIS superimposes one of the two tRNA(Val) chains from PDB entry 1IVS (q=1ivs qres=/C, chain C is considered as the query structure, and chain D is ignored) to one of the two tRNA(Asp) chains from PDB entry 6UGG (r=6ugg rresneg=#1/B, chain A is considered as the reference structure, and chain B is ignored). Then, ARTEMIS saves only the tRNA(Val) synthetase protein molecule from 1IVS (saveres=/A), transformed according to the alignment of the two RNA molecules. The superimposed protein molecule is saved in PDB format in the "result" subfolder of the working directory (saveformat=pdb saveto=result).
The list of residue specification parameters and their example values:
- rres=/A - reference residues to select, chain A;
- rresneg=#2 - reference residues to ignore, model 2;
- qres=#1:_1_100 - query residues to select, model 1, residues 1 to 100;
- qresneg=":G :A" - query residues to ignore, all guanosines and adenosines;
- saveres=:_10B - query residues to save, residue 10 with ins.code B.
The details of the ChimeraX-like residue specification format:
[#[INT]][/[STRING]][:[STRING][_INT[CHAR|_INT]]]
This is the general pattern of a residue specification keyword, where square brackets mark optional parts. Each specification parameter can be expressed as a quoted space-separated list of keywords (e.g., qresneg=":G :A"). An insertion code can be specified only for a single residue, not for a range. The components:
- #[INT] - Model number
- /[STRING] - Chain identifier
- :[STRING][_INT[CHAR|_INT]] - Residue(s) specification:
  - :STRING - Residue type
  - :_INT[CHAR] - Residue number [with insertion code]
  - :_INT_INT - Range of residue numbers
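To make the keyword format concrete, the sketch below parses a keyword into its components with a Python regular expression. It is an illustration of the format described above, not ARTEMIS's own parser; the names `parse_keyword` and `parse_specs` are hypothetical, and accepting negative residue numbers and single-letter insertion codes are assumptions.

```python
import re

# [#[INT]][/[STRING]][:[STRING][_INT[CHAR|_INT]]]
SPEC_RE = re.compile(
    r"(?:#(?P<model>\d*))?"                      # model number
    r"(?:/(?P<chain>[^:]*))?"                    # chain identifier
    r"(?::(?P<resname>[^_]*)"                    # residue type
    r"(?:_(?P<start>-?\d+)"                      # residue number (or range start)
    r"(?:_(?P<end>-?\d+)|(?P<icode>[A-Za-z]))?"  # range end OR insertion code
    r")?)?"
)


def parse_keyword(keyword):
    """Split one residue specification keyword into its components."""
    match = SPEC_RE.fullmatch(keyword)
    if match is None:
        raise ValueError(f"not a residue specification: {keyword!r}")
    g = match.groupdict()
    return {
        "model": int(g["model"]) if g["model"] else None,
        "chain": g["chain"] or None,
        "resname": g["resname"] or None,
        "start": int(g["start"]) if g["start"] is not None else None,
        "end": int(g["end"]) if g["end"] is not None else None,
        "icode": g["icode"],
    }


def parse_specs(value):
    """Parse a space-separated list of keywords, e.g. ':G :A'."""
    return [parse_keyword(keyword) for keyword in value.split()]
```

For example, `parse_keyword("#1:_1_100")` yields model 1 with the residue range 1 to 100, and `parse_keyword(":_10B")` yields residue 10 with insertion code B. Because the range end and the insertion code are alternatives in the pattern, a range with an insertion code is rejected, in line with the restriction stated above.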
2.Superposition-only mode
When the two input structures have identical lengths and a perfect residue-by-residue sequence alignment should be assumed between them (for example, when comparing two models of the same nucleic acid molecule), the user can run ARTEMIS in the superposition-only mode (-superonly):
python artemis.py r=7sam q=7scq -superonly
The output of this command should look like this:
********************************************************************
* ARTEMIS (Version 1.5) *
* using ARTEM to Infer Sequence alignment *
* Reference: 10.1093/nar/gkae758 *
* Please email comments and suggestions to dav.bog.rom@gmail.com *
********************************************************************
Name of structure r: 7sam:A
Name of structure q: 7scq:C (to be superimposed onto structure r)
Length of structure r: 169 residues
Length of structure q: 169 residues
Aligned length= 169, RMSD= 14.07, Seq_ID=n_identical/n_aligned= 1.000
TM-score= 0.63014 (normalized by length of structure r: L=169, d0=5.29)
TM-score= 0.63014 (normalized by length of structure q: L=169, d0=5.29)
(":" denotes residue pairs of d < 5.0 Angstrom, "." denotes other aligned residues)
CGUGGUUGACACGCAGACCUCUUACAAGAGUGUCUAGGUGCCUUUGAGAGUUACUCUUUGCUCUCUUCGGAAGAACCCUUAGGGGUUCGUGCAUGGGCUUGCAUAGCAAGUCUUAGAAUGCGGGUACCGUACAGUGUUGAAAAACACUGUAAAUCUCUAAAAGAGACCA
....................................:::::::::::::::.::::::::::::.::::::::::::::::::::::::::::::::::::.......:::::::::::::::::::::::::::::::::::::::....:::::::.::::::::::
CGUGGUUGACACGCAGACCUCUUACAAGAGUGUCUAGGUGCCUUUGAGAGUUACUCUUUGCUCUCUUCGGAAGAACCCUUAGGGGUUCGUGCAUGGGCUUGCAUAGCAAGUCUUAGAAUGCGGGUACCGUACAGUGUUGAAAAACACUGUAAAUCUCUAAAAGAGACCA
#Total CPU time is 0.06 seconds
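The d0 values printed in the reports (2.68 for L=75, 2.95 for L=83, 5.29 for L=169) agree with the TM-score_RNA distance scale introduced with RNA-align, d0 = 0.6 * sqrt(L - 0.5) - 2.5. The sketch below assumes that ARTEMIS uses this definition and the usual TM-score sum over aligned residue pairs; the function names are hypothetical, and only lengths of 30 or more are covered.

```python
import math


def d0_rna(length):
    """TM-score_RNA distance scale for structures of 30 or more residues."""
    if length < 30:
        raise ValueError("short-structure d0 values are not covered by this sketch")
    return 0.6 * math.sqrt(length - 0.5) - 2.5


def tm_score(distances, norm_length):
    """TM-score from the distances (in Angstroms) between aligned residue pairs.

    `norm_length` is the length of the structure used for normalization.
    """
    d0 = d0_rna(norm_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / norm_length


for length in (75, 83, 169):
    print(length, round(d0_rna(length), 2))
```

Because each pair contributes at most 1 and the sum is divided by the normalization length, a structure superimposed onto itself with all distances equal to zero scores exactly 1.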
3.Sub-optimal matches
ARTEMIS can report non-overlapping alternative matches of the smaller query structure in the larger reference structure:
python artemis.py r=486d q=1euy addhits=0.5
In this example, ARTEMIS is asked to report all matches of the tRNA molecule from PDB entry 1EUY (chain B) within the reference structure from PDB entry 486D that have a query TM-score >= 0.5 (addhits=0.5). Entry 486D contains three tRNA chains: A, C, and E. If 0 < addhits <= 1, the parameter is treated as the minimum query TM-score threshold. If addhits > 1, the parameter is treated as the maximum number of alternative matches to report. In this example, addhits=0.5 and addhits=2 produce identical results: one optimal match of 66 residues with chain C and two sub-optimal matches of 65 and 62 residues with chains E and A, respectively:
R Q ID RL QL TYPE LALI RMSD RTM QTM MATCH
486d 1euy 0 332 73 SEQORD 66 2.61 0.18408 0.60728 1.C.C.2.=1.B.G.902.,...
486d 1euy 1 266 73 SEQORD 65 2.92 0.21980 0.60861 1.E.U.1.=1.B.G.902.,...
486d 1euy 2 201 73 SEQORD 62 2.90 0.26461 0.55625 1.A.U.1.=1.B.G.902.,...
Legend:
- R - Reference structure file;
- Q - Query structure file;
- ID - Iteration ID (starting from 0);
- RL - Reference structure length (number of residues);
- QL - Query structure length (number of residues);
- TYPE - Sequentially-ordered (SEQORD) / Topology-independent (TOPIND) alignment;
- LALI - Match length Lali (number of residues);
- RMSD - Match RMSD (calculated for C3' atoms);
- RTM - Reference structure TM-scoreRNA;
- QTM - Query structure TM-scoreRNA;
- MATCH - Comma-separated list of matching residues (<r-residue>=<q-residue>).
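The two meanings of addhits described in this section can be summarized in a few lines. This is a sketch of the documented rule only, not ARTEMIS's implementation: the function names are hypothetical, the rule is applied here to the alternative matches that follow the optimal one, and values of 0 or below are rejected because this tutorial does not describe them.

```python
def interpret_addhits(value):
    """Return ('min_qtm', threshold) or ('max_hits', count) for an addhits value."""
    value = float(value)
    if value <= 0:
        raise ValueError("addhits <= 0 is not described in this tutorial")
    if value <= 1:
        return ("min_qtm", value)
    return ("max_hits", int(value))


def select_alternatives(qtm_scores, addhits):
    """Filter the QTM values of alternative matches, given in reported order."""
    mode, limit = interpret_addhits(addhits)
    if mode == "min_qtm":
        return [score for score in qtm_scores if score >= limit]
    return list(qtm_scores[:limit])


# QTM values of the two sub-optimal matches (ID 1 and ID 2) in the table above.
alternatives = [0.60861, 0.55625]
print(select_alternatives(alternatives, 0.5) == select_alternatives(alternatives, 2))
```

With the values from the table, both settings keep the same two sub-optimal matches, which is why addhits=0.5 and addhits=2 give identical results in this example.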
4.Structure search
ARTEMIS can be used for screening a structure of interest against a set of structures, for example for automated database searches for similar folds:
python artemis.py r=DBfolder/ q=struct_filename -silent -tsv > matches.tsv
Here, ARTEMIS searches the query structure <struct_filename> (q=struct_filename) against all coordinate files within the <DBfolder> directory (r=DBfolder/). The matches found are reported in the tsv format (-tsv, see Section 3 for the format details) and saved into a file (> matches.tsv). With -silent, if ARTEMIS crashes on a particular pair of structures, it prints an error message to stderr and proceeds to the next structure without stopping the entire search. If the query structure is larger than the average structure in the database, it's advised to swap the reference and query inputs to reduce the screening time (q=DBfolder/ r=struct_filename).
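The resulting file can be post-processed with the standard `csv` module, for example to rank hits by query TM-score. The sketch below assumes that the file starts with the header line shown in Section 3; the sample rows are copied from that section, where the MATCH column is truncated, so only the first residue pair is kept. The function names are hypothetical.

```python
import csv
import io

# Rows from Section 3; MATCH is truncated there, so only the first pair is kept.
SAMPLE_TSV = (
    "R\tQ\tID\tRL\tQL\tTYPE\tLALI\tRMSD\tRTM\tQTM\tMATCH\n"
    "486d\t1euy\t0\t332\t73\tSEQORD\t66\t2.61\t0.18408\t0.60728\t1.C.C.2.=1.B.G.902.\n"
    "486d\t1euy\t1\t266\t73\tSEQORD\t65\t2.92\t0.21980\t0.60861\t1.E.U.1.=1.B.G.902.\n"
    "486d\t1euy\t2\t201\t73\tSEQORD\t62\t2.90\t0.26461\t0.55625\t1.A.U.1.=1.B.G.902.\n"
)


def read_matches(handle):
    """Read ARTEMIS tsv rows, converting numeric columns and splitting MATCH."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["LALI"] = int(row["LALI"])
        for key in ("RMSD", "RTM", "QTM"):
            row[key] = float(row[key])
        # Each pair has the form <r-residue>=<q-residue>.
        row["MATCH"] = [tuple(pair.split("=")) for pair in row["MATCH"].split(",") if pair]
        rows.append(row)
    return rows


def best_by_qtm(rows, top=10):
    """Return up to `top` rows with the highest query TM-score."""
    return sorted(rows, key=lambda row: row["QTM"], reverse=True)[:top]


rows = read_matches(io.StringIO(SAMPLE_TSV))
print([(row["R"], row["ID"], row["QTM"]) for row in best_by_qtm(rows, top=2)])
```

For a real search, `open("matches.tsv", newline="")` would replace the `io.StringIO` sample.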
5.PDB/CIF conversion, formatting & cutting-out
ARTEMIS is a handy tool to cut out residues of interest from the coordinate files, convert between PDB and mmCIF formats, and "fix" non-canonical PDB/CIF format variants. For example, ARTEMIS automatically turns "C3*", "O1P", and "O2P" nucleotide atoms into their more common variants: "C3'", "OP1", and "OP2". The PDB-formatted output provided by ARTEMIS is readily acceptable for CASP and RNA-Puzzles submissions. Furthermore, ARTEMIS automatically downloads and saves nucleic acid structures from PDB entries in the preferred format:
python artemis.py r=8uo6 q=8uo6 -superonly -notmopt saveto=. saveres=/B saveformat=cif
In this example, ARTEMIS downloads the PDB entry 8UO6 as both the reference and the query structure (r=8uo6 q=8uo6), trivially superimposes it onto itself via the Kabsch algorithm, i.e. without trying to maximize the TM-scoreRNA (-superonly -notmopt), and saves only chain B (saveres=/B) in mmCIF format (saveformat=cif) under the working directory (saveto=.).
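The atom-name clean-up mentioned in this section amounts to a lookup from legacy names to their common variants. The sketch below covers only the three renamings named above; it is an illustration rather than ARTEMIS's own table, which may handle more cases, and `normalize_atom_name` is a hypothetical helper.

```python
# Only the three renamings mentioned in this section.
ATOM_RENAMES = {
    "C3*": "C3'",
    "O1P": "OP1",
    "O2P": "OP2",
}


def normalize_atom_name(name):
    """Map a legacy nucleotide atom name to its common variant, if listed."""
    name = name.strip()
    return ATOM_RENAMES.get(name, name)


print([normalize_atom_name(n) for n in ("C3*", "O1P", "O2P", "P")])
```

Names that are not in the table, such as "P", are returned unchanged.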
SEE ALSO
Demo Jupyter Notebook for ARTEMIS Python func
SQUARNA - an RNA secondary structure prediction method based on a greedy stem formation model. DOI: 10.1101/2023.08.28.555103
ARTEM - a tool for RNA tertiary motif search. DOI: 10.1093/nar/gkad605
urslib2 - a Python library for processing RNA structure data from PDB/mmCIF files & DSSR annotations. DOI: 10.1093/nar/gkad605
LORA - a dataset of long-range RNA 3D modules. DOI: 10.1093/nar/gkad605
ARTEM-KT - a dataset of kink-turn-like RNA 3D modules. DOI: 10.1101/2024.05.31.596898