The radical SAM superfamily (RSS) is arguably the largest and most functionally diverse enzyme superfamily. Many functions (and intriguing reaction mechanisms) have been discovered; many more remain to be discovered!
RadicalSAM.org is designed to leverage “top-down” discovery of function using the EFI’s genomic enzymology web tools. The sequence similarity network (SSN) for the RSS is too large to be analyzed with Cytoscape and the RAM available on most computers, so it has been inaccessible to the RSS community.
We generated the SSN for the entire RSS using a computer with 768 GB RAM and segregated it into clusters for 1) the 20 subgroups curated by the Structure-Function Linkage Database (SFLD) and 2) many additional subgroups not curated by the SFLD.
For each subgroup, RadicalSAM.org provides:
We encourage users to submit experimentally characterized functional annotations for sequences that have not yet been curated by SwissProt so that these can be made available to the RSS community.
A Perspective has been accepted for publication in ACS Bio & Med Chem Au describing RadicalSAM.org. If you use RadicalSAM.org, please cite us:
Nils Oberg, Timothy W. Precord, Douglas A. Mitchell, and John A. Gerlt, RadicalSAM.org: A Resource to Interpret Sequence-Function Space and Discover New Radical SAM Enzyme Chemistry, ACS Bio & Med Chem Au 2021 https://doi.org/10.1021/acsbiomedchemau.1c00048
The radical SAM superfamily (RSS) is arguably the largest and most functionally diverse enzyme superfamily. Its members contain a Fe4S4 cluster near the N-terminus of a (β/α)6-barrel domain that binds S-adenosyl methionine (SAM); one-electron reduction of the bound SAM yields Met and the 5′-deoxyadenosyl radical (5′-dAdo•). By hydrogen abstraction, the 5′-dAdo• generates a substrate radical (R•) (and 5′-deoxyadenosine) that undergoes intriguing and often complex chemistry to yield the product.
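The two steps described above can be written as a compact reaction scheme. This is only a schematic restatement of the paragraph: "R-H" is an assumed generic label for the substrate C-H bond, and the source of the electron is not specified here.

```latex
% Requires amsmath for align* and \text.
\begin{align*}
\text{SAM} + e^{-} &\longrightarrow \text{Met} + 5'\text{-dAdo}^{\bullet} \\
5'\text{-dAdo}^{\bullet} + \text{R-H} &\longrightarrow 5'\text{-deoxyadenosine} + \text{R}^{\bullet}
\end{align*}
```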
The SFLD (http://sfld.rbvi.ucsf.edu/archive/django/index.html) used an SSN to segregate the RSS into 20 subgroups with characterized functions and 22 without characterized functions. Their analysis was described in Methods in Enzymology Volume 606 in 2018: Atlas of the Radical SAM Superfamily: Divergent Evolution of Function Using a "Plug and Play" Domain, G.L. Holliday, E. Akiva, E.C. Meng, S.D. Brown, S. Calhoun, U. Pieper, A. Sali, S.J. Booker and P.C. Babbitt (doi: 10.1016/bs.mie.2018.06.004).
At the time of the SFLD's analyses (2017), the RSS included 113,776 sequences that were collected at 50% sequence identity into 10,741 representative nodes so that the SSN could be visualized and analyzed using Cytoscape (https://cytoscape.org/). The SSN is reproduced below (Figure 5 in the Atlas; minimum e-value threshold to draw edges between nodes is 1e-20), with the characterized subgroups numbered and colored as described by the SFLD.
We provide a web resource with three "democratized" genomic enzymology tools (https://efi.igb.illinois.edu/; doi: 10.1021/acs.biochem.9b00735). The tools are used to 1) explore sequence-function space in protein families using SSNs (generated with EFI-EST), 2) collect and explore genome neighborhoods for clues about functions of uncharacterized enzymes (collected with EFI-GNT), and 3) prioritize uncharacterized SSN clusters for functional assignment based on human microbiome metagenome abundance (chemically guided functional profiling with EFI-CGFP).
Unfortunately, most experimentalists interested in the RSS cannot take full advantage of the tools:
We developed RadicalSAM.org to provide lists of accession IDs (UniProt, UniRef90, and UniRef50) for the SFLD-curated subgroups and uncurated subgroups so that their SSNs can be generated with EFI-EST. For most of the SFLD subgroups and additional clusters, the number of UniRef50 IDs is sufficiently small that useful SSNs can be visualized and analyzed with Cytoscape using typically available computers.
RadicalSAM.org also provides information about each subgroup to aid target selection and inference of function, including the multiple sequence alignment (MSA), WebLogo, hidden Markov model (HMM), length histogram, phylogenetic distribution, SwissProt annotations, and number and locations of conserved Cys residues.
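As an illustration of what "number and locations of conserved Cys residues" can mean in practice, the following minimal sketch scans the columns of an MSA and reports those in which Cys dominates. The function name, the 90% conservation cutoff, and the toy alignment are all assumptions made for this example; they are not taken from the RadicalSAM.org pipeline.

```python
def conserved_cys_columns(msa_rows, min_fraction=0.9):
    """Return 0-based alignment columns where Cys ('C') occurs in at least
    min_fraction of the aligned sequences. Gaps ('-') count as non-Cys.
    The 0.9 cutoff is an illustrative assumption."""
    if not msa_rows:
        return []
    width = len(msa_rows[0])
    if any(len(row) != width for row in msa_rows):
        raise ValueError("all MSA rows must have the same aligned length")
    conserved = []
    for col in range(width):
        cys_count = sum(1 for row in msa_rows if row[col].upper() == "C")
        if cys_count / len(msa_rows) >= min_fraction:
            conserved.append(col)
    return conserved


# Toy alignment (made-up sequences, not real RSS members).
toy_msa = [
    "MKCAYNCRYC-G",
    "MRCGFDCSFCAG",
    "MKCPLRCVYC-A",
    "MSCNHACKHCSG",
]
cys_columns = conserved_cys_columns(toy_msa)
print(len(cys_columns), cys_columns)
```

In this toy alignment three columns are fully conserved as Cys; a real subgroup MSA would be read from a file and would typically need a cutoff tuned to its diversity.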
Importantly, RadicalSAM.org provides genome context (genome neighborhood networks, GNNs, and genome neighborhood diagrams, GNDs) for the bacterial and archaeal members of the subgroups, thereby providing metabolic pathway context for inferring novel enzymatic activities and physiological functions.
This release of RadicalSAM.org includes RSS sequences in UniProt Release 2020_05/InterPro Release 82 (October 7, 2020). Option B of EFI-EST was used to collect sequences from the UniRef50 database using as query a list of 172 InterPro families/domains and 1 Pfam family; the list of families is provided in the Sequence Families tab. UniRef50 clusters were used because these provide a manageable number of SSN nodes and edges (using a Mac Pro desktop with 768 GB RAM). The UniRef database is described at: https://www.uniprot.org/help/uniref.
We used a dataset containing only "Complete" UniRef50 cluster IDs filtered to include sequences with ≥140 residues (50,232 of 52,886 UniRef50 clusters; include 616,009 of 620,386 UniProt IDs). The sequences in the UniRef50/UniRef90 clusters WERE NOT length-filtered.
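A minimal sketch of the length filter and of the retained fractions implied by the counts above. The helper name and the toy sequences are hypothetical; only the 140-residue threshold and the four counts come from the text, and the filter is assumed to act on the sequence that represents each UniRef50 cluster.

```python
MIN_LENGTH = 140  # residues; threshold stated in the text


def filter_by_length(sequences, min_length=MIN_LENGTH):
    """Keep entries whose sequence has at least min_length residues.
    `sequences` maps an ID to its amino acid sequence."""
    return {sid: seq for sid, seq in sequences.items() if len(seq) >= min_length}


# Toy input (placeholder IDs and sequences).
toy = {"A": "M" * 139, "B": "M" * 140, "C": "M" * 412}
kept = filter_by_length(toy)
print(sorted(kept))

# Fractions retained, computed from the counts reported above.
clusters_kept, clusters_total = 50_232, 52_886
ids_kept, ids_total = 616_009, 620_386
cluster_fraction = clusters_kept / clusters_total
id_fraction = ids_kept / ids_total
print(f"{cluster_fraction:.1%} of UniRef50 clusters, {id_fraction:.1%} of UniProt IDs")
```

By these counts, the filter removes roughly 5% of the "Complete" UniRef50 clusters but less than 1% of the UniProt IDs.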
Future releases of RadicalSAM.org may use the "Complete" plus "Fragment" UniRef50 cluster dataset filtered to include sequences with ≥140 residues, with the benefit of providing additional genome context information. However, in this initial release, we adopted a "conservative" approach for including sequences.
Family | Short Name |
---|---|
IPR000385 | MoaA_NifB_PqqE_Fe-S-bd_CS |
IPR001989 | Radical_activat_CS |
IPR002684 | Biotin_synth/BioAB |
IPR003698 | Lipoyl_synth |
IPR003739 | Lys_aminomutase/Glu_NH3_mut |
IPR004383 | rRNA_lsu_MTrfase_RlmN/Cfr |
IPR004558 | Coprogen_oxidase_HemN |
IPR004559 | HemW-like |
IPR005839 | Methylthiotransferase |
IPR005840 | Ribosomal_S12_MeSTrfase_RimO |
IPR005909 | RaSEA |
IPR005911 | YhcC-like |
IPR005980 | Nase_CF_NifB |
IPR006463 | MiaB_methiolase |
IPR006466 | MiaB-like_B |
IPR006467 | MiaB-like_C |
IPR006638 | Elp3/MiaB/NifB |
IPR007197 | rSAM |
IPR010505 | Mob_synth_C |
IPR010722 | BATS_dom |
IPR010723 | HemN_C |
IPR011101 | DUF5131 |
IPR011843 | PQQ_synth_PqqE_bac |
IPR012726 | ThiH |
IPR012837 | NrdG |
IPR012838 | PFL1_activating |
IPR012839 | Organic_radical_activase |
IPR013483 | MoaA |
IPR013704 | UPF0313_N |
IPR013848 | Methylthiotransferase_N |
IPR013917 | tRNA_wybutosine-synth |
IPR016431 | Pyrv-formate_lyase-activ_prd |
IPR016771 | Fe-S_OxRdtase_rSAM_TM0948_prd |
IPR016779 | rSAM_MSMEG0568 |
IPR016863 | DesII |
IPR017200 | PqqE-like |
IPR017742 | Deazaguanine_synth |
IPR017833 | Hopanoid_synth-assoc_rSAM_HpnH |
IPR017834 | Hopanoid_synth-assoc_rSAM_HpnJ |
IPR019939 | CofG_family |
IPR019940 | CofH_family |
IPR020050 | FO_synthase_su2 |
IPR020612 | Methylthiotransferase_CS |
IPR022431 | Cyclic_DHFL_synthase_mqnC |
IPR022432 | Aminodeoxyfutalosine_synthase |
IPR022447 | Lys_aminomutase-rel |
IPR022459 | Lysine_aminomutase |
IPR022462 | EpmB |
IPR022881 | rRNA_lsu_MeTfrase_Cfr |
IPR022946 | UPF0313 |
IPR023404 | rSAM_horseshoe |
IPR023819 | Pep-mod_rSAM_AF0577 |
IPR023821 | rSAM_TatD-assoc |
IPR023822 | rSAM_TatD-assoc_bac |
IPR023868 | 7-CO-7-deazaGua_synth_put_Clo |
IPR023880 | Benzylsucc_Synthase_activating |
IPR023885 | 4Fe4S-binding_SPASM_dom |
IPR023886 | QH-AmDH_gsu_maturation |
IPR023891 | Pyrrolys_PylB |
IPR023897 | Spore_PP_lysase |
IPR023904 | Pep_rSAM_mat_YydG |
IPR023912 | YjjW_bact |
IPR023913 | Mycofactocin_rSAM_pep_mat |
IPR023930 | NirJ1 |
IPR023969 | CHP04072_B12-bd/rSAM |
IPR023979 | CHP04014_B12-bd/rSAM |
IPR023980 | CHP04013_B12-bd/rSAM |
IPR023992 | HemeD1_Synth_rSAM_NirJ |
IPR023993 | TYW1_archaea |
IPR023995 | HemZ |
IPR024001 | Cys-rich_pep_rSAM_mat_CcpM |
IPR024007 | FeFe-hyd_mat_HydG |
IPR024016 | CHP04064_rSAM |
IPR024017 | Pep_cycl_rSAM |
IPR024018 | CHP04083_rSAM |
IPR024023 | rSAM_paired_HxsB |
IPR024025 | SCIFF_rSAM_maturase |
IPR024025 | SCIFF_rSAM_maturase |
IPR024177 | Biotin_synthase |
IPR024560 | UPF0313_C |
IPR024924 | 7-CO-7-deazaguanine_synth-like |
IPR025895 | LAM_C_dom |
IPR026322 | Geopep_mat_rSAM |
IPR026332 | HutW |
IPR026335 | SAM_SPASM_FxsB |
IPR026344 | SCM_rSAM_ScmE |
IPR026346 | SCM_rSAM_ScmF |
IPR026357 | rSAM/SPASM_prot_GRRM_system |
IPR026404 | rSAM_w_lipo |
IPR026407 | SAM_GG-Bacter |
IPR026412 | rSAM_Cxxx_rpt |
IPR026423 | rSAM_cobopep |
IPR026423 | rSAM_cobopep |
IPR026426 | rSAM_FibroRumin |
IPR026429 | MIA_synthase |
IPR026447 | B12_SAM_Ta0216 |
IPR026482 | rSAM_nif11_3 |
IPR027492 | RNA_MTrfase_RlmN |
IPR027526 | Lipoyl_synth_chlpt |
IPR027527 | Lipoyl_synth_mt |
IPR027564 | HpnR_B12_rSAM |
IPR027570 | GeoRSP_rSAM |
IPR027583 | rSAM_ACGX |
IPR027586 | rSAM_metal_mat |
IPR027596 | AmmeMemoSam_rS |
IPR027604 | W_rSAM_matur |
IPR027609 | rSAM_QueE_Proteobac |
IPR027621 | rSAM_QueE_gams |
IPR027633 | rSAM_NirJ2 |
IPR030801 | Glu_2_3_NH3_mut |
IPR030837 | B12_rSAM_cofa1 |
IPR030896 | rSAM_AhbD_hemeb |
IPR030905 | CutC_activ_rSAM |
IPR030915 | rSAM_SkfB |
IPR030933 | Non_iron_rSAM |
IPR030969 | B12_rSAM_trp_MT |
IPR030977 | QueE_Cx14CxxC |
IPR030989 | rSAM_XyeB |
IPR031003 | BcpD_PhpK_rSAM |
IPR031012 | rSAM_mob_pairB |
IPR031015 | Arg_2_3_am_muta |
IPR031691 | LIAS_N |
IPR031691 | LIAS_N |
IPR032432 | Radical_SAM_C |
IPR033971 | Avilamycin_epimerase |
IPR033974 | Glycerol_dehydratase_activase |
IPR033975 | ThnP-like |
IPR033976 | GntE-like |
IPR034165 | NifB_C |
IPR034386 | BtrN-like |
IPR034391 | Cmo-like_SPASM_containing |
IPR034405 | F420 |
IPR034422 | HydE/PylB-like |
IPR034428 | ThiH/NoCL/HydG-like |
IPR034436 | NocN/NosN-like |
IPR034438 | 4-hPhe_decarboxylase_activase |
IPR034457 | Organic_radical-activating |
IPR034462 | Benzylsuc_synthase_activase |
IPR034465 | Pyruvate_for-lyase_activase |
IPR034466 | Methyltransferase_Class_B |
IPR034471 | 7_8-dihydro-6-hydroxymethylpte |
IPR034474 | Methyltransferase_Class_D |
IPR034479 | AhbC-like |
IPR034480 | Heme_carboxy_lyase-like |
IPR034485 | Anaerobic_Cys-type_sulfatase-m |
IPR034491 | Anaerob_Ser_sulfatase-maturase |
IPR034497 | Bacteriochlorophyll_C12_MT |
IPR034498 | Bacteriochlorophyll_C8_MT |
IPR034505 | Coproporphyrinogen-III_oxidase |
IPR034508 | Spectinomycin_biosynthesis |
IPR034514 | ThnK-like |
IPR034515 | ThnL-like |
IPR034529 | Fom3-like |
IPR034530 | HpnP-like |
IPR034531 | Methylation_of_yatakemycin |
IPR034534 | Pyrimidine_methyltransferase |
IPR034547 | Tte1186a_maturase |
IPR034556 | tRNA_wybutosine-synthase |
IPR034557 | ThrcA_tRNA_MEthiotransferase |
IPR034559 | Spore_PP_lysase_Clostridia |
IPR034560 | Spore_PP_lysase_Bacilli |
IPR034687 | ELP3-like |
IPR038135 | Methylthiotransferase_N_sf |
IPR039661 | ELP3 |
IPR039661 | ELP3 |
IPR040063 | QhpD-like/Tte1186a |
IPR040072 | Methyltransferase_A |
IPR040074 | BssD/PflA/YjjW |
IPR040081 | CndI-like |
IPR040082 | GenK-like |
IPR040085 | MJ0674-like |
IPR040086 | MJ0683-like |
PF13186 | SPASM |
UniProt designates the "Sequence Status" for each sequence: "Complete" if the encoding DNA sequence includes both a start and stop codon; "Fragment" if one or both of these codons are absent. A "Fragment" may result if the coding DNA sequence is at the end of a contig. A "Complete" sequence need not be a "full length" sequence, e.g., it may be truncated as the result of sequencing errors that produce incorrect start and/or stop codons.
Option B identified 664,196 "Complete" and "Fragment" UniProt sequences in 66,428 UniRef50 clusters; these represent 579,102 unique sequences (100% sequence identity over 100% of the length). Option B identified 620,386 "Complete" sequences in 52,886 UniRef50 clusters; these represent 535,892 unique sequences.
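The "unique sequences" criterion (100% identity over 100% of the length) amounts to collapsing exactly identical strings. The sketch below illustrates this on toy records and then derives the number of redundant entries from the counts above; the function name and the toy accessions are hypothetical.

```python
def count_unique_sequences(records):
    """Count distinct sequences among (accession, sequence) pairs.
    Two entries are identical only if their full strings match."""
    return len({seq.upper() for _, seq in records})


# Toy records (placeholder accessions and sequences).
toy_records = [("ID1", "MKCAYNC"), ("ID2", "MKCAYNC"), ("ID3", "MKCAYNCA")]
toy_unique = count_unique_sequences(toy_records)
print(toy_unique)

# Redundant entries implied by the counts reported above.
redundant_all = 664_196 - 579_102       # "Complete" plus "Fragment"
redundant_complete = 620_386 - 535_892  # "Complete" only
print(redundant_all, redundant_complete)
```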
We remove "short" (truncated) sequences from our datasets to 1) improve the quality of the MSAs used to generate WebLogos and HMMs and 2) minimize the number of singletons in SSNs generated with alignment scores that collect sequences into "isofunctional" clusters. We designate sequences as "short" if they cannot encode a functional RSS enzyme. By inspecting the UniProt ID length histograms for the subgroups in RadicalSAM.org (generated using the Cluster Analysis utility of EFI-EST), we identified the anaerobic ribonucleotide-triphosphate reductase activating enzyme family with members that contain ≥140 residues (Megacluster-3-3-1) as the "shortest" functional RSS family.
Therefore, we used UniRef50 IDs that contain ≥140 residues to construct the SSN for the RSS.
It is instructive for the user to be familiar with the length distribution of the UniRef50 IDs (i.e., nodes) in our datasets and resulting SSNs. The length histograms (UniProt IDs, UniRef90 cluster IDs, and UniRef50 cluster IDs) for the "Complete" and "Complete" plus "Fragment" datasets are shown below.
With its large number of sequences, the RSS requires a large-scale approach for identifying subgroups with related functions. Following the strategy used by the SFLD, we generated the SSN for the RSS using UniRef50 cluster IDs and then segregated it into clusters, some defining the 20 SFLD-curated subgroups and others defining uncharacterized sequences. In this section, we describe our procedure for identifying subgroups.
Given the large numbers of sequences and SSN clusters, we cannot guarantee that the subgroup segregation is "perfect". Indeed, we expect that it is not. However, the identified subgroups provide manageable starting points for the discovery of novel functions using EFI-EST, EFI-GNT, and EFI-CGFP.
Using an alignment score of 11 that groups UniRef50 clusters into identifiable SFLD subgroups and minimizes the number of singletons (visual inspection), the SSN contains 50,232 nodes and 41,476,118 edges. We used a Mac Pro desktop computer with 768 GB RAM to visualize and manipulate the SSN using Cytoscape 3.8.2.
The large cluster (50,084 nodes and 41,476,016 edges) was selected for identification of the subgroups.
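A short calculation using only the node and edge counts quoted above shows how little of the network lies outside the large cluster and how densely the network is connected. The variable names are illustrative.

```python
# Counts quoted in the text for the SSN at an alignment score of 11.
nodes_total, edges_total = 50_232, 41_476_118
nodes_large, edges_large = 50_084, 41_476_016

# Nodes and edges that are NOT part of the large cluster.
nodes_outside = nodes_total - nodes_large
edges_outside = edges_total - edges_large

# Fraction of all possible node pairs that are connected, and mean edges per node.
possible_edges = nodes_total * (nodes_total - 1) // 2
density = edges_total / possible_edges
mean_degree = 2 * edges_total / nodes_total

print(nodes_outside, edges_outside)
print(f"density = {density:.3f}, mean degree = {mean_degree:.0f}")
```

Only 148 nodes and 102 edges fall outside the large cluster, and each node has on the order of 1,650 edges on average, which is consistent with the memory demands described above.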
The nodes associated with SFLD subgroups were identified by coloring the nodes according to the InterPro family (F)/domain (D) that includes the SFLD subgroup (Table 1); the node colors for the SFLD-curated subgroups are those used by the SFLD. Clusters associated with four additional families/domains described by InterPro were also colored.
Subgroup | Subgroup Name | IPR # | Color |
---|---|---|---|
1 | 7-carboxy-7-deazaguanine synthase-like (F) | IPR024924 | Teal |
2 | Coproporphyrinogen III oxidase-like (F) | IPR034505 | Red |
3 | Antiviral proteins (viperin) (F) | -- | -- |
4 | Avilamycin synthase (F) | IPR033971 | Pink |
5 | B12-binding domain containing (D) | IPR006158 | Blue |
6 | BATS domain containing (D) | IPR010722 | Orange |
7 | DesII-like (F) | IPR016863 | Mauve |
8 | ELP3/YhcC (F) | IPR039661 | Black |
9 | F420, menaquinone cofactor biosynthesis (F) | IPR034405 | Purple |
10 | FeMo-cofactor biosynthesis protein (F) | IPR005980 | Mint green |
11 | Lipoyl synthase like (F) | IPR003698 | Yellow |
12 | Methylthiotransferase (D) | IPR013848 | Verdun green |
13 | Methyltransferase Class A (F) | IPR040072 | Dark brown |
14 | Methyltransferase Class D (F) | IPR034474 | Light pink |
15 | Organic radical activating enzymes (F) | IPR034457 | Cyan |
16 | PLP-dependent (F) | IPR003739 | Dark green |
17 | SPASM/twitch domain containing (D) | IPR023885 | Magenta |
18 | Spectinomycin biosynthesis (F) | IPR034508 | White |
19 | Spore photoproduct lyase (F) | IPR023897 | Green |
20 | tRNA wybutosine-synthesizing (F) | IPR034556 | Brown |
-- | Protein MJ0683-like (F) | IPR040086 | Electric lime |
-- | Uncharacterized protein family UPF0313 (F) | IPR022946 | Olive |
-- | DUF5131 (F) | IPR011101 | Light purple |
-- | 3',8-Cyclase/Mo cofactor synthesis (D) | IPR010505 | Dodger blue |
In contrast to the SFLD's SSN (Background tab), the subgroups are not separated into distinct clusters in this SSN; they form a single cluster! This results from both the larger number of nodes and the choice of a smaller alignment score, which was made to avoid splitting SFLD subgroups into multiple clusters and to reduce the number of singletons (the SFLD used an edge threshold of 1e-20).
The SFLD subgroups were separated by selection/deletion of "long" SSN edges ("remote" sequence relationships). This editing is subjective, but we know of no other practical strategy to separate the subgroups. That 1) the nodes associated with the various subgroups colocalize in the starting SSN and 2) their colocalization is maintained in the editing supports the validity of this approach.
In several of the SFLD subgroups, some nodes are grey because they are not recognized by the HMMs used by InterPro (many of the HMMs were generated when the RSS was significantly smaller/less diverse). As a result, the InterPro families (and their HMMs) cannot be used to provide the members of the subgroups.
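The coloring rule can be sketched as a lookup from a node's InterPro annotations to the Table 1 palette, with grey as the fallback for nodes that no listed family/domain recognizes. Only a few Table 1 rows are included, and the function name and the "first match wins" rule for nodes with several matching annotations are assumptions made for this example.

```python
# Subset of Table 1 (InterPro ID -> node color).
TABLE1_COLORS = {
    "IPR024924": "Teal",     # Subgroup 1
    "IPR034505": "Red",      # Subgroup 2
    "IPR006158": "Blue",     # Subgroup 5
    "IPR034457": "Cyan",     # Subgroup 15
    "IPR023885": "Magenta",  # Subgroup 17
}


def node_color(interpro_ids, palette=TABLE1_COLORS, default="Grey"):
    """Return the color of the first annotation found in the palette,
    or the default (grey) when none of the annotations is listed."""
    for ipr in interpro_ids:
        if ipr in palette:
            return palette[ipr]
    return default


print(node_color(["IPR007197", "IPR023885"]))
print(node_color(["IPR007197"]))
```

A node annotated only with the general rSAM entry (IPR007197) stays grey, which mirrors the grey nodes described above.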
The nodes associated with either SFLD Subgroup 5, B12-binding domain (blue nodes), or SFLD Subgroup 17, SPASM/Twitch-domain (magenta nodes), are not colocalized in a "spherical", well-organized cluster because the sequence-function space is more diverse than that for the other subgroups. As described in the Functionally Diverse Subgroups tab, RadicalSAM.org provides a strategy for identifying isofunctional clusters in these subgroups.
The resulting "edge-edited" SSN contained 10 clusters, five "megaclusters" containing multiple SFLD subgroups [(mega)clusters are numbered in order of decreasing number of UniRef50 IDs/nodes, 1 through 5] and five clusters containing a single SFLD subgroup or InterPro family (numbered in order of decreasing UniRef50 IDs/nodes, 6 through 10).
As described in the tabs, Megaclusters-1, -2, -3, -4, and -5 were segregated into component subgroups.
Megacluster-1 also contains the large GTP 3’,8-cyclase family (Megacluster-1-2; "Dodger blue" nodes). The sequences in Megaclusters-1-2, -1-3, -1-4, and -1-5 were identified as discrete clusters in the UniRef50 SSN for Megacluster-1 generated with an alignment score of 30; these were removed from Megacluster-1 to generate the SSN for Subgroup 17 (SPASM/Twitch domain; Megacluster-1-1; magenta nodes).
Two uncharacterized clusters "loosely" connected to Megacluster-1 were segregated by manual edge deletion (Megacluster-1-6 and Megacluster-1-7; grey nodes). Megacluster-1-6 contains the Swiss-Prot curated [methyl-coenzyme M reductase subunit alpha]-arginine C-methyltransferase function.
The UniRef50 SSN for the resulting Megacluster-1-1 is displayed below.
The UniRef50 SSN for the additional four SFLD subgroups, GTP 3’,8-cyclases, and loosely connected clusters is displayed below (generated with an alignment score of 18 and edge-edited).
Megacluster-2 contains four SFLD subgroups: Subgroup 5 (B12-binding domain, Megacluster-2-1, blue nodes); Subgroup 2 (Coproporphyrinogen III oxidase-like, Megacluster-2-2, red nodes), Subgroup 12 (Methylthiotransferase, Megacluster-2-3, Verdun green nodes); and Subgroup 8 (Elongator protein 3, Megaclusters-2-4 and -2-5, black nodes).
The UniRef50 SSN for Megacluster-2 was generated with an alignment score of 14 (increase from the initial alignment score of 11); the "long" edges were deleted to segregate the clusters/subgroups. The clusters are numbered in order of decreasing number of UniRef50 IDs/nodes.
Megacluster-2-4 was segregated into subclusters as indicated on the Explore page for the cluster.
Megacluster-3 contains two SFLD subgroups: Subgroup 1 (7-carboxy-7-deazaguanine synthase-like, Megacluster-3-1, teal nodes) and Subgroup 15 (Organic radical activating enzymes, Megacluster-3-2; cyan nodes).
The UniRef90 IDs in Megacluster-3 were used to generate the SSN with an alignment score of 20 (increase from the initial alignment score of 11); the "long" edges were deleted to segregate the clusters/subgroups. The increase in node resolution to UniRef90 ensures that the subclusters obtained from "edge-editing" will be isofunctional. The clusters are numbered in order of decreasing number of UniRef90 IDs/nodes.
Where indicated on their Explore pages, several of the Megacluster-3-N clusters were further segregated so that different SwissProt functions are located in distinct subclusters.
Megacluster-4 contains two SFLD subgroups: Subgroup 9 (F420, menaquinone cofactor biosynthesis, Megacluster-4-1; purple nodes) and Subgroup 6 (BATS domain containing, Megaclusters-4-2, -4-3, -4-5, and -4-10, orange nodes).
The UniRef90 IDs in Megacluster-4 were used to generate the SSN with an alignment score of 22 (increase from the initial alignment score of 11); the "long" edges were deleted to segregate the clusters/subgroups. The increase in node resolution to UniRef90 ensures that the subclusters obtained from "edge-editing" will be isofunctional. The clusters are numbered in order of decreasing number of UniRef90 IDs/nodes.
Where indicated on their Explore pages, several of the Megacluster-4-N clusters were further segregated so that different SwissProt functions are located in distinct subclusters.
Megacluster-5 contains one SFLD subgroup and two additional InterPro families: Protein MJ0683-like (Megacluster-5-1, electric lime nodes); DUF5131 (Megacluster-5-2, light purple nodes); and Subgroup 19 (Spore photoproduct lyase, Megacluster-5-3, green nodes).
The UniRef50 SSN for Megacluster-5 was generated with an alignment score of 14 (increase from the initial alignment score of 11); the "long" edges were deleted to segregate the clusters/subgroups. The clusters are numbered in order of decreasing number of UniRef50 IDs/nodes.
Four clusters each contained one SFLD subgroup and the fifth contained one InterPro family: Subgroup 13 (Methyltransferase Class A, Cluster-6, dark brown nodes); Subgroup 16 (PLP-dependent, Cluster-7, dark green nodes); Subgroup 11 (Lipoyl synthase like, Cluster-8, yellow nodes); Subgroup 20 (tRNA wybutosine-synthesizing, Cluster-9, brown nodes); and UPF0313 (Cluster-10, olive nodes).
These clusters were used without further editing. The clusters are numbered in order of decreasing number of UniRef50 IDs/nodes.
SFLD subgroup 2 (Megacluster-2-2; oxygen-independent coproporphyrinogen III oxidase like; red nodes in the SSN), SFLD subgroup 5 (Megacluster-2-1; B12-binding domain; blue nodes), SFLD subgroup 16 (Cluster-7; PLP-dependent; dark green nodes) and SFLD subgroup 17 (Megacluster-1-1; SPASM/Twitch domain-containing; magenta nodes) are large and functionally diverse. Segregation of the SSNs for these subgroups into isofunctional families/clusters is not easy; for example, a single alignment score threshold cannot be used to segregate the SSN into isofunctional clusters.
The SSN for the RSS was generated with UniRef50 clusters/IDs. This coarse granularity is sufficient for identifying subgroups (SFLD-curated and uncharacterized clusters). However, because these nodes conflate sequences sharing ≥50% sequence identity, they can produce nodes and SSN clusters that remain heterofunctional even as the alignment score is increased, thereby confusing interpretation of MSAs and genome context. Therefore, UniRef90 clusters/IDs were used to generate the SSNs for these subgroups to maximize the likelihood that as the alignment score is increased, the SSN clusters will become isofunctional.
Because no single alignment score threshold segregates these SSNs into isofunctional clusters, a series of SSNs was generated for each subgroup with increasing alignment scores. As the alignment score increases, both the sizes of the clusters and their functional complexity decrease, with isofunctional clusters segregating at alignment scores that are characteristic of the function. We refer to this strategy as "dicing".
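The idea behind "dicing" can be sketched as repeated thresholding of one scored edge list followed by a connected-components search at each threshold. This is a toy illustration, not the EFI-EST implementation: the function names, the union-find approach, the rule that an edge is kept when its score is at least the threshold, and the six-node example are all assumptions.

```python
def clusters_at(nodes, scored_edges, min_score):
    """Connected components using only edges with score >= min_score."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the root, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b, score in scored_edges:
        if score >= min_score:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    # Largest cluster first, ties broken by member names.
    return sorted(groups.values(), key=lambda g: (-len(g), sorted(g)))


def dice(nodes, scored_edges, scores):
    """Map each alignment score in `scores` to its list of clusters."""
    return {s: clusters_at(nodes, scored_edges, s) for s in scores}


# Toy network: two tight triplets joined by one weak edge.
toy_nodes = ["a", "b", "c", "d", "e", "f"]
toy_edges = [("a", "b", 80), ("b", "c", 80), ("d", "e", 90),
             ("e", "f", 90), ("c", "d", 30)]
diced = dice(toy_nodes, toy_edges, [25, 50, 100])
for score, clusters in diced.items():
    print(score, [sorted(c) for c in clusters])
```

In the toy network, the single cluster at the lowest score splits into two triplets at the intermediate score and dissolves into singletons at the highest score.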
The "AS Walk-Through" function is provided so that the user can 1) "walk" forward from any cluster to its progeny clusters in the SSN with the next alignment score or 2) "walk" backward to its progenitor cluster in the SSN with the previous alignment score. These connections may allow the discovery of divergent functions that share mechanistic attributes.
The "AS Walk-Through" window provides SwissProt and user-provided annotations, if available, for the progenitor cluster and progeny clusters. These provide "landmarks" for exploring sequence-function space.
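The progenitor/progeny relationship can be sketched with plain set operations, assuming each cluster is represented as a set of node IDs and that raising the alignment score only removes edges (so clusters at a higher score nest inside clusters at a lower score). The function name and the toy clusters are hypothetical and do not reflect how the website stores its data.

```python
def walk_through(cluster, previous_clusters, next_clusters):
    """Return (progenitor, progeny) for `cluster`.

    progenitor: the cluster at the previous (lower) alignment score that
                contains every node of `cluster`, or None if none does.
    progeny:    clusters at the next (higher) alignment score that share
                at least one node with `cluster`.
    """
    cluster = set(cluster)
    progenitor = next(
        (set(p) for p in previous_clusters if cluster <= set(p)), None
    )
    progeny = [set(n) for n in next_clusters if set(n) & cluster]
    return progenitor, progeny


# Toy clusters at three consecutive alignment scores.
previous_ssn = [{"a", "b", "c", "d", "e", "f"}]
current_cluster = {"a", "b", "c", "d"}
next_ssn = [{"a", "b"}, {"c", "d"}, {"e", "f"}]

progenitor, progeny = walk_through(current_cluster, previous_ssn, next_ssn)
print(sorted(progenitor))
print([sorted(p) for p in progeny])
```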
For each cluster in each SSN, the Explore page provides information to assess whether the cluster is isofunctional, e.g., convergence ratio, number of conserved Cys residues, and, most importantly, genome neighborhood diagrams (GNDs) for the sequences in the clusters.
As explained in the description of the Search function, the clusters in the SSNs can be searched with a UniProt ID or sequence.
The UniRef50 nodes/IDs in Megacluster-1-1 (obtained as described in the Megacluster-1 subtab of the Subgroup tab) were expanded to UniRef90 nodes/IDs. The SSN was generated with Option D of EFI-EST followed by analyses using a series of 33 alignment scores (from 25 to 70 in increments of 5, from 80 to 300 in increments of 10).
The Explore page for Megacluster-1-1 displays the SSN generated with an alignment score of 11 and provides access to various types of bioinformatic information about Megacluster-1-1. The Explore page provides a link to the Diced SSNs page. The clusters in each of the 33 "diced" SSNs can be viewed by selecting the alignment score (a cluster contains ≥3 UniRef90 IDs/nodes). As the alignment score increases, the clusters decrease in size and complexity. Also, as the alignment score increases, the number of clusters initially increases as the large clusters segregate and then decreases as the small clusters "dissociate" into individual nodes.
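The alignment score series can be enumerated directly to confirm the stated counts. The helper below is a hypothetical convenience that takes (start, stop, step) segments with an inclusive stop; the segments themselves are the ones quoted for the functionally diverse subgroups.

```python
def score_series(segments):
    """Expand (start, stop, step) segments into one list of alignment
    scores; `stop` is inclusive."""
    scores = []
    for start, stop, step in segments:
        scores.extend(range(start, stop + 1, step))
    return scores


mega_1_1 = score_series([(25, 70, 5), (80, 300, 10)])  # Megacluster-1-1
mega_2_1 = score_series([(35, 325, 10)])               # Megacluster-2-1
mega_2_2 = score_series([(60, 250, 10)])               # Megacluster-2-2 and Cluster-7

print(len(mega_1_1), len(mega_2_1), len(mega_2_2))
```

The three lists contain 33, 30, and 20 alignment scores, matching the numbers of "diced" SSNs given for the respective subgroups.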
The Click here link on the Diced SSNs page accesses the Explore page for Megacluster-1-1-1 in the SSN generated with an alignment score of 25. On that (and any) Explore page, any cluster in the current SSN can be selected; also, SSNs generated with other alignment scores can be selected.
Each Explore page includes the "AS Walk-Through" button above the image for the cluster. The "AS Walk-Through" function allows the user to "walk through" the series of "diced" SSNs, allowing identification of the progeny of a cluster (walking forward) or the progenitor of a cluster (walking backward). This function allows, for example, analyses of 1) speciation of orthologues (with the taxonomic distribution of the cluster available via the TAXONOMY button) and 2) divergent evolution of functions from a common progenitor.
Clicking the "AS Walk-Through" button opens a window that identifies 1) the cluster in the previous SSN in the series that contained the sequences in the cluster and 2) the cluster(s) in the next SSN that contain(s) the sequences in the cluster. For each cluster in the window, the number of nodes and CR are provided as well as SwissProt functions and user-contributed annotation, if these are available. The cluster is a link to the cluster; clicking the link opens the Explore page for the cluster.
Several "diced" clusters that lack conserved Cys motifs in a C-terminal domain can be identified (Megacluster-1-1 was identified using pairwise sequence similarity using all UniRef50 IDs in the RSS, so some "outliers" can be expected). And, although Subgroup 17 has been designated as "SPASM/Twitch domain-containing" (PF13186 and IPR023885), inspection of the Explore pages reveals that clusters contain a wide variety of C-terminal (and N-terminal) domain Cys-rich motifs, ranging in number from 1 to 27 Cys residues (the paradigm SPASM domain contains 8 Cys residues for two Fe4S4 clusters; the paradigm Twitch domain contains 4 Cys for one Fe4S4 cluster; see reference 1).
1 doi: 10.1128/JB.00040-11
doi: 10.1074/jbc.R114.581249
doi: 10.1016/j.bbamcr.2015.01.002
doi: 10.1074/jbc.RA118.005369
The UniRef50 IDs in Megacluster-2-1 (obtained as described in the Megacluster-2 subtab of the Subgroup tab) were expanded to UniRef90 nodes/IDs. The SSN was generated with Option D of EFI-EST followed by analyses using a series of 30 alignment scores (from 35 to 325 in increments of 10).
The Explore page for Megacluster-2-1 displays the SSN generated with an alignment score of 11 and provides access to various types of bioinformatic information about Megacluster-2-1. The Explore page provides a link to the Diced SSNs page. The clusters in each of the 30 "diced" SSNs can be viewed by selecting the alignment score (a cluster contains ≥3 UniRef90 IDs/nodes). As the alignment score increases, the clusters decrease in size and complexity. Also, as the alignment score increases, the number of clusters initially increases as the large clusters segregate and then decreases as the small clusters "dissociate" into individual nodes.
The Click here link on the Diced SSNs page accesses the Explore page for Megacluster-2-1-1 in the SSN generated with an alignment score of 35. On that (and any) Explore page, any cluster in the current SSN can be selected; also, SSNs generated with other alignment scores can be selected.
Each Explore page includes the "AS Walk-Through" button above the image for the cluster. The "AS Walk-Through" function allows the user to "walk through" the series of "diced" SSNs, allowing identification of the progeny of a cluster (walking forward) or the progenitor of a cluster (walking backward). This function allows, for example, analyses of 1) speciation of orthologues (with the taxonomic distribution of the cluster available via the TAXONOMY button) and 2) divergent evolution of functions from a common progenitor.
Clicking the "AS Walk-Through" button opens a window that identifies 1) the cluster in the previous SSN in the series that contained the sequences in the cluster and 2) the cluster(s) in the next SSN that contain(s) the sequences in the cluster. For each cluster in the window, the number of nodes and CR are provided as well as SwissProt functions and user-contributed annotation, if these are available. The cluster is a link to the cluster; clicking the link opens the Explore page for the cluster.
Several "diced" clusters that lack the B12-binding domain (PF02310) can be identified (Megacluster-2-1 was identified using pairwise sequence similarity using all UniRef50 IDs in the RSS, so some "outliers" can be expected).
The UniRef50 nodes/IDs in Megacluster-2-2 (obtained as described in the Megacluster-2 subtab of the Subgroup tab) were expanded to UniRef90 nodes/IDs. The SSN was generated with Option D of EFI-EST followed by analyses using a series of 20 alignment scores (60 to 250 in increments of 10).
The Explore page for Megacluster-2-2 displays the SSN generated with an alignment score of 11 and provides access to various types of bioinformatic information about Megacluster-2-2. The Explore page provides a link to the Diced SSNs page. The clusters in each of the 20 "diced" SSNs can be viewed by selecting the alignment score (a cluster contains ≥3 UniRef90 IDs/nodes). As the alignment score increases, the clusters decrease in size and complexity. Also, as the alignment score increases, the number of clusters initially increases as the large clusters segregate and then decreases as the small clusters "dissociate" into individual nodes.
The Click here link on the Diced SSNs page accesses the Explore page for Megacluster-2-2-1 in the SSN generated with an alignment score of 60. On that (and any) Explore page, any cluster in the current SSN can be selected; also, SSNs generated with other alignment scores can be selected.
Each Explore page includes the "AS Walk-Through" button above the image for the cluster. The "AS Walk-Through" function allows the user to "walk through" the series of "diced" SSNs, allowing identification of the progeny of a cluster (walking forward) or the progenitor of a cluster (walking backward). This function allows, for example, analyses of 1) speciation of orthologues (with the taxonomic distribution of the cluster available via the TAXONOMY button) and 2) divergent evolution of functions from a common progenitor.
Clicking the "AS Walk-Through" button opens a window that identifies 1) the cluster in the previous SSN in the series that contained the sequences in the current cluster and 2) the cluster(s) in the next SSN that contain(s) those sequences. For each cluster in the window, the number of nodes and the CR are provided, along with SwissProt functions and user-contributed annotations, if these are available. Each cluster name is a link that opens the Explore page for that cluster.
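The progenitor/progeny relationship reported by the walk-through can be sketched as a set comparison between successive SSNs. This is only an illustration under stated assumptions: the function `walk_through`, the cluster labels, and the sequence IDs are hypothetical, and the actual implementation used by RadicalSAM.org is not described here.

```python
def walk_through(cluster_id, current, previous=None, following=None):
    """Return (progenitor, progeny) for a cluster.

    current, previous, and following each map a cluster name to the set
    of sequence IDs it contains in one SSN of the series.
    """
    members = current[cluster_id]

    progenitor = None
    if previous is not None:
        for name, ids in previous.items():
            if members <= ids:  # every member came from this cluster
                progenitor = name
                break

    progeny = []
    if following is not None:
        progeny = sorted(
            name for name, ids in following.items() if ids & members
        )
    return progenitor, progeny

# Hypothetical memberships at three successive alignment scores.
ssn_60 = {"AS60-1": {"s1", "s2", "s3", "s4", "s5", "s6", "s7"}}
ssn_70 = {"AS70-1": {"s1", "s2", "s3", "s4", "s5", "s6"}}
ssn_80 = {"AS80-1": {"s1", "s2", "s3"}, "AS80-2": {"s4", "s5", "s6"}}

result = walk_through("AS70-1", ssn_70, previous=ssn_60, following=ssn_80)
print(result)
```

Walking backward returns the single cluster that contained all of the sequences; walking forward returns every cluster in the next SSN that retains at least one of them.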
The UniRef50 IDs in Cluster-7 (obtained as described in the Clusters subtab of the Subgroup tab) were expanded to UniRef90 nodes/IDs. The SSN was generated with Option D of EFI-EST followed by analyses using a series of 20 alignment scores (60 to 250 in increments of 10).
The Explore page for Cluster-7 displays an image of the SSN generated with an alignment score of 11 and provides access to various types of bioinformatic information about Cluster-7. The Explore page provides a link to the Diced SSNs page. The clusters in each of the 20 "diced" SSNs can be viewed by selecting the alignment score (a cluster contains ≥3 UniRef90 IDs/nodes). As the alignment score increases, the clusters decrease in size and complexity. Also, as the alignment score increases, the number of clusters initially increases as the large clusters segregate and then decreases as the small clusters "dissociate" into individual nodes.
The Click here link on the Diced SSNs page accesses the Explore page for Cluster-7-1 in the SSN generated with an alignment score of 60. On that (and any) Explore page, any cluster in the current SSN can be selected; also, SSNs generated with other alignment scores can be selected.
Each Explore page includes the "AS Walk-Through" button above the image for the cluster. The "AS Walk-Through" function allows the user to "walk through" the series of "diced" SSNs, allowing identification of the progeny of a cluster (walking forward) or the progenitor of a cluster (walking backward). This function allows, for example, analyses of 1) speciation of orthologues (with the taxonomic distribution of the cluster available via the TAXONOMY button) and 2) divergent evolution of functions from a common progenitor.
Clicking the "AS Walk-Through" button opens a window that identifies 1) the cluster in the previous SSN in the series that contained the sequences in the current cluster and 2) the cluster(s) in the next SSN that contain(s) those sequences. For each cluster in the window, the number of nodes and the CR are provided, along with SwissProt functions and user-contributed annotations, if these are available. Each cluster name is a link that opens the Explore page for that cluster.
This section describes the Search function results when the input UniProt ID or sequence matches a sequence located in a cluster (≥3 UniRef90 nodes) in the functionally diverse subgroups 2, 5, 16, and 17. Refer to the description of the Search tab for results when the input matches one of the other subgroups.
With the Find by UniProt ID function, the user provides a UniProt ID for searching all of the clusters in RadicalSAM.org. When the ID is located in the "diced" SSNs of a functionally diverse subgroup, the Results page provides a list of clusters in the "diced" SSNs that contain the UniProt ID along with the number of UniProt IDs, number of cluster nodes, and UniProt ID convergence ratio (CR; described on the Explore Pages tab). The clusters are links to the Explore page for that cluster (see Explore Pages tab for a description of the information provided on the Explore page).
As the alignment score used to generate the diced clusters increases, a UniProt ID may be located in a cluster with ≤2 UniRef90 ID nodes. When this occurs, the Search will report "ID not found". Generation of the MSA, WebLogo, HMM, Length Histograms, tables of Conserved Cys Residues, and files with IDs and FASTA sequences for a cluster requires that the cluster contain ≥3 UniRef90 nodes.
With the Find by Sequence function, the user provides a sequence (with/without a FASTA header) that is first used to search the HMMs of the subgroups to identify the subgroup with the smallest e-value. If the smallest e-value is for a functionally diverse subgroup (SFLD subgroup 2, 5, 16, or 17; Megacluster-1-1, -2-1, or -2-2, or Cluster-7), the HMMs for all of the clusters in the series of "diced" SSNs in that subgroup are searched.
The Results page provides a list of the three clusters at each alignment score with the smallest e-values along with the number of UniProt IDs, number of cluster nodes, and UniProt ID convergence ratio (CR; described on the Explore Pages tab). The cluster identifiers are links to the Explore page for that cluster (see Explore Pages tab for a description of the information provided on the Explore page).
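The selection of the three best clusters at each alignment score can be sketched as follows. This is a minimal illustration, assuming the HMM search results have already been reduced to (alignment score, cluster, e-value) tuples; the function `best_clusters`, the cluster labels, and the e-values are hypothetical.

```python
import heapq
from collections import defaultdict

def best_clusters(hits, n=3):
    """Group (alignment_score, cluster, evalue) hits by alignment score
    and keep the n clusters with the smallest e-values in each group."""
    by_score = defaultdict(list)
    for alignment_score, cluster, evalue in hits:
        by_score[alignment_score].append((evalue, cluster))
    return {
        score: [cluster for _, cluster in heapq.nsmallest(n, rows)]
        for score, rows in sorted(by_score.items())
    }

# Hypothetical hits for one query sequence.
hits = [
    (60, "c1", 1e-50), (60, "c2", 1e-20), (60, "c3", 1e-5), (60, "c4", 1e-30),
    (70, "c5", 1e-80), (70, "c6", 1e-10),
]
best = best_clusters(hits)
print(best)
```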
As the alignment score used to generate the "diced" SSN increases, the e-value typically decreases because nonorthologous sequences are removed and the cluster becomes orthologous/isofunctional. If the user-provided sequence segregates into a cluster with ≤2 UniRef90 ID nodes, for which an HMM is not generated, the Search results will continue to identify the three best clusters, but the e-values likely will be larger than those identified for clusters containing the sequence.
An image of the cluster is presented on its Explore pages: an isofunctional cluster likely will be "spherical", with each node/sequence connected to all other nodes with an edge so the value of CR will approach 1.0. Functional homogeneity within each cluster also can be assessed by inspection of the Conserved Cys Residue table as well as the genome neighborhood diagrams (GNDs) for the UniRef90 node IDs and the UniProt IDs.
An Explore page is provided for each SSN cluster in RadicalSAM.org.
Convergence ratio (CR): The CR is the ratio of the number of sequence pairs with edge alignment score values (derived from BLAST e-values/bit scores) ≥ the minimum alignment score threshold used to generate the SSN to the total number of sequence pairs. The value of CR ranges from 1.0 for sequences that are very similar ("identical") to 0.0 for sequences that are unrelated at the specified alignment score.
At small values of the alignment score, the value of CR for a cluster can be ~1.0 even if the cluster is heterofunctional (e-values are large; pairwise sequence similarity is small). However, at larger values of the alignment score, clusters with CR values approaching 1.0 are likely to be isofunctional (e-values are small; pairwise sequence identity is large).
Isofunctional clusters, as judged by shared genome context in the GNDs, often have CR values that approach 1.0. However, the values of CR for isofunctional/orthologous clusters that contain sequences from phylogenetically diverse species can decrease with increasing alignment score, because the sequence divergence between orthologues in different phylogenetic groups causes the CR to decrease. In such situations, as the alignment score increases, an isofunctional cluster with a low value for CR can segregate into smaller clusters for different phylogenetic groups with CR values that approach 1.0. Thus, values of CR that approach 1.0 are not required for isofunctionality; inspection of the GNDs allows that assessment.
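The definition of CR given above can be written as a short calculation: the number of sequence pairs connected by an edge at the chosen threshold, divided by the total number of pairs, n(n-1)/2. The sketch below is an illustration only; the function `convergence_ratio`, the node names, and the scores are hypothetical, and pairs without a recorded score are assumed to have no edge.

```python
from itertools import combinations

def convergence_ratio(node_ids, scores, threshold):
    """Fraction of sequence pairs whose alignment score meets the threshold.

    scores maps frozenset({id_a, id_b}) to the pairwise alignment score;
    pairs absent from the mapping are treated as having no edge.
    """
    pairs = list(combinations(node_ids, 2))
    if not pairs:
        return 0.0
    edges = sum(
        1 for a, b in pairs
        if scores.get(frozenset((a, b)), 0) >= threshold
    )
    return edges / len(pairs)

# Hypothetical cluster: a tight triad plus one divergent member.
ids = ["A", "B", "C", "D"]
scores = {
    frozenset(("A", "B")): 150,
    frozenset(("A", "C")): 150,
    frozenset(("B", "C")): 150,
    frozenset(("A", "D")): 80,
    frozenset(("B", "D")): 80,
}

cr_60 = convergence_ratio(ids, scores, 60)    # 5 of 6 pairs connected
cr_100 = convergence_ratio(ids, scores, 100)  # 3 of 6 pairs connected
print(cr_60, cr_100)
```

Raising the threshold lowers the CR of this mixed cluster; once the divergent member segregates, the remaining triad alone would have a CR of 1.0.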
Conserved Cys Residues: A list is provided of the number of Conserved Cys Residues as a function of percent conservation in the MSA. Recall that, by definition, members of the RSS share a Cx3Cx2C motif for SAM-binding. Conserved Cys residues in excess of 3 may be associated with additional FeS clusters, e.g., members of the SFLD subgroup 17, SPASM/Twitch domain.
The sequences in the cluster are not edited prior to construction of the MSA, so some will be truncated (even if their Sequence Status is "Complete"). Also, if the alignment score threshold is less than that required for isofunctionality, the sequences in the MSA will be heterofunctional and, therefore, heterogeneous in length. As a result, the Number of Conserved Cys Residues can be expected to be a function of percent conservation, with the most abundant conserved Cys motifs represented at large values of percent conservation and conserved Cys motifs in less abundant sequences/functions represented at lower values of percent conservation. With the caveat that the sequences in the cluster cannot be expected to be uniform in length, this summary can be used to evaluate sequence and function heterogeneity.
The Conserved Cys Residues can/should be used together with the MSA, CR, length histograms, and GNDs in identifying isofunctional clusters.
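The dependence of the number of conserved Cys residues on percent conservation can be illustrated with a minimal sketch. It assumes the MSA is held as a list of equal-length aligned strings and that percent conservation is computed over all sequences, gaps included; both are assumptions, as are the function name `conserved_cys_counts` and the toy alignment.

```python
def conserved_cys_counts(msa, thresholds=(90, 80, 70, 60, 50, 40, 30, 20, 10)):
    """Count MSA columns in which Cys occurs in at least the given
    percentage of sequences. msa is a list of equal-length aligned strings."""
    n_seqs = len(msa)
    cys_percent = [
        100.0 * sum(1 for seq in msa if seq[col] == "C") / n_seqs
        for col in range(len(msa[0]))
    ]
    return {t: sum(1 for p in cys_percent if p >= t) for t in thresholds}

# Toy alignment (hypothetical): every sequence carries a Cx3Cx2C motif,
# and two of the five sequences carry one additional Cys.
msa = [
    "MCAAACAACG",
    "MCGAACSACG",
    "MCAGACAACC",
    "MCAAACTACC",
    "MCSAACAAC-",
]
counts = conserved_cys_counts(msa)
print(counts)
```

Here the three motif Cys residues are counted at every threshold, whereas the additional Cys appears only at 40% conservation and below, which is the pattern expected for a cluster that mixes two kinds of sequences.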
Consensus Cys Residues: A text file ("Consensus residue percentage summary table") is available for download. In the MSA (SSN cluster number in column 1), the positions of Cys residues identified at 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10% conservation (column 2) are identified in columns 6 and greater. The number in column 3 is the number of Cys residues conserved at that percent conservation; columns 4 and 5 provide the number of IDs in the cluster. The canonical Cx3Cx2C is easily identified; additional conserved Cys residues may provide ligands to auxiliary Fe-S centers.
MSAs and WebLogos: The MSA for all UniRef node IDs is generated using MUSCLE. The WebLogo calculated from the MSA is displayed. The conserved Cys residues are colored in red to allow easy identification of the canonical SAM-binding Cx3Cx2C motif as well as other conserved Cys residues. Their height is proportional to percent conservation; their intensity is determined by residue occupancy in the MSA.
The MSA is available for download and can be viewed with Jalview (https://www.jalview.org/). The WebLogo is available as a png file.
HMMs: The HMM for the cluster generated from the MSA using hmmbuild is provided. It can be viewed as an image generated by Skylign (https://skylign.org/). The image can be downloaded as a png file. The text file for the HMM is available for download and use for identifying sequences in the UniProt database using hmmsearch (https://www.ebi.ac.uk/Tools/hmmer/search/hmmsearch).
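A downloaded HMM can also be used locally with HMMER, for example `hmmsearch --tblout hits.tbl cluster.hmm sequences.fasta` (the file names are hypothetical). The sketch below parses the resulting table, relying on the standard tblout layout in which the first field is the target name and the fifth and sixth fields are the full-sequence E-value and bit score. The function `parse_tblout`, the cutoff, and the sample rows are illustrative assumptions; the sample rows are truncated to the leading fields.

```python
def parse_tblout(text, max_evalue=1e-10):
    """Return (target, evalue, score) for hits in hmmsearch --tblout output,
    keeping those whose full-sequence E-value is at most max_evalue."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        target, evalue, score = fields[0], float(fields[4]), float(fields[5])
        if evalue <= max_evalue:
            hits.append((target, evalue, score))
    return sorted(hits, key=lambda hit: hit[1])

# Hypothetical, truncated tblout content.
sample = """\
# target name  accession  query name  accession  E-value  score  bias
seqB           -          cluster     -          3.0e-25  90.2   0.3
seqA           -          cluster     -          1.2e-80  270.1  0.1
seqC           -          cluster     -          0.002    12.4   0.0
"""
parsed = parse_tblout(sample)
print(parsed)
```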
Genome Neighborhood Diagrams (GNDs): The Genome Neighborhood Diagrams button provides genome neighborhood diagrams (GNDs) for the node IDs in each cluster. These are displayed using the GND Explorer used by the EFI-GNT tool. The GNDs provide information about both functional heterogeneity (one or several genome neighborhoods; one or more functions) and possible metabolic pathways (Pfam/InterPro families of proximal genes).
For UniRef50 SSN clusters, the default GND display is UniRef50 node IDs in the cluster; for UniRef90 SSN clusters, the default display is UniRef90 node IDs in the cluster.
The GNDs for the UniRef90 IDs in each UniRef50 node are available by clicking the "+" link adjacent to each UniRef50 GND.
The GNDs for the UniProt IDs in each UniRef90 node are available by clicking the "+" link adjacent to each UniRef90 GND.
Three Search functions are provided for locating clusters (with ≥3 UniRef IDs/nodes) using a UniProt ID, Sequence, or taxonomy as query.
Find by UniProt ID
The user inputs a UniProt ID to identify the cluster in which it is located. The search opens the Explore page for the cluster that contains the UniProt ID.
In some SSNs, the query UniProt ID may be located in a cluster with ≤2 nodes or represented as a singleton. The Search will report: "ID not found".
If the cluster is in Megacluster-1-1 (SFLD Subgroup 17, SPASM/Twitch domain), Megacluster-2-1 (SFLD Subgroup 5, B12-binding domain), Megacluster-2-2 (SFLD Subgroup 2, anaerobic coproporphyrinogen-III oxidase-like), or Cluster-7 (SFLD Subgroup 16, PLP-dependent), the clusters in the "diced" SSNs for the (mega)cluster (series of increasing alignment scores) are listed. The number of UniProt IDs, number of cluster nodes, and UniProt ID convergence ratio (CR; described on the Subgroups tab) are provided for each identified cluster. In the "AS Walk-Through" pop-up window, the identity of the cluster containing the UniProt ID is given next to the cluster name in the next SSN.
Find by Sequence
The user inputs a sequence (with/without a FASTA header) that is used to query the HMMs for all clusters (≥3 UniRef IDs/nodes) using hmmscan.
The Results page reports the three clusters with the smallest e-values. The cluster number is a link to the Explore page for the cluster.
If the cluster with the lowest e-value is in Megacluster-1-1 (SFLD Subgroup 17, SPASM/Twitch domain), Megacluster-2-1 (SFLD Subgroup 5, B12-binding domain), Megacluster-2-2 (SFLD Subgroup 2, anaerobic coproporphyrinogen-III oxidase-like), or Cluster-7 (SFLD Subgroup 16, PLP-dependent), the three clusters with the smallest e-values in each of the "diced" SSNs are listed. The number of UniProt IDs, number of cluster nodes, and UniProt ID convergence ratio (CR; described on the Subgroups tab) are provided for each identified cluster. In the "AS Walk-Through" pop-up window, the e-value for the best match to the input sequence is given next to the cluster name in the next SSN.
The "Exploring Subgroups" subtab of the Functional Diverse Subgroups tab provides additional information for exploring these functionally diverse (mega)clusters.
GND Lookup
Genome neighborhood diagrams (GNDs) are provided for SSN clusters using the EFI-GNT tool. They also are available on the Explore pages for clusters. However, users may find it convenient to access the GNDs for specific members of the RSS in RadicalSAM.org.
This search function allows the user to input a list of UniProt IDs; the GNDs for these are provided.
Find by Taxonomy
The user provides the genus/species/strain for an organism to identify the clusters in which its members are located.
If only the genus, or the genus and species, is entered, a pop-up list of matching genus-species-strains is provided for selection of the desired genus/species/strain.
The search provides a list of sequences in the RSS. For each sequence, the list provides the UniProt ID (link to the UniProt page for the sequence), UniProt description, organism name, UniProt annotation status (SwissProt or TrEMBL), and a link to its Explore page.
We would like RadicalSAM.org to be a community resource, with users providing current annotation information that will assist the community with selection of proteins for study and inform the sequence-function space that is used to infer functions.
SwissProt annotations are incomplete and sometimes vague or incorrect; experimentally verified annotations provided by the community are expected to be both more reliable and precise. On the Submit page, we provide a "Community Annotation Submission" form for users to submit experimentally determined functions as well as publications that document/describe these functions. After review, these will be included on the cluster pages using the "Pubs" and "Anno" buttons.
We have identified some proteins in the RadicalSAM superfamily that have already been annotated by the community. These are listed below.