PDB-Dev

Prototype Archiving System for Integrative Structures

User guide

1. Understanding the PDB-Dev Validation Report
2. Overview
- 2.1 Overall Quality Assessment
3. Model Details
4. Data Quality Assessment
5. Model Quality Assessment
- 5.1a Excluded Volume Analysis
- 5.1b MolProbity Analysis
6. Fit to Data Used for Modeling Assessment
- 6.1 SAS: χ² Goodness of Fit Assessment
- 6.2 SAS: Cormap Analysis
7. Fit to Data Used for Validation Assessment
8. Understanding the Summary Table
9. References for the Validation Report

1. Understanding the PDB-Dev Validation Report

This validation report was created based on the guidelines and recommendations from the IHM Task Force (Berman et al. 2019). The first version of the PDB-Dev validation report consists of the following four categories:

1.1. Model composition: This section outlines model details and includes information on the ensembles deposited, chains and residues of domains, model representation, software, protocol, and methods used. All deposited structures have this section.

1.2. Data quality assessment: Data quality assessments are currently available only for small-angle scattering (SAS) datasets. This section was developed in collaboration with the SASBDB community. For details on the metrics, guidelines, and recommendations used, refer to the 2017 community article (Trewhella et al. 2017). All experimental datasets used to build the model are listed; however, validation criteria for other types of experimental data are still under development.

1.3. Model quality assessment: Model quality for models at atomic resolution is assessed using MolProbity (Williams et al. 2018), consistent with PDB. Model quality for coarse-grained or multi-resolution structures is assessed by computing excluded volume satisfaction based on the reported distances and sizes of beads in the structures.

1.4. Fit to data used to build the model: Fit to data used to build the model is currently available only for SAS datasets. This section was developed in collaboration with the SASBDB (Valentini et al. 2015). For details on the metrics, guidelines, and recommendations used, refer to the 2017 community article (Trewhella et al. 2017). All experimental datasets used to build the model are listed; however, validation criteria for other types of experimental data are still under development.

A fifth category, fit to data used to validate the model, is under development.

2. Overview

2.1 Overall Quality Assessment: This is a set of plots that represent a snapshot view of the validation results. There are four tabs, one for each validation criterion: (i) model quality, (ii) data quality, (iii) fit to data used for modeling, and (iv) fit to data used for validation.

2.1.1. Model quality: For atomic structures, MolProbity is used for evaluation. We evaluate bond outliers, side chain outliers, clash score, rotamer satisfaction, and Ramachandran dihedral satisfaction (Williams et al. 2018). Details on MolProbity evaluation and tables can be found here. For coarse-grained structures of beads, we evaluate excluded volume satisfaction. An excluded volume violation, or overlap, between two beads occurs if the distance between the two beads is less than the sum of their radii (S. J. Kim et al. 2018). Excluded volume satisfaction is the percentage of pair distances in a structure that are not violated (higher values are better).
2.1.2. Data quality: Data quality assessments are only available for SAS datasets. The current plot displays radius of gyration (R_g) for each dataset used to build the model. R_g is obtained from both a P(r) analysis (see more here), and a Guinier analysis (see more here).
2.1.3. Fit to data used for modeling: Fit to data used for modeling assessments are available for SAS datasets. The current plot displays the χ² Goodness of Fit Assessment for SAS-model fits (see more here).
2.1.4. Fit to data used for validation: Fit to data used for validation is currently under development.

3. Model Details

3.1. Ensemble Information: Number of ensembles deposited, where each ensemble consists of two or more structures.

3.2. Summary: Summary of the structure, including number of models deposited, datasets used to build the models and information on model representation.

3.3. Entry Composition: Number of chains present in the integrative structure.

3.4. Datasets Used: Number and type of experimental datasets used to build the model.

3.5. Representation: Number and details on rigid and non-rigid elements of the structure.

3.6. Methods and Software: Methods, protocols, and software used to build the integrative structure.

4. Data Quality Assessment

4.1. SAS: Scattering Profiles: Data from solutions of biological macromolecules are presented as both log I(q) vs q and log I(q) vs log (q) based on SAS validation task force (SASvtf) recommendations (Trewhella et al. 2017). I(q) is the intensity (in arbitrary units) and q is the modulus of the scattering vector.

4.2. SAS: Experimental Estimates: Molecular weight (MW) and volume data are displayed. The true MW can be compared to the Porod estimate from scattering profiles, and the estimated volume can be compared to the Porod volume obtained from scattering profiles (Trewhella et al. 2017).

4.3. SAS: Flexibility Analysis: Flexibility of chains is assessed from Porod-Debye and Kratky plots. In a Porod-Debye plot, a clear plateau is observed for globular (partially or fully folded) domains, whereas fully unfolded domains are devoid of any discernible plateau. For details, refer to Figure 5 in Rambo and Tainer (2011). In a Kratky plot, a parabolic shape is observed for globular (partially or fully folded) domains and a hyperbolic shape is observed for fully unfolded domains.
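
As a rough illustration of how the two plots are built from a scattering profile, the sketch below computes the plotted coordinates. It assumes the common conventions of q²·I(q) versus q for the Kratky plot and q⁴·I(q) versus q⁴ for the Porod-Debye plot; the function names are hypothetical and are not part of the validation package.

```python
def kratky_points(q, intensity):
    """Return (x, y) pairs for a Kratky plot: x = q, y = q^2 * I(q)."""
    return [(qi, qi ** 2 * ii) for qi, ii in zip(q, intensity)]


def porod_debye_points(q, intensity):
    """Return (x, y) pairs for a Porod-Debye plot: x = q^4, y = q^4 * I(q)."""
    return [(qi ** 4, qi ** 4 * ii) for qi, ii in zip(q, intensity)]


# Usage sketch with made-up numbers (not real scattering data).
q = [0.1, 0.2, 0.3]
intensity = [10.0, 5.0, 1.0]
for x, y in kratky_points(q, intensity):
    print(f"q = {x:.2f}, q^2 I(q) = {y:.4f}")
```

The transformed points can then be passed to any plotting tool; a plateau in the Porod-Debye points at low q⁴ is the feature described above for folded domains.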

4.4. SAS: P(r) Analysis: P(r) represents the distribution of distances between all pairs of atoms within the particle, weighted by the respective electron densities (Moore 1980). P(r) is the Fourier transform of I(q) (and vice versa). R_g can be estimated by integrating the P(r) function. Agreement between the P(r) and Guinier-determined R_g (table below) is a good measure of the self-consistency of the SAS profile. R_g is a measure of the overall size of a macromolecule; e.g. a protein with a smaller R_g is more compact than a protein with a larger R_g, provided both have the same molecular weight (MW). The point where P(r) decays to zero is called D_max and represents the maximum size of the particle. The value of P(r) should be zero beyond r=D_max.
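
To make the integration step concrete, here is a minimal sketch that estimates R_g from a tabulated P(r) using the standard relation R_g² = ∫ r² P(r) dr / (2 ∫ P(r) dr), with simple trapezoidal integration. The function names are hypothetical, and the report itself may compute this value with different software and numerical settings.

```python
import math


def trapezoid(x, y):
    """Integrate y(x) over the tabulated points with the trapezoidal rule."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i])
               for i in range(len(x) - 1))


def rg_from_pr(r, p):
    """Estimate R_g from P(r): R_g^2 = int r^2 P(r) dr / (2 int P(r) dr)."""
    norm = trapezoid(r, p)
    if norm <= 0:
        raise ValueError("P(r) must have a positive integral")
    second_moment = trapezoid(r, [ri ** 2 * pi for ri, pi in zip(r, p)])
    return math.sqrt(second_moment / (2.0 * norm))


# Sanity check on the analytic P(r) of a uniform sphere of radius R,
# for which R_g = sqrt(3/5) * R and D_max = 2R.
R = 10.0
r = [i * 0.01 for i in range(2001)]
p = [ri ** 2 * (1 - 0.75 * ri / R + ri ** 3 / (16 * R ** 3)) for ri in r]
print(rg_from_pr(r, p), math.sqrt(0.6) * R)  # the two values should nearly agree
```

The same tabulated P(r) also gives D_max as the largest r at which P(r) is still non-zero.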

4.5. SAS: Guinier Analysis: Agreement between the P(r) and Guinier-determined R_g (table below) is a good measure of the self-consistency of the SAS profile. The linearity of the Guinier plot is a sensitive indicator of the quality of the experimental SAS data; a linear Guinier plot is a necessary but not sufficient demonstration that a solution contains monodisperse particles of the same size. Deviations from linearity usually point to strong interference effects, polydispersity of the samples, or improper background subtraction (Feigin and Svergun 1987). The residual plot and the coefficient of determination (R²) are used to assess the linear fit to the data. A perfect fit has an R² value of 1. Residual values should be evenly and randomly distributed around the horizontal axis.
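
The quantities discussed above can be sketched with an ordinary least-squares fit of the Guinier approximation, ln I(q) ≈ ln I(0) − (R_g²/3)·q². The code below is an illustrative sketch only: the function name is hypothetical, the points passed in are assumed to lie already within the Guinier region, and experimental errors are ignored (an unweighted fit).

```python
import math


def guinier_fit(q, intensity):
    """Fit ln I(q) = ln I(0) - (Rg^2 / 3) q^2 by unweighted least squares.

    Returns (rg, i0, r_squared, residuals). The caller is assumed to pass
    only low-q points inside the Guinier region.
    """
    x = [qi ** 2 for qi in q]
    y = [math.log(ii) for ii in intensity]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope >= 0:
        raise ValueError("non-negative slope: no physical Rg from these points")
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(res ** 2 for res in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return math.sqrt(-3.0 * slope), math.exp(intercept), r_squared, residuals


# Usage sketch on noise-free synthetic data generated with Rg = 20, I(0) = 100.
q = [0.005 * k for k in range(1, 13)]
intensity = [100.0 * math.exp(-(20.0 ** 2) * qi ** 2 / 3.0) for qi in q]
rg, i0, r2, residuals = guinier_fit(q, intensity)
print(rg, i0, r2)
```

With real data the residuals returned here are what a residual plot would show; systematic curvature in them is the kind of deviation from linearity described above.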

5. Model Quality Assessment

Excluded volume assessments are performed for coarse-grained structures and MolProbity analysis is performed for atomic structures.

5.1a. Excluded Volume Analysis: Excluded volume violation is defined as the percentage of overlaps between coarse-grained beads in a structure. This percentage is obtained by dividing the number of overlaps (violations) by the total number of pair distances in a structure. An overlap, or violation, between two beads occurs if the distance between the two beads is less than the sum of their radii (S. J. Kim et al. 2018).
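
The definition above can be sketched in a few lines. This is a minimal illustration, assuming each bead is given as an (x, y, z, radius) tuple and that every pair of beads is counted; the function name is hypothetical, and the actual validation code may treat some pairs differently (for example, beads connected in sequence).

```python
import itertools
import math


def excluded_volume_satisfaction(beads):
    """Return (violations, pairs, satisfaction_percent) for a list of beads.

    Each bead is an (x, y, z, radius) tuple. A pair is violated when the
    center-to-center distance is less than the sum of the two radii.
    """
    pairs = 0
    violations = 0
    for (x1, y1, z1, r1), (x2, y2, z2, r2) in itertools.combinations(beads, 2):
        pairs += 1
        if math.dist((x1, y1, z1), (x2, y2, z2)) < r1 + r2:
            violations += 1
    if pairs == 0:
        raise ValueError("at least two beads are required")
    satisfaction = 100.0 * (pairs - violations) / pairs
    return violations, pairs, satisfaction


# Usage sketch: three beads of radius 1; only the first two overlap.
beads = [(0.0, 0.0, 0.0, 1.0), (1.5, 0.0, 0.0, 1.0), (10.0, 0.0, 0.0, 1.0)]
print(excluded_volume_satisfaction(beads))
```

The violation percentage defined in this section is simply 100 minus the satisfaction percentage returned here.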

5.1b. MolProbity Analysis: The MolProbity analysis reported for atomic structures is consistent with PDB standards for X-ray structures (Williams et al. 2018). Summarized information is available in both the HTML and PDF reports. Detailed information is available for download as CSV files from both the HTML and the PDF reports. Please refer to the PDB user guide for details.

6. Fit to Data Used for Modeling Assessment

Recommendations from the SAS validation task force (SASvtf) for model fit assessment include:

All software, including version numbers, used for modelling; three-dimensional shape, bead or atomistic modelling.

All modelling assumptions clearly stated, including adjustable parameter values. In the case of imposed symmetry, especially in the case of shape models, comparison with results obtained in the absence of symmetry restraints.

For atomistic modelling, a description of how the starting models were obtained (e.g. crystal or NMR structure of a domain, homology model etc.), connectivity or distance restraints used and flexible regions specified and the basis for their selection.

Any additional experimental or bioinformatics-based evidence supporting modelling assumptions and therefore enabling modelling restraints or independent model validation.

For three-dimensional models, values for adjustable parameters, constant adjustments to intensity, χ² and associated p-values and a clear representation of the model fit to the experimental I(q) versus q including a residual plot that clearly identifies systematic deviations.

Analysis of the ambiguity and precision of models, e.g. based on cluster analysis of results from multiple independent optimizations of the model against the SAS profile or profiles, with examples of any distinct clusters in addition to any final averaged model.

6.1. SAS: χ² Goodness of Fit Assessment: Model fits displayed in this section are obtained from SASBDB. χ² values are a measure of the fit of the model to the data. A perfect fit has a χ² value of 1.0 (Trewhella et al. 2013; Schneidman-Duhovny, Kim, and Sali 2012; Rambo and Tainer 2013).
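
For orientation, the sketch below shows one common way such a value is computed: a reduced χ² in which the model profile is first scaled to the experimental profile by a single least-squares factor c. This form, the N − 1 normalization, and the function name are assumptions made for illustration; the fits shown in the report come from SASBDB and may have been computed with a different definition (for example, with an additional constant offset).

```python
def scaled_chi_squared(i_exp, sigma, i_model):
    """Return (reduced_chi2, scale) for a model profile against data.

    scale is the least-squares factor c minimizing
    sum(((i_exp - c * i_model) / sigma)^2); the sum is divided by N - 1.
    """
    if not (len(i_exp) == len(sigma) == len(i_model)) or len(i_exp) < 2:
        raise ValueError("need at least two points and equal-length inputs")
    num = sum(e * m / s ** 2 for e, m, s in zip(i_exp, i_model, sigma))
    den = sum(m * m / s ** 2 for m, s in zip(i_model, sigma))
    scale = num / den
    chi2 = sum(((e - scale * m) / s) ** 2
               for e, m, s in zip(i_exp, i_model, sigma))
    return chi2 / (len(i_exp) - 1), scale


# Usage sketch with made-up numbers: deviations of exactly one sigma.
chi2, scale = scaled_chi_squared([1.1, 0.9], [0.1, 0.1], [1.0, 1.0])
print(chi2, scale)
```

Values well above 1 suggest that the model does not explain the data within the stated errors, while values well below 1 often indicate overestimated errors or overfitting.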

6.2. SAS: Cormap Analysis: ATSAS datcmp (Manalastas-Cantos et al. 2021) was used for hypothesis testing, using the null hypothesis that all data sets (i.e. the fit and the data collected) are similar. The reported p-value is a measure of evidence against the null hypothesis; the smaller the value, the stronger the evidence that the null hypothesis should be rejected.
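
The idea behind this kind of test can be sketched as follows, under the assumption that a pairwise comparison reduces to finding the longest run of consecutive points where the data lie on the same side of the fit, and asking how likely a run at least that long would be in the same number of fair coin tosses. This is not datcmp's implementation; the function names are hypothetical and details such as the handling of ties may differ.

```python
def longest_same_sign_run(i_exp, i_fit):
    """Length of the longest run of consecutive same-sign differences.

    Points where data and fit are exactly equal break the run.
    """
    longest = current = previous = 0
    for e, f in zip(i_exp, i_fit):
        sign = (e > f) - (e < f)
        if sign != 0 and sign == previous:
            current += 1
        else:
            current = 1 if sign != 0 else 0
        previous = sign
        longest = max(longest, current)
    return longest


def prob_run_at_least(n, c):
    """P(longest run of identical outcomes >= c) in n fair coin tosses."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if c <= 1:
        return 1.0
    if c > n:
        return 0.0
    # counts[k - 1]: sequences whose runs are all shorter than c and whose
    # final run has length exactly k.
    counts = [0] * (c - 1)
    counts[0] = 2
    for _ in range(n - 1):
        counts = [sum(counts)] + counts[:-1]
    return 1.0 - sum(counts) / 2 ** n


# Usage sketch with made-up values.
run = longest_same_sign_run([1, 2, 3, 1, 0], [0, 1, 2, 2, 1])
print(run, prob_run_at_least(5, run))
```

A small probability means that long stretches of systematic deviation are unlikely to arise by chance, which is evidence against the null hypothesis that the fit and the data are similar.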

7. Fit to Data Used for Validation Assessment

This includes assessing model fit to data that was not used explicitly or implicitly in modeling. This section is currently under development.

8. Understanding the Summary Table

8.1. Entry composition: List of unique molecules that are present in the entry.

8.2. Datasets used for modeling: List of input experimental datasets used for modeling.

8.3. Representation: Representation of modeled structure.

8.3.1. Atomic structural coverage: Percentage of modeled structure or residues for which atomic structures are available. These structures can include X-ray, NMR, EM, and other comparative models.
8.3.2. Rigid bodies: A rigid body consists of multiple coarse-grained (CG) beads or atomic residues. In a rigid body, the beads (or residues) have their relative distances constrained during conformational sampling.
8.3.3. Flexible units: Flexible units consist of strings of beads that are restrained by the sequence connectivity.
8.3.4. Interface units: An automatic definition based on identified interface for each model. Applicable to models built with HADDOCK.
8.3.5. Resolution: An automatic definition based on identified interface for each model. Applicable to models built with HADDOCK.

8.4. Restraints: A set of restraints used to compute modeled structure.

8.4.1. Physical restraints: A list of restraints derived from physical principles to compute modeled structure.
8.4.2. Experimental information: A list of restraints derived from experimental datasets to compute modeled structure.

8.5. Validation: Assessment of models based on validation criteria set by the IHM Task Force (Sali et al. 2015 and Berman et al. 2019).

8.5.1. Sampling validation: Validation metrics used to assess sampling convergence for stochastic sampling. Sampling precision is defined as the largest allowed root-mean-square deviation (RMSD) between the cluster centroid and a model within any cluster in the finest clustering for which each sample contributes structures proportionally to its size (considering both the significance and magnitude of the difference) and for which a sufficient proportion of all structures occur in sufficiently large clusters (Viswanath et al. 2017).
8.5.2. Clustering algorithm: Clustering algorithm used to analyze resulting solution.
8.5.3. Clustering feature: Feature or reaction coordinate used to cluster the solution.
8.5.4. Number of ensembles: Number of solutions or ensembles of modeled structure.
8.5.5. Number of models in ensemble(s): Number of structures in the solution ensemble(s).
8.5.6. Model precision: Measurement of variation among the models in the ensemble upon a global least-squares superposition.
8.5.7. Data quality: Assessment of data on which modeled structures are based. See section 4 for more details.
8.5.8. Model quality: Assessment of modeled structures based on physical principles. See section 5 for more details.
8.5.9. Assessment of atomic segments: Assessment of atomic segments in the integrative structure. See section 5 for more details.
8.5.10. Excluded volume satisfaction: Assessment of excluded volume satisfaction of coarse-grained beads in the modeled structure. Excluded volume between two beads not connected in sequence is satisfied if the distance between them is greater than the sum of their radii. See section 5 for more details.
8.5.11. Fit to data used for modeling: Assessment of modeled structure based on data used for modeling. See section 6 for more details.
8.5.12. Fit to data used for validation: Assessment of modeled structure based on data not used for modeling. See section 7 for more details.

8.6. Methodology and software: List of methods on which modeled structures are based and software used to obtain structures.

8.6.1. Method name: Name(s) of method(s) used to generate modeled structures.
8.6.2. Method details: Details of method(s) used to generate modeled structures.
8.6.3. Software details: Software used to compute modeled structure, also includes scripts used to generate and analyze models.

9. References for the Validation Report

Berman, Helen M., Paul D. Adams, Alexandre A. Bonvin, Stephen K. Burley, Bridget Carragher, Wah Chiu, Frank DiMaio, et al. 2019. “Federating Structural Models and Data: Outcomes from A Workshop on Archiving Integrative Structures.” Structure 27 (12): 1745–59.

Manalastas-Cantos, Karen, Petr V. Konarev, Nelly R. Hajizadeh, Alexey G. Kikhney, Maxim V. Petoukhov, Dmitry S. Molodenskiy, Alejandro Panjkovich, et al. 2021. “ATSAS 3.0: Expanded Functionality and New Tools for Small-Angle Scattering Data Analysis.” Journal of Applied Crystallography 54 (Pt 1): 343–55.

Rambo, Robert P., and John A. Tainer. 2011. “Characterizing Flexible and Intrinsically Unstructured Biological Macromolecules by SAS Using the Porod-Debye Law.” Biopolymers 95 (8): 559–71.

Sali, Andrej, Helen M. Berman, Torsten Schwede, Jill Trewhella, Gerard Kleywegt, Stephen K. Burley, John Markley, et al. 2015. “Outcome of the First wwPDB Hybrid/Integrative Methods Task Force Workshop.” Structure 23 (7): 1156–67.

Trewhella, Jill, Anthony P. Duff, Dominique Durand, Frank Gabel, J. Mitchell Guss, Wayne A. Hendrickson, Greg L. Hura, et al. 2017. “2017 Publication Guidelines for Structural Modelling of Small-Angle Scattering Data from Biomolecules in Solution: An Update.” Acta Crystallographica. Section D, Structural Biology 73 (Pt 9): 710–28.

Valentini, Erica, Alexey G. Kikhney, Gianpietro Previtali, Cy M. Jeffries, and Dmitri I. Svergun. 2015. “SASBDB, a Repository for Biological Small-Angle Scattering Data.” Nucleic Acids Research 43 (Database issue): D357–63.

Viswanath, Shruthi, Ilan E. Chemmama, Peter Cimermancic, and Andrej Sali. 2017. “Assessing Exhaustiveness of Stochastic Sampling for Integrative Modeling of Macromolecular Structures.” Biophysical Journal 113 (11): 2344–53.

Williams, Christopher J., Jeffrey J. Headd, Nigel W. Moriarty, Michael G. Prisant, Lizbeth L. Videau, Lindsay N. Deis, Vishal Verma, et al. 2018. “MolProbity: More and Better Reference Data for Improved All-Atom Structure Validation.” Protein Science: A Publication of the Protein Society 27 (1): 293–315.

Integrative Modeling Validation Package: Version 1.2