The OCHRE Platform

An innovative database platform providing economies of scale via a comprehensive, multi-ontology knowledge graph

The Power of a Shared Platform

OCHRE (Online Cultural and Historical Research Environment) is a multi-media, multi-ontology, graph database system. It has been professionally engineered with a single software code base to achieve large economies of scale. An efficient two-tier architecture minimizes the cost of maintaining and upgrading the software while meeting the needs of diverse projects and publications in all fields of study. The lead developer of OCHRE is Sandra Schloen, an experienced software engineer who is the Director of Technology for CORPUS and Manager of the OCHRE Data Service.

OCHRE has been under continuous development for more than 20 years. It has been rigorously tested for a wide range of use cases in the humanities, the social sciences, and some natural sciences (astronomy, geophysics, paleontology, and population genetics). In principle, it could be used in any field of study.

OCHRE currently manages ca. 150 terabytes of data comprising 10 million indexed database items created for more than 100 different projects, publications, and collections. It has the capacity for thousands of projects and billions of database items.

Until now, users of OCHRE have been based mainly at the University of Chicago. The platform is being made more widely available to researchers elsewhere via CORPUS, whose staff provide user training, data migration, editing, and other support needed for managing and publishing research data.

Contact the CORPUS staff at corpus@uchicago.edu to arrange a consultation for your project.

Supporting All Stages of Research

There are five stages of computational work in a typical research project. OCHRE provides an intuitive user interface that makes it easy to move from one stage to the next without having to write any code, keep track of individual files, or transfer data from one piece of software to another. At all stages, the data remains under the control of the researchers who added it to the platform and can be viewed only by people they authorize.

1. Acquire the Data

The first stage is to acquire the data for the project. This is done by (1) automatically importing existing digital files, (2) fetching data dynamically from online data sources, (3) capturing the output of data-capture devices (e.g., digital cameras, 2D document scanners, and 3D laser scanners), or (4) manually keying in the data. All four can be done via the back-end user interface of the OCHRE database without having to write any software code.

2. Integrate the Data

The second stage is to integrate data that has been derived from disparate sources and stored in different digital formats. This is done by making the data conform to a coherent project taxonomy that regularizes the names of entities and the relations between them. CORPUS staff work with researchers to clean their data and model it within the OCHRE database while respecting the terminology and conceptual distinctions of each project and without forcing everyone into the same rigid mold.
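The regularization step described above can be illustrated with a minimal Python sketch. It is not how OCHRE itself works (integration there is done through the user interface with help from CORPUS staff), and the taxonomy, field names, and sample records are all hypothetical, invented only to show how variant terms from different sources can be mapped onto one project's preferred vocabulary.

```python
# Hypothetical project taxonomy: each preferred term lists the variants
# that appear in the project's source files.
TAXONOMY = {
    "pottery": {"pottery", "ceramic", "ceramics", "potsherd"},
    "bone": {"bone", "faunal remains", "animal bone"},
}

# Invert the taxonomy into a lookup from variant to preferred term.
VARIANT_TO_TERM = {
    variant: term for term, variants in TAXONOMY.items() for variant in variants
}


def regularize(record: dict) -> dict:
    """Return a copy of the record with 'material' mapped to the preferred term.

    Values not found in the taxonomy are kept unchanged and flagged for review.
    """
    raw = record["material"].strip().lower()
    term = VARIANT_TO_TERM.get(raw)
    return {
        **record,
        "material": term or record["material"],
        "needs_review": term is None,
    }


# Hypothetical records from two differently formatted sources.
spreadsheet_rows = [{"id": "A1", "material": "Ceramics"}]
legacy_db_rows = [
    {"id": "B7", "material": "animal bone"},
    {"id": "B8", "material": "slag"},
]

integrated = [regularize(r) for r in spreadsheet_rows + legacy_db_rows]
```

Because the taxonomy here belongs to a single project, another project could use an entirely different vocabulary without affecting this one. Unmatched values are flagged for review and are not forced into an existing category.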

3. Analyze the Data

The third stage is to explore and analyze the data by means of database queries, statistical analysis, geospatial mapping, network graphing, and other methods of pattern recognition and data visualization. AI capabilities are being added to OCHRE for image searching, text translation, and data visualization.

4. Publish the Data

The fourth stage is to publish the data on the Web along with the results of data analysis and visualization. The decision to publish and the choice of which data will be published are up to the project director. Data published by CORPUS as a formal peer-reviewed publication under the imprint of the University of Chicago will remain permanently accessible on the Web with persistent URL hyperlinks at the individual record level so that scholars and students can cite the publication reliably in their own work, just as they would cite a printed publication. This has been difficult to achieve in academic publishing but is possible via the OCHRE platform.

5. Preserve the Data

The fifth and final stage is to preserve the data, including both published data and unpublished data, so it can be re-used by future researchers. University libraries play an important role at this stage by curating digital data and migrating it to new storage media and formats as technologies change. OCHRE data is hosted on servers managed by the University of Chicago Library, which provides system administration and ensures data security. The data is preserved indefinitely in the Library Digital Repository, a state-of-the-art facility designed to mitigate the risk of data loss. The data is replicated nightly to a different University of Chicago data center with additional copies stored in the cloud. Copies of the data are stored in different geographical locations on different hardware and software and under different management so that the same catastrophic event cannot affect all copies.

Open and Freely Accessible

The OCHRE platform is open and accessible in the following ways:

Open Standards

OCHRE is entirely based on non-proprietary open standards published by the World Wide Web Consortium, namely, XML, XML Schema, XSLT, XQuery, HTML, CSS, RDF, and SPARQL, supplemented by JSON, which is an ISO standard, and IIIF, a set of image data standards published by a global consortium of research libraries.

Open Access

Data that researchers choose to publish via the OCHRE platform is made freely available on an open-access basis for scholarly purposes. Usage of the published data is subject to a license that stipulates non-commercial use with proper attribution to the creators of the data. There is no paywall for end users, though in some cases legal restrictions may require that access to published data be limited to registered users.

Open Source

The JavaScript Web app provided on the front end of the OCHRE platform is open source. The back end of the platform contains a mixture of open-source software combined with proprietary software for which there is no good open-source alternative. This is the normal practice when building high-performance enterprise-class database systems.

A word about open source: Open-source software is obviously desirable whenever it is available and sufficient for the task at hand, in order to minimize financial barriers to access that inhibit non-commercial academic use of the system by scholars and students. But very few people use only open-source software throughout the entire software stack. This is because open-source software that remains usable over the long term is not “free.” Someone has to be paid to maintain it and document it, so open source alone does not ensure accessibility. A vast amount of open-source software ends up orphaned and unusable, as we have seen over and over again in academic circles when a project’s funding runs out or its leaders retire, causing the website to go dark. This has led academic funding agencies to question whether it is the best use of their resources to pay for large numbers of boutique software applications that end up being unsustainable, whether open source or not. In the case of OCHRE, the cost of licensing proprietary software from commercial vendors is borne by the University of Chicago, in the first instance for the benefit of its own faculty’s research. This benefit is extended to researchers elsewhere at a minimal cost thanks to the economies of scale engendered by sharing a common database platform with a single code base.

History of Development

The OCHRE platform was designed by David Schloen and Sandra Schloen, each of whom has a degree in computer science from the University of Toronto. The richly featured Java application that powers the back end of the platform and provides a user interface for building content in the core database was written by Sandra Schloen, who oversees the technical implementation of all aspects of the platform.

The ontological structure of the OCHRE database in terms of hierarchically nested item-variable-value triple statements about basic classes of entities was conceived in the early 1990s and was implemented in a relational database under a different name (INFRA: Integrated Facility for Research in Archaeology), for which the software was also written by Sandra Schloen. This was done before XML and RDF were invented. After these data standards were introduced by the World Wide Web Consortium (W3C) in the late 1990s, they were used to implement the OCHRE ontology in the schema of a semistructured XML/XQuery database.
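As a rough illustration of the item-variable-value idea, the Python sketch below models hierarchically nested items that carry variable-value statements and flattens them into triples. The class, its field names, and the sample data are hypothetical and do not reproduce the actual OCHRE schema, which is implemented in XML.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Item:
    """A database item with variable-value statements and nested child items."""
    name: str
    properties: dict[str, str] = field(default_factory=dict)
    children: list["Item"] = field(default_factory=list)

    def triples(self, path: str = "") -> Iterator[tuple[str, str, str]]:
        """Yield (item path, variable, value) for this item and its descendants."""
        here = f"{path}/{self.name}" if path else self.name
        for variable, value in self.properties.items():
            yield (here, variable, value)
        for child in self.children:
            yield from child.triples(here)


# Hypothetical sample data: a site containing a trench containing a find.
site = Item("Site A", {"type": "settlement"}, [
    Item("Trench 1", {"area_m2": "25"}, [
        Item("Find 001", {"material": "ceramic", "period": "Iron Age"}),
    ]),
])

all_triples = list(site.triples())
```

Each triple records its item's position in the hierarchy as a path, so a statement about a nested item stays tied to its containing items.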

Development of the OCHRE database began as soon as XML 1.0 became a W3C recommendation in February 1998. The system became operational in 2001 using an early version of the XML Query language, even though XQuery 1.0 did not become an official W3C recommendation until January 2007. OCHRE has undergone continuous development and enhancement for more than 20 years and has been tested with more than 100 different use cases at the University of Chicago and elsewhere.

Acknowledgments

Testing and refining the OCHRE platform for use by scientific projects as well as the first stage of development of the front end of the platform were done with support from a $1.75-million Scientific Software Integration grant from the U.S. National Science Foundation’s Office of Advanced Cyberinfrastructure (PI: David Schloen; award no. 1450455).

Additional funding for research projects to use and test the OCHRE platform has come from the Neubauer Collegium for Culture and Society and the Institute for the Study of Ancient Cultures at the University of Chicago; the National Endowment for the Humanities; and the Social Sciences and Humanities Research Council of Canada.