OCHRE Integration

Organizing and integrating all kinds of data for efficient viewing, querying, and analysis on the Semantic Web

by J. David Schloen and Sandra R. Schloen (last updated December 2025)

Integrating All Kinds of Data

The OCHRE platform provides powerful mechanisms for organizing and integrating many different kinds of data, both within individual projects, publications, and collections and across them. Once the data has been integrated, researchers have a comprehensive view of the entities of interest and the relations among them.

OCHRE supports the full range of digital formats and data types, including texts, images, audio, video, and geospatial mapping data.

Alphanumeric Data

A project’s textual and numeric data can be automatically imported into the back-end core database from external source files, e.g., untagged plain text files, CSV files, Excel XLSX spreadsheet tables, Word DOCX documents, etc. The data in each source file is parsed out and atomized into many small keyed-and-indexed database items that are represented in the database as XML “documents” (see the OCHRE Database page of this website).

For example, every cell of a spreadsheet table and every word or character in a textual document would be atomized into its own database item. In the process, the many atomic items are interlinked to create a comprehensive knowledge graph with no redundancies or inconsistencies in the data. The resulting graph database can be queried to produce many different views of the data and is thus functionally equivalent to a highly normalized relational database.
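
The atomization step can be illustrated with a minimal Python sketch. It is not OCHRE's actual importer: the function name atomize_csv, the "item-N" identifier scheme, and the sample columns are all hypothetical, and the sketch assumes that one column of the table identifies the entity described by each row. Each row becomes a keyed item, and each non-empty cell becomes its own item-variable-value statement.

```python
import csv
import io
from itertools import count


def atomize_csv(text, key_column):
    """Split a CSV table into one keyed item per row and one
    (item, variable, value) statement per non-empty cell."""
    ids = count(1)
    items = {}    # hypothetical item id -> the row's original key
    triples = []  # (item id, variable, value)
    for row in csv.DictReader(io.StringIO(text)):
        item_id = f"item-{next(ids)}"
        items[item_id] = row[key_column]
        for variable, value in row.items():
            # Skip the key column itself and any empty cells.
            if variable != key_column and value != "":
                triples.append((item_id, variable, value))
    return items, triples


# Hypothetical sample table: two finds, one with a missing weight.
SAMPLE = "find,material,weight_g\nF1,ceramic,120\nF2,bronze,\n"
items, triples = atomize_csv(SAMPLE, "find")
for triple in triples:
    print(triple)
```

Because every cell is stored as a separate statement about a keyed item, the same statements can later be regrouped by item, by variable, or by value to produce different views of the data.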

Images and Other Resources

In contrast to a project’s alphanumeric data, its 2D images, 3D models, geospatial mapping data, PDF documents, and audio and video files are not stored directly in the core database but are accessed from resource-file servers (see the OCHRE platform diagram for the distinction between the core database and the resource servers within the back end of the platform).

Images and other resource files are catalogued with their metadata as Resource items in the core database but are stored elsewhere to be fetched as needed via an HTTP or FTP server. The URL of each resource file is stored within the corresponding Resource item in the core database (see the discussion of Resource items in the document on “Ontological Classes of OCHRE Database Items”).

Resource files are fetched dynamically and displayed seamlessly together with the relevant alphanumeric data, which is stored internally as XML in the core database. Any kind of database item can be linked to one or more Resource items. For example, a Spatial item representing a physical object can be linked to multiple Resource items corresponding to images of that object. A link can be made to the resource file as a whole or to a specific location within an image, map, audio or video clip, or PDF document.

On-Demand Retrieval from Other Databases

Researchers do not need to import all the data for their project into the OCHRE database if there are suitable online databases with REST APIs that contain relevant data. These can be used as live data sources from which data is fetched as needed and seamlessly displayed with the data stored in the OCHRE database.

In many fields of research there are online digital repositories in which data is curated and made accessible to the research community. OCHRE users can link to these data sources in a highly granular fashion, at the individual item level, although the granularity of the external data to which an OCHRE item can be linked will depend on the capabilities of the external database’s API.

For example, OCHRE can link to the Zotero database of bibliographic information at the level of an individual bibliographic entry for a book or article listed in a bibliography. For geospatial raster and vector data, OCHRE is linked to ArcGIS Online, which stores georeferenced maps and is used in conjunction with the ArcGIS Maps SDK to provide a powerful mapping and spatial analysis capability that is tightly integrated with individual OCHRE database items.

Most online digital repositories for academic research have a single, relatively simple schema. They are typically flat-file databases consisting of one or more tables with rows and columns, or perhaps a collection of tagged-text documents that conform to a document schema (e.g., the TEI schema) — or even untagged plain-text documents with no schema. In some cases, a repository may use a more sophisticated relational database schema that has been properly normalized. However, very few digital repositories can accommodate multiple heterogeneous schemas within a single queryable database, as OCHRE does.

Data Warehouses and Data Lakes

Sometimes a digital repository will have multiple schemas, but only because it is not a single queryable database and merely consists of multiple flat-file databases or tagged documents, whose original schemas it preserves unchanged. The term data lake is used for a multi-schema repository of this kind to distinguish it from a data warehouse. A data warehouse has a “global” schema that integrates the “local” schemas of the original data sets so it can be efficiently queried as a single database.

The OCHRE database is a data warehouse, not a data lake, because it integrates many local schemas based on different project ontologies within a single global schema that is based on an abstract foundational ontology. OCHRE thus has a very different design than, for example, the digital repositories maintained by the Archaeology Data Service in the U.K. and The Digital Archaeological Record (tDAR) at Arizona State University, which are data lakes, not data warehouses. These repositories accession and store idiosyncratically organized data contributed by many different researchers while preserving the original table schemas and document schemas.

Multi-schema repositories of this sort can be useful but they lack semantic integration and so have significant limitations. They are oriented toward searching and downloading separate data tables and documents, one by one. Unlike OCHRE, they cannot perform comprehensive automated querying that spans the data of many projects to retrieve record-level data in a highly granular way.

In contrast to both traditional single-schema databases and multi-schema data lakes, the OCHRE platform provides a highly integrated data warehouse in which the heterogeneous data and metadata of many different projects are atomized and organized within a single graph of knowledge. OCHRE does so by means of a global graph database schema that faithfully preserves the local ontologies inherent in the original data tables and documents. This process of atomization and recombination within a single database permits much more powerful forms of automated integration, querying, and analysis of data across projects.

Semantic Integration via Thesauruses

In addition to the challenge of integrating and keeping track of many different kinds of data derived from different sources, scholars face the challenge of semantic integration, i.e., retrieving similar things that have been described differently by different researchers. The ability to do this in an automated way is highly desirable in view of the large amount of data needed for many kinds of research, which would be cumbersome if not impossible to search by manual means.

However, the ontological diversity found in most fields of study hinders this effort. OCHRE helps researchers achieve cross-project semantic data integration in the face of this diversity by making it easy to construct semantic mappings from one ontology to another without requiring any additional coding. This is done in the back-end user interface of the platform by specifying thesaurus relations between the taxonomic variables and taxonomic values of different project ontologies (see the description of Variable items, Value items, and Taxonomic Hierarchy items in the document on “Ontological Classes of OCHRE Database Items”).

In this kind of ontology alignment, each pair of taxonomic terms is related semantically using one of the standard thesaurus relations: close match (synonym), broader term, narrower term, or related (associated) term. Thesaurus relations can also be established that link the terms used in project ontologies to external controlled vocabularies that provide a lingua franca between projects, as is discussed below. Once specified, thesaurus relations can be used in database queries to do automatic query expansion, retrieving semantically similar entities from many different projects at once. (More information about database queries can be found on the OCHRE Analytics page of this website.)
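
Query expansion over thesaurus relations can be sketched as follows. This is an illustrative model only, not OCHRE's implementation: the Thesaurus class, its method names, and the project-prefixed term strings are assumptions made for the example. The sketch stores each relation together with its inverse and expands a query term by following close-match and narrower-term links.

```python
from collections import defaultdict

CLOSE, BROADER, NARROWER, RELATED = "close", "broader", "narrower", "related"
INVERSE = {CLOSE: CLOSE, BROADER: NARROWER, NARROWER: BROADER, RELATED: RELATED}


class Thesaurus:
    """A named, attributed set of relations between taxonomic terms."""

    def __init__(self, name, author):
        self.name = name
        self.author = author
        self._relations = defaultdict(set)  # term -> {(relation, other term)}

    def relate(self, term, relation, other):
        """Record that `term` has `other` as its close/broader/narrower/
        related term, and record the inverse statement as well."""
        self._relations[term].add((relation, other))
        self._relations[other].add((INVERSE[relation], term))

    def expand(self, term, follow=(CLOSE, NARROWER)):
        """Return `term` plus every term reachable through the
        relation types listed in `follow`."""
        found, queue = {term}, [term]
        while queue:
            current = queue.pop()
            for relation, other in self._relations[current]:
                if relation in follow and other not in found:
                    found.add(other)
                    queue.append(other)
        return found


# Hypothetical terms from three project taxonomies.
thesaurus = Thesaurus("pottery-demo", "example author")
thesaurus.relate("projA:jar", BROADER, "projB:vessel")  # vessel is broader than jar
thesaurus.relate("projA:jar", CLOSE, "projC:jarre")     # jar and jarre are synonyms

print(sorted(thesaurus.expand("projB:vessel")))
# → ['projA:jar', 'projB:vessel', 'projC:jarre']
```

Following only close-match and narrower-term links by default is a design choice of this sketch: a query for "vessel" then also retrieves jars, while a query for "jar" is not widened to every kind of vessel.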

The Role of AI in Semantic Data Integration

Artificial intelligence in the form of statistical machine learning using large language models (LLMs) can facilitate semantic data integration by automatically generating thesaurus relations between taxonomic terms (variables and values). More generally, AI-enabled systems can search a corpus consisting of natural-language texts or more structured data sets to find linguistic expressions (words, phrases, or sentences) that semantically match another linguistic expression that serves as a “verbal prompt” for the search. The accuracy of this kind of AI-enabled semantic matching has improved greatly as LLMs have become larger and more capable.

AI-assisted mechanisms are currently being added to the OCHRE platform to do both matching and extraction. Starting with a linguistic expression (a verbal prompt or query expression), OCHRE will find other linguistic expressions in a corpus that semantically match the prompt, distinguishing close matches from less close but semantically related expressions. After the matching expressions have been found, they are extracted from the corpus so they can be preserved and re-used as reliable, curated data in the OCHRE database. Extraction is done by linking the Discourse items, Variable items, and Value items that represent these linguistic expressions to a Dictionary item that organizes all the matching linguistic expressions under a lemma (see the description of Discourse items, Variable items, Value items, and Dictionary items in the document on “Ontological Classes of OCHRE Database Items”).

AI can thus help to automate the process of constructing thesaurus relations between the taxonomic variables and values of different projects, which can then be used to expand queries, as described above. Thesaurus relations among taxonomic terms can be stored in Dictionary items. More generally, dictionaries of concepts and “common-place expressions” of the kind used in legal, literary, and historical studies can be created from large corpora of natural-language texts. The same mechanism can be used to assist the construction of standard philological dictionaries of individual words.

However, the possibility of generating false semantic matches due to AI hallucination remains a serious problem. AI-generated matches are not entirely trustworthy and need to be validated by a human expert. OCHRE makes it easy to inspect the proposed equivalences and either reject them or store them permanently as curated information in the OCHRE database. (See the discussion of AI and semi-automated semantic integration in the document on “The Theoretical Background of the OCHRE Ontology.”)

It is important to remember that any semantic mapping from one use of a natural human language to another, whether it is done manually by a human user or is generated automatically by an AI system, is a context-specific and time-bound product that is open to debate and disagreement. Different people often have different ways of understanding taxonomic terms and other linguistic expressions, and so produce different semantic mappings. Indeed, a thesaurus or a dictionary of any kind is itself a work of scholarship that should be attributed to its author, whether the author is a human being or a particular version of an AI system. The probabilistic nature of this kind of AI means that different AI systems will generate different results, and the same AI system may generate different results at different times.

Ultimately, the human users of a computer system remain responsible for the semantic equivalences they invoke, even when those equivalences are generated by AI. OCHRE allows the creation of multiple thesauruses and dictionaries (semantic mappings) that can be named and attributed to their authors, be they human or artificial. When executing OCHRE queries, users are asked whether they want to invoke a thesaurus for query expansion, and if so, which one. This ensures that the human users and not “the computer” remain responsible for the semantics of their investigations.

Controlled Vocabularies

Several classes of OCHRE database items can be linked via thesaurus relations to external controlled vocabularies such as WikiData and the Getty Vocabularies. This can be done for the following item classes: Agent, Spatial, Temporal, Text, Resource, Concept, Variable, and Value (see the document on “Ontological Classes of OCHRE Database Items”).

A database item in any of these eight classes can be semantically linked to one or more external terms or concepts in controlled vocabularies that have been published on the Web. If a SPARQL endpoint is available for the vocabulary, OCHRE generates a SPARQL query to find the URLs of published concepts that could be linked to a given database item.
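
The kind of SPARQL query involved might look like the one built by the Python sketch below. The queries OCHRE actually generates are not shown in this document, so this is only an assumption about their general shape: it looks up concepts by an exact rdfs:label match in a given language, whereas a particular vocabulary may expose its labels through other properties (skos:prefLabel, for instance). The function names are hypothetical, and the sketch only builds the query text without contacting any endpoint.

```python
import re

LANG_TAG = re.compile(r"^[A-Za-z]+(-[A-Za-z0-9]+)*$")


def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_label_query(label, lang="en", limit=10):
    """Build a SPARQL SELECT query that returns the URLs of concepts
    whose rdfs:label exactly equals `label` in language `lang`."""
    if not LANG_TAG.match(lang):
        raise ValueError(f"not a plausible language tag: {lang!r}")
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?concept WHERE {\n"
        f'  ?concept rdfs:label "{escape_literal(label)}"@{lang} .\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


print(build_label_query("amphora"))
```

The resulting text would be sent to the vocabulary's SPARQL endpoint, and the returned concept URLs offered to the user as candidates for linking.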

If desired, the external term can be substituted in the OCHRE user interface as the name of the item instead of using a project-defined name. This will often be appropriate in the case of a close semantic match, allowing projects to employ standard terms curated by reputable organizations in various domains of research, such as the Getty Research Institute in the domain of cultural heritage.

External semantic linkages to controlled vocabularies clarify the meaning of terms used by OCHRE projects and provide interoperability with other systems. They solve the problem of homographs (i.e., words that have the same written form but different meanings, such as “light” in weight versus “light” in color). And they allow OCHRE projects to employ any language, not just English, and translate their terms using standard terminologies.

More generally, semantic linkages to controlled vocabularies facilitate querying within the OCHRE environment across the data of multiple projects that use different taxonomies. If each OCHRE project links its terms to an external controlled vocabulary, an OCHRE database query can easily retrieve similar items that have been described differently by different projects.

Alternatively, an OCHRE project can borrow a taxonomy or even just one branch of a taxonomic hierarchy from another project, as long as the taxonomy has been made public for general use. This provides another way to achieve semantic integration among different research projects.

RDF Triples and the Semantic Web

The OCHRE platform is integrated with the Semantic Web. Data contained in the back-end core database can be exposed and archived using the Resource Description Framework (RDF) data format and the SPARQL querying language, which are the basis of the Semantic Web.

RDF represents knowledge in the form of subject-predicate-object “triples.” Each triple constitutes a statement of knowledge (a predication). In terms of mathematical graph theory, a collection of RDF triple-statements is a labeled, directed graph. A graph database consisting of RDF triples (a triplestore) can be queried using SPARQL.
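
As a rough illustration of these ideas, the Python sketch below represents a small graph as a list of subject-predicate-object tuples and answers a single triple pattern, which is the basic building block of a SPARQL query. The "ex:" identifiers and the match function are hypothetical; a real triplestore would use full IRIs, indexes, and a SPARQL engine.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard,
    much like a variable in a SPARQL triple pattern."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Each tuple is one directed edge: subject --predicate--> object.
GRAPH = [
    ("ex:pot1", "ex:madeOf", "ex:ceramic"),
    ("ex:pot1", "ex:foundIn", "ex:room3"),
    ("ex:pot2", "ex:madeOf", "ex:bronze"),
]

# Roughly "SELECT ?s ?o WHERE { ?s ex:madeOf ?o }".
for s, p, o in match(GRAPH, predicate="ex:madeOf"):
    print(s, o)
```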

RDF is useful for exporting OCHRE data in a standard format that preserves all the conceptual distinctions researchers have made when entering their data into the core database. RDF triples can be implemented in a number of different syntactical forms (e.g., in XML notation or Turtle notation) and do not depend on any particular software or operating system, so an RDF archive exported from the OCHRE database does not depend on the OCHRE software.

For example, in the realm of cultural heritage, OCHRE can export RDF triples that conform to the Europeana Data Model (EDM), a domain ontology for metadata concerning cultural materials of all kinds. More generally, RDF that conforms to any specification can be exported by mapping data from OCHRE’s highly abstract foundational ontology to another ontology. For a discussion of the foundational ontology of OCHRE in relation to domain ontologies like EDM, see the document on “The Theoretical Background of the OCHRE Ontology.”

OCHRE can easily generate RDF because it stores data in a structurally identical way, as subject-predicate-object triple statements about entities of interest, although in OCHRE these are called item-variable-value triples, a terminology and data model used by the designers of OCHRE long before the invention of RDF. Thus, OCHRE is fully compatible with the Semantic Web and the Linked Data approach to knowledge representation based on the Semantic Web standards.
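
The structural correspondence can be made concrete with a short Python sketch that serializes item-variable-value statements as RDF in the N-Triples syntax. It is a simplified illustration, not OCHRE's exporter: the base namespace, the URL layout, and the ItemRef marker for values that point to other items are assumptions, and identifiers are assumed to be safe to place in a URL unchanged.

```python
from typing import NamedTuple

BASE = "https://example.org/ochre/"  # hypothetical namespace


class ItemRef(NamedTuple):
    """Marks a value that refers to another item rather than a literal."""
    item_id: str


def escape(text):
    """Escape characters that N-Triples does not allow raw in literals."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(statements, base=BASE):
    """Serialize (item, variable, value) statements as N-Triples lines."""
    lines = []
    for item_id, variable, value in statements:
        subject = f"<{base}item/{item_id}>"
        predicate = f"<{base}variable/{variable}>"
        if isinstance(value, ItemRef):
            obj = f"<{base}item/{value.item_id}>"  # link to another item
        else:
            obj = f'"{escape(str(value))}"'        # plain literal
        lines.append(f"{subject} {predicate} {obj} .")
    return "\n".join(lines)


print(to_ntriples([
    ("pot1", "material", "ceramic"),
    ("pot1", "foundIn", ItemRef("room3")),
]))
```

The item supplies the subject, the variable supplies the predicate, and the value supplies the object, so no restructuring of the data is needed before export.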

Ontological Classes of OCHRE Database Items

Theoretical Background of the OCHRE Ontology