OCHRE Ontology

A foundational ontology that can represent many different domain-specific or project-specific ontologies

Dealing with Divergent Ontologies

The term “ontology” in this context denotes a formal specification of conceptual classes and relations. The term “schema” refers to the logical data structures in which an ontology is implemented in a working computer system. A hierarchical taxonomy is a common kind of ontology.

The OCHRE platform was initially developed for use in archaeology and philology. Within these disciplines one finds researchers collecting empirical evidence using the same tools and methods while classifying and describing the evidence in quite different ways. But ontological diversity is not just a characteristic of those particular disciplines. It is found in all fields of study in the humanities and social sciences and in varying degrees also in the natural sciences.

Computational tools for working with the vast body of scholarly knowledge now expressed in digital form must cope with the fact that such knowledge is being recorded using divergent taxonomies that reflect the conceptual distinctions favored in different research communities and perhaps also reflect the idiosyncratic views of individual scholars. No single ontology will be suitable for all purposes. There is an endless array of conceptual possibilities depending on the subject matter and the questions being asked, not to mention the linguistic traditions and historically situated perspectives of the people involved.

It is important to remember that ontological diversity is not a problem in itself. Indeed, it is inherent in the practice of research because different ontologies reflect different interpretive frameworks and research agendas; they are not just the result of sloppy thinking or individual quirks and egotism. Ontological diversity is not a vice to be eliminated, in a misguided attempt to standardize human ways of knowing, but rather a defining virtue of critically minded communities of thought that are open to multiple perspectives.

Standardization and Semantic Authority

The practices of digital knowledge representation that emerged in large governmental and business organizations suppress ontological diversity. This arises from the fact that these are hierarchical organizations with central semantic authorities that mandate standard ontologies to be used throughout the organization for reasons of efficiency and to maintain managerial control.

The vast majority of software development is done within and for governmental and business organizations, so it is not surprising that most software designers assume without question that ontological standards are necessary, even in academic settings. These standards are typically expressed as a single prescribed database schema for each predetermined class of data, or perhaps as a set of prescribed markup tags for texts of a given type.

Unfortunately, these diversity-suppressing digital practices have permeated academic research, even though most scholars lack a central semantic authority. It is true that, in the natural sciences, a standardized ontology will often emerge from a widely shared theory that all (or almost all) researchers accept. But this is not the case in fields of study in which there are competing theoretical frameworks that generate heterogeneous ontologies, which occurs not just in the humanities and social sciences but often enough also in the natural sciences. Rigid ontological prescriptions cause problems in such a situation because they force researchers to use terms and to make conceptual distinctions with which they may not agree.

On the other hand, allowing people to use their own ontologies inhibits the automated integration and comparison of data across projects, which would be of great practical benefit in many kinds of research. For this reason, a mechanism for automated querying that can span multiple divergent ontologies is highly desirable. What is needed is database software that does not suppress ontological diversity via forced standardization but instead embraces it, while also facilitating semantic data integration across ontological boundaries.

This can be done via the OCHRE platform by specifying thesaurus relations among the taxonomic terms found in different ontologies. In this kind of ontology alignment, each pair of taxonomic terms is related semantically using one of the standard thesaurus relations: close match (synonym), broader term, narrower term, or related term. Once specified, thesaurus relations can be used in database queries to do automatic query expansion, retrieving semantically related information from many different projects at once (see the OCHRE Integration page of this website for further details on creating thesauruses to achieve semantic integration).
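The query expansion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OCHRE's actual implementation: the project prefixes, term names, relation labels, and record structure are all invented, and expansion follows only one hop of thesaurus relations.

```python
# Hypothetical thesaurus linking terms from two project taxonomies.
# (a, "narrower", b) is read here as "b is a narrower term than a".
THESAURUS = [
    ("projA:Pottery", "close_match", "projB:Ceramics"),
    ("projA:Pottery", "narrower", "projB:Amphora"),
    ("projA:Pottery", "related", "projB:Kiln"),
]

# Invented records, each described with its own project's term.
RECORDS = [
    {"id": 1, "term": "projA:Pottery"},
    {"id": 2, "term": "projB:Ceramics"},
    {"id": 3, "term": "projB:Amphora"},
    {"id": 4, "term": "projB:Kiln"},
]

SYMMETRIC = ("close_match", "related")


def expand(term, relations):
    """Return the term plus every term linked to it by a chosen relation."""
    expanded = {term}
    for a, rel, b in THESAURUS:
        if rel not in relations:
            continue
        if a == term:
            expanded.add(b)
        elif b == term and rel in SYMMETRIC:
            expanded.add(a)
    return expanded


def query(term, relations=("close_match",)):
    """Retrieve record ids whose term matches the expanded set of terms."""
    terms = expand(term, relations)
    return [r["id"] for r in RECORDS if r["term"] in terms]
```

In this sketch the user chooses which relations to honor at query time, so `query("projA:Pottery")` follows only close matches, while passing `("close_match", "narrower")` also pulls in records described with narrower terms from the other project.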

A Pragmatist Hermeneutics

OCHRE was designed with semantic data integration in mind. It was engineered from the outset to respect the deeply rooted practices of semantic autonomy characteristic of modern academic research by directly modeling each scholar’s own terminology and conceptual distinctions. OCHRE avoids any attempt to impose a standardized ontology and take semantic authority (and responsibility) away from the individual researcher. And when executing a database query, each user can decide whether and how the taxonomic terms from different ontologies are to be semantically related, either by invoking the user’s own thesaurus or by invoking a trusted thesaurus constructed by someone else.

In this way, OCHRE upholds the pragmatist hermeneutical principle that the meaning of a linguistic expression is not fixed but depends on its use in context (see Perspectives on Pragmatism by Robert Brandom [Harvard, 2011]). Software designed for academic research should acknowledge this principle and support the scholarly demand to have the freedom to describe phenomena of interest in light of one’s own critical judgments, without being forced to conform to someone else’s ontology due to its being inscribed in the very structure of the computer system.

OCHRE achieves this goal by means of a foundational ontology defined in terms of quite general conceptual categories such as space, time, agency, and discourse (see the document on “Ontological Classes of OCHRE Database Items”). This ontology is implemented in the logical schema of a graph database that can model any domain-specific or project-specific ontology within a global schema. The OCHRE database can accommodate any number of local ontologies, preserving their conceptual distinctions, and can be queried efficiently as a single graph of knowledge.

To sum up: OCHRE is very flexible and customizable. It does not force researchers to conform to a predetermined recording system but lets them use their own terms of description. And it does so while providing powerful mechanisms for ingesting and integrating existing data; for querying and analyzing the data; and for publishing and archiving data in a standards-compliant fashion.

Harnessing the Power of Recursion

Archaeologists and art historians study the material traces of human cultures. Philologists and literary scholars study the historical development and interconnections of languages, literatures, and systems of writing. Linguists and philosophers study human linguistic capacities and the structure of language in general.

All these disciplines exhibit not just ontological diversity but also a high proportion of semistructured information. Such information is best represented digitally by open-ended hierarchies of recursively nested entities rather than by rigid tables that have one row for each entity and a predetermined column for each property of the entities represented in the table (see the OCHRE Database page of this website).

Organizing data in recursive hierarchies allows the use of powerful recursive programming techniques to search and analyze the hierarchies. There are reasons to think that recursion is a biologically innate feature of the human faculty of language and the basis of a universal grammar that underlies human conceptual capacities, as has been argued by Noam Chomsky and his many followers in linguistics and cognitive science (e.g., Steven Pinker at Harvard University and Ian Roberts at Cambridge University).

In any case, OCHRE’s hierarchical and recursive data model can intuitively and flexibly represent scholarly knowledge of all kinds. And it can do so without sacrificing the power of modern databases because this data model is implemented, not in an unconstrained web of knowledge that cannot be efficiently queried, but by means of highly atomized keyed-and-indexed data objects that conform to a predictable schema, thus enabling semantically rich and efficient queries.
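To make the combination of keyed data objects and recursive traversal concrete, here is a minimal Python sketch. The item keys, names, and the dictionary-based store are assumptions made for illustration; they do not reflect OCHRE's actual schema or storage format.

```python
# Hypothetical keyed store: each item is a small object looked up by its key
# and lists the keys of the items nested directly inside it.
ITEMS = {
    "site":    {"name": "Site", "children": ["area1", "area2"]},
    "area1":   {"name": "Area 1", "children": ["locus10"]},
    "area2":   {"name": "Area 2", "children": []},
    "locus10": {"name": "Locus 10", "children": ["find7"]},
    "find7":   {"name": "Find 7", "children": []},
}


def descendants(key):
    """Yield every key nested below `key`, depth-first, at any depth."""
    for child in ITEMS[key]["children"]:
        yield child
        yield from descendants(child)


def path_to(root, target):
    """Return the list of keys from root down to target, or None if absent."""
    if root == target:
        return [root]
    for child in ITEMS[root]["children"]:
        sub = path_to(child, target)
        if sub is not None:
            return [root] + sub
    return None
```

The same two functions work no matter how deep the hierarchy grows, which is the practical advantage of giving every level the same structure.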

Recursive Spatial and Temporal Hierarchies

Many research projects entail close attention to geographical and chronological variations in the phenomena being studied. Dealing with the spatial and temporal relations among entities requires mechanisms for representing not just absolute locations in space and time, in terms of numerical map coordinates and calendar dates, but also the relative placement of spatial objects or temporal events with respect to other spatial and temporal phenomena. In many cases, representing relative positions in space or time is more important than representing absolute locations.

This is best accomplished computationally by organizing spatial locations and objects and temporal periods and occurrences by means of open-ended hierarchies of recursively nested entities of the same kind — spatial or temporal, as the case may be — with the same structure at each level of the hierarchy regardless of scale. In OCHRE, these are called parthood hierarchies (see the discussion of Parthood Hierarchy items in the document on “Ontological Classes of OCHRE Database Items”) whereas non-recursive hierarchies are called grouping hierarchies.
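A parthood hierarchy of temporal periods might be sketched as follows. The `Period` class, the use of sibling order to record relative sequence, and the period names are assumptions chosen for illustration; the point is only that relative placement can be expressed without any absolute dates.

```python
class Period:
    """A temporal period that may contain sub-periods of the same kind."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.parts = []  # assumed convention: earlier sub-periods come first
        if parent is not None:
            parent.parts.append(self)

    def is_part_of(self, other):
        """Walk up the hierarchy to test containment at any depth."""
        node = self.parent
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


def precedes(a, b):
    """True if sibling a comes before sibling b within the same parent."""
    if a.parent is None or a.parent is not b.parent:
        raise ValueError("only siblings are compared in this sketch")
    parts = a.parent.parts
    return parts.index(a) < parts.index(b)


# Illustrative data only.
iron = Period("Iron Age")
iron1 = Period("Iron Age I", iron)
iron2 = Period("Iron Age II", iron)
iron2a = Period("Iron Age IIA", iron2)
```

A spatial hierarchy of sites, areas, and findspots could reuse the same structure, with containment standing in for spatial parthood.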

Recursive Textual and Linguistic Hierarchies

For research projects in which written texts are the object of study, OCHRE has sophisticated capabilities for representing texts written in any language and writing system, modern or ancient. These capabilities allow the many entities of interest within a single text and the relationships among many different texts to be integrated within a comprehensive graph of knowledge, i.e., a single graph database that can be queried to analyze and compare complex textual phenomena.

As with spatially situated units of observation and temporal periods and sub-periods, written texts and the linguistic discourse they convey are modeled by means of overlapping recursive hierarchies. A text, represented by a Text item in the back-end core database, can be broken down into its physical components (e.g., graphemes, characters, lines, pages, etc.), which are represented by a recursive hierarchy of Epigraphic items, and its discursively meaningful components (e.g., morphemes, words, phrases, clauses, sentences, and larger discourse units), which are represented by a recursive hierarchy of Discourse items. The epigraphic hierarchy of a text is cross-linked with its discourse hierarchy to represent the ways in which it may be read.

Epigraphic items at the character level can be related to Sign items that constitute the writing system used to inscribe the text. Discourse items at the word or phrase level can be related to Lexical items that constitute the lemmas in a dictionary of the language used in the text (see the document on “Ontological Classes of OCHRE Database Items” for descriptions of these item types).

Thus, the epigraphic and discursive dimensions of a text are carefully distinguished in OCHRE as separate recursive hierarchies that are connected by cross-hierarchy relations between Epigraphic items representing the physical marks of inscription and Discourse items representing the linguistic meanings of the Epigraphic items according to a particular reading of the text. This distinction between the epigraphic and discursive dimensions of a text is necessary for many kinds of scholarly analysis but is muddled in the Text Encoding Initiative (TEI) encoding scheme, for example.
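The following Python sketch illustrates why two cross-linked hierarchies are useful. All identifiers are invented and the dictionaries are a simplification, not OCHRE's data structures. A word that begins on one line and ends on the next belongs to a single phrase in the discourse hierarchy but to two lines in the epigraphic hierarchy, which a single nested hierarchy cannot express directly.

```python
# Epigraphic hierarchy: lines contain characters (physical marks).
epigraphic = {
    "line1": ["c1", "c2", "c3"],
    "line2": ["c4", "c5"],
}

# Discourse hierarchy: a phrase contains words (units of meaning).
discourse = {
    "phrase1": ["w1", "w2"],
}

# Cross-hierarchy links, stored per reading of the text:
# each word points to the characters that were read to produce it.
readings = {
    "reading_A": {"w1": ["c1", "c2"], "w2": ["c3", "c4", "c5"]},
}


def line_of(char):
    """Find the line that physically contains a character."""
    for line, chars in epigraphic.items():
        if char in chars:
            return line
    raise KeyError(char)


def lines_spanned(word, reading):
    """Return the lines touched by a word under a given reading, in order."""
    seen = []
    for ch in readings[reading][word]:
        ln = line_of(ch)
        if ln not in seen:
            seen.append(ln)
    return seen
```

Because the links are stored per reading, a second scholar's reading could map the same characters to different words without altering the epigraphic hierarchy.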

Recursive Taxonomic Hierarchies

In addition to spatial, temporal, textual, and linguistic hierarchies, OCHRE uses recursion in taxonomic hierarchies. Taxonomic variables and values are represented as database items in their own right and can be recursively nested to represent genus-species relations of semantic inheritance between more general descriptions (e.g., this object is made of metal) and more specific descriptions (this object is made of iron).
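As a rough illustration of genus-species inheritance, the sketch below nests taxonomic values and tests membership recursively, so that a search for metal objects also finds iron ones. The value names, object keys, and the parent-lookup table are assumptions for the example only.

```python
# Hypothetical taxonomy: each value points to its more general parent value.
PARENT = {
    "iron": "metal",
    "bronze": "metal",
    "metal": "material",
    "wood": "material",
}

# Invented objects, each described at whatever level of detail is known.
OBJECTS = {"obj1": "iron", "obj2": "wood", "obj3": "metal"}


def is_a(value, ancestor):
    """Recursively test whether value equals or is nested under ancestor."""
    if value == ancestor:
        return True
    parent = PARENT.get(value)
    return parent is not None and is_a(parent, ancestor)


def made_of(ancestor):
    """Return the keys of objects whose material falls under ancestor."""
    return sorted(k for k, v in OBJECTS.items() if is_a(v, ancestor))
```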

Taxonomic database items (variables and values) are linked to database items belonging to the other ontological classes (Spatial items, Temporal items, Agent items, etc.) to describe the properties of those items by means of item-variable-value triples, which are functionally equivalent to the subject-predicate-object triple statements of RDF (see the discussion of the Variable, Value, and Taxonomic Hierarchy item classes in the document on “Ontological Classes of OCHRE Database Items”).

Any number of properties can be attributed to any entity by linking items to variables and values. A named relation between two items can be specified using a relational variable, i.e., a relation is treated as a type of property. For textual and linguistic research, OCHRE uses taxonomic hierarchies to organize the variables and values used to describe the grammatical properties of Discourse items, i.e., parts of speech, conjugations of verbs, declensions of nouns, and so on.
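The item-variable-value pattern, including a relational variable whose value is another item, can be sketched like this. The identifiers and the list-of-tuples representation are assumptions for illustration rather than OCHRE's internal format.

```python
# Each property is an (item, variable, value) triple; all names are invented.
TRIPLES = [
    ("object12", "material", "iron"),
    ("object12", "length_cm", 14.5),
    ("object12", "found_in", "locus10"),  # relational variable: value is an item
    ("locus10", "type", "pit"),
]


def properties(item):
    """Collect all variable/value pairs attached to an item."""
    return {var: val for it, var, val in TRIPLES if it == item}


def related(item, variable):
    """Follow a relational variable from one item to the linked items."""
    return [val for it, var, val in TRIPLES if it == item and var == variable]
```

Adding a new property to an item means appending one more triple; no table column has to be defined in advance.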

Applicable to Any Field of Study

Large collaborative projects in archaeology and philology were the initial test cases for OCHRE and provide examples of its use. But the software methods developed to deal with the spatial, temporal, linguistic, and taxonomic complexity of archaeological and philological data are applicable to a much wider range of research. This is so because the OCHRE software is based on powerful conceptual abstractions expressed in an innovative graph database structure featuring overlapping recursive hierarchies of highly atomized entities.

Accordingly, OCHRE is now being used, not just in archaeology and philology but in a wide range of research in the humanities and social sciences, and also in branches of the natural sciences where spatial, temporal, and taxonomic relations are key concerns, such as population genetics (comparing ancient and modern DNA), paleontology, paleoclimatology, geophysics, and other kinds of environmental research.

Interweaving Texts, Writing Systems, and Dictionaries

In the OCHRE database, writing systems are represented separately from texts that use them to avoid confusion between the ideal signs of the writing system, as understood abstractly, and the physical instantiations of those signs in particular written texts. The epigraphic components of a text are linked at the character level to the signs of a writing system. The discursive components of a text are linked at the word level to lemmas in a dictionary of the language in which the text was written.

Signs and Allographs in Writing Systems

Writing systems are represented as groups of ideal signs, which are represented by Sign items in the database. The ideal signs are distinguished from the allographs of those signs, represented as Allograph items, which are actually inscribed as epigraphic components of empirical texts. Allographs are in turn distinguished from the particular discursive readings (e.g., phonetic values) of an allograph of a sign, which are represented by Reading items.
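One possible way to picture the sign, allograph, and reading distinction is with three small Python classes. The class names mirror the item types named above, but the fields, helper functions, and sample values are placeholders invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Reading:
    value: str  # e.g., a phonetic value assigned by a scholar


@dataclass
class Allograph:
    label: str
    readings: list = field(default_factory=list)


@dataclass
class Sign:
    name: str
    allographs: list = field(default_factory=list)


def readings_of(sign):
    """List every distinct reading value attested for any allograph of a sign."""
    return sorted({r.value for a in sign.allographs for r in a.readings})


def allographs_with(sign, value):
    """List the allographs of a sign that can carry a given reading value."""
    return [a.label for a in sign.allographs
            if any(r.value == value for r in a.readings)]


# Placeholder data, not drawn from any real writing system.
sign_x = Sign("SIGN_X", [
    Allograph("variant_1", [Reading("ba"), Reading("pa")]),
    Allograph("variant_2", [Reading("ba")]),
])
```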

These distinctions between texts, writing systems, signs, allographs, and readings are necessary when dealing with logographic and logosyllabic writing systems such as Mesopotamian cuneiform, Egyptian hieroglyphs, and (continuing into the modern period) Chinese and Japanese writing systems, whose signs have many possible phonetic reading values and allographic variants. And these distinctions are quite useful even when dealing with alphabetic writing systems, which often have allographic variations of scholarly interest across the texts in which they are instantiated.

Integrating Texts and Dictionaries

Finally, in addition to distinguishing scholarly analyses of the epigraphic hierarchy of a text from analyses of its discourse hierarchy, and also distinguishing the signs of a writing system from the epigraphic units in which these signs are instantiated, OCHRE represents the lexicon of each language or dialect as a separate set of ideal lexical units contained in dictionary lemmas. The lexical units of a language are instantiated by the discourse units of texts written in that language. A word-level discourse unit is normally linked to the epigraphic units that were read to produce it and also to the particular grammatical form of the word within a dictionary lemma.

This allows the software to compile automatically for each lemma all the grammatical forms of the word and all the orthographic and allographic variations in the spelling of each grammatical form of the word, together with textual citations of the use of each form in context generated automatically from the texts in which they appear. OCHRE can thus generate from its database a dictionary view that looks like an OED-style corpus-based dictionary, constructed dynamically from the underlying text editions with no error-prone duplication of information. Text editions are closely interwoven with dictionaries, on the one hand, and with analyses of writing systems, on the other, making it easy to explore computationally the entire web of connections of interest to philology.
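The automatic compilation of a dictionary view can be approximated by grouping word-level discourse units by lemma, grammatical form, and spelling. This is a simplified sketch: the field names, sample words, and citation strings are invented, and the real system derives these links from its database rather than from a flat list.

```python
from collections import defaultdict

# Invented word-level discourse units, each linked to a lemma and a form.
WORDS = [
    {"lemma": "go", "form": "past", "spelling": "went", "cite": "TextA 1:2"},
    {"lemma": "go", "form": "present", "spelling": "goes", "cite": "TextA 3:1"},
    {"lemma": "go", "form": "past", "spelling": "wente", "cite": "TextB 2:4"},
]


def build_dictionary(words):
    """Group attestations as lemma -> form -> spelling -> list of citations."""
    entries = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for w in words:
        entries[w["lemma"]][w["form"]][w["spelling"]].append(w["cite"])
    return entries


dictionary = build_dictionary(WORDS)
```

Because the view is rebuilt from the words themselves, correcting a reading in a text edition changes the dictionary entry the next time it is generated, with nothing to update by hand.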

Textual Variation and Critical Editions

The value of the OCHRE data model for textual studies is illustrated by the multi-project Critical Editions for Digital Analysis and Research (CEDAR) initiative at the University of Chicago. The CEDAR projects are producing online critical editions of a wide range of culturally influential or “canonical” texts — ancient, medieval, and modern — written in diverse languages and writing systems and transmitted over long periods in multiple copies and translations.

Textual variation in long-lasting textual traditions of this kind can be modeled computationally as a textual “space of possibilities” using OCHRE’s basic model of overlapping recursive hierarchies of entities with cross-hierarchy relations between entities in different hierarchies, which in this case are hierarchies of epigraphic units and discourse units. The ontology of textual phenomena implemented in OCHRE is sufficiently rich and comprehensive to capture the complex conceptual distinctions routinely made by textual scholars.

In contrast, most software for digital humanities conflates the epigraphic and discursive dimensions of a text. A text is typically represented by a single hierarchy of nested components, as in the TEI markup scheme. Unfortunately, this yields an inadequate digital representation of the complex overlapping structure of conceptual entities and relations scholars have in mind when constructing critical editions. For more on this, see the 2014 article in Digital Humanities Quarterly entitled “Beyond Gutenberg: Transcending the Document Paradigm in Digital Humanities” by David Schloen and Sandra Schloen.

CHOIR: A Comprehensive Hierarchical Ontology for Integrative Research

The back-end core database of the OCHRE platform integrates heterogeneous data that has been derived from multiple sources and has been recorded in accordance with divergent ontologies. Each OCHRE database item (i.e., an individual keyed-and-indexed data object represented by an XML document) belongs to one or another of just 20 basic ontological classes. These are described in the document on “Ontological Classes of OCHRE Database Items.”

An RDF network-graph specification of these ontological classes is currently being developed using the Web Ontology Language (OWL), one of the Semantic Web standards published by the World Wide Web Consortium (W3C). This will result in a formal specification of the OCHRE database structure in a standard format that is not dependent on the OCHRE software and XML document types.

The OWL version of the OCHRE ontology has a different name. It is called CHOIR (Comprehensive Hierarchical Ontology for Integrative Research) to distinguish it from the implementation of the same ontology within the OCHRE database schema. The OWL-CHOIR ontology specifies classes, sub-classes, and relations that correspond to those found in the OCHRE database.

This ontology specification will be released in the future with accompanying annotations and examples of RDF triples that conform to it. It will define the structure and meaning of sets of RDF triples exported from the OCHRE database as an independent lossless archive of the contents of that database.

OWL ontologies are often used in this way to specify the semantics of a set of RDF triples that conform to a given ontology. RDF triples represent subject-predicate-object statements of knowledge, so triples that conform to the OWL-CHOIR ontology specification are easily mapped onto the item-variable-value triple structure of the OCHRE database. An RDF archive (saved in a triplestore) that has been exported from the OCHRE core database will thus preserve the multi-dimensional graph structure and high degree of atomization within that database.
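The mapping from item-variable-value triples to RDF statements might look roughly like the sketch below, which writes simplified N-Triples lines by hand. The namespace is a placeholder, since the CHOIR vocabulary has not been released, and the identifiers are invented; a real export would use the published IRIs and typed literals.

```python
BASE = "http://example.org/choir/"  # placeholder namespace, not the real CHOIR IRI


def iri(local):
    """Wrap a local name as an IRI reference in N-Triples syntax."""
    return f"<{BASE}{local}>"


def literal(text):
    """Quote a plain string literal, escaping backslashes, quotes, newlines."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n"))
    return f'"{escaped}"'


def to_ntriples(triples, item_keys):
    """Serialize (item, variable, value) triples as N-Triples lines.

    Values that are keys of other items become IRIs (relations);
    everything else becomes a plain string literal.
    """
    lines = []
    for item, variable, value in triples:
        obj = iri(value) if value in item_keys else literal(str(value))
        lines.append(f"{iri(item)} {iri(variable)} {obj} .")
    return "\n".join(lines)
```

Each item maps to a subject, each variable to a predicate, and each value to an object, so the atomized structure of the source triples carries over one statement at a time.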

Ontological Classes of OCHRE Database Items

Theoretical Background of the OCHRE Ontology