The root of the problem – Ontology, Epistemology and the problem with Vocabulary




November 24, 2021


Are vocabularies the right way to achieve integration, sharing and reuse in cultural heritage?

Within cultural heritage, vocabularies are seen as the main way to integrate data across organisations, despite the fact that different organisations use different vocabularies to support their different historical perspectives. In the past, ‘union’ systems projects have meant agreeing on a common vocabulary and thereby removing those key distinctions.

In Linked Data, integration through intermediate vocabularies that act as a single reference point is still the technical norm. But here too, vocabulary terms that are divorced from the context in which they are used, and for which there may be no exact match so that approximations are required, are not necessarily appropriate. Ultimately, organisations use vocabularies that depend on their context and support their particular epistemology.

As a result CH professionals in the 1990s started to question whether vocabularies were the right way to achieve integration, and ultimately sharing and reuse.

Also in the 1990s, computer scientists provided a clear definition of something called Knowledge Representation. It was based on two aspects. Firstly, it required ‘ontological commitment’: it takes surrogates of things, both material and immaterial, from the real world, i.e.,

there must be some form of correspondence specified between the surrogate and its intended referent in the world; the correspondence is the semantics for the representation.

Knowledge representation in cultural heritage

Knowledge representation relies on making a connection between “problem solving expertise” and “the domain” in which this problem solving takes place. The key element was the latter: a solid framework of reality, or at least of a part of it that was relatively clear and universal, which formed the basis for this new approach. This marked a clear departure from terminology, instead creating general points of reference from universally and scientifically understood concepts. This correspondence is crucial because each knowledge project operates in its own epistemological bubble and cannot be fully integrated, shared, and reused using its own particular language.

Despite this understanding within computer science, ontological commitment is rarely developed as part of knowledge engineering, even though many computer scientists turned their attention to projects involving knowledge outside their own domain, interested in how computers could ‘reason’ with different types of domain data. A professional split occurred because developing an ontological model of a domain of interest was a significant undertaking, involving collaboration with many different domain subject experts who could validate the ontology through empirical means.

Computer scientists more interested in Artificial Intelligence reasoning than in sharing and reuse started distancing themselves from domain ontologies:

We prefer to use the term “schema” rather than “ontology” to stress the fact that domain theory is the product of knowledge engineering and thus does not necessarily describe an inherent structure in the domain (as the word “ontology” would suggest).

The lack of commitment to help construct empirical domain ontologies, as opposed to schemas (see below), has created a serious problem for the development of sophisticated intellectual digital tools that are meaningful, shareable and reusable in these domains and that do not reinforce existing issues. There is now a legacy of integration projects in cultural heritage and the humanities, costing many millions across different currencies over the last 30 years, with little progress in representing the interconnected knowledge held by community and subject experts, not by documentation systems.

Creating information without ontological references results in low sustainability and efficiency. Vocabularies are epistemological, not ontological. They are abstractions of the mind, and terminologies, particularly in History and Cultural Heritage, are not sustainably reconcilable; in many cases it is not beneficial to use them this way. Even if a universal taxonomy is produced as an intermediate means of integration, this tends towards artificiality, lack of diversity, reductive datasets and ongoing data management issues, limiting the type of complex data representation that would otherwise make computers more useful to knowledge workers and scholars. This has been evidenced by many European projects. Because we live in what seems like a networked world, it is not readily apparent to many of us that computers are not already doing this, but delve deeper and the quality and extent of these connections are exposed as lacking.

Knowledge representation is concerned with capturing ‘knowledge’, rather than the basic fields of intrinsic information that databases produce. Knowledge representation models represent more sophisticated patterns of information not available in databases, which is one reason for the limited scope of use and lack of sustainability of databases outside core organisational functions. Databases have no semantics and cannot represent the relations that are inherent in the real world. The real digital network hubs are not the servers, infrastructures, vocabularies, fixed models or functions, but carefully constructed, scientifically founded ontologies. Historians of all types have a commitment to uncovering the reality of history using empirical methods that support interpretative thinking, but their textual narratives lack the ability to create a synthesis of different vantage points at different levels of generality.
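The contrast can be sketched in a few lines of code. A database row stores opaque field values; a knowledge representation records explicit, machine-traversable relations. This is a minimal illustration in plain Python, with tuples standing in for semantic statements; the records, identifiers and property names are hypothetical, not drawn from any real collection:

```python
# A typical flat database record: field values with no stated semantics.
# The column names are local conventions; a machine cannot know that
# "maker" denotes a production relation to a person.
row = {"id": "OBJ-001", "title": "Portrait of a Lady",
       "maker": "A. Painter", "date": "1890"}

# The same information as explicit subject-predicate-object statements.
# Each relation is a first-class, queryable fact, not a column convention.
triples = [
    ("obj:001", "was_produced_by", "event:prod-001"),
    ("event:prod-001", "carried_out_by", "person:a-painter"),
    ("event:prod-001", "took_place_in", "timespan:1890"),
]

def objects_of(subject, predicate, graph):
    # Traverse relations generically, without knowing any table schema.
    return [o for s, p, o in graph if s == subject and p == predicate]

production = objects_of("obj:001", "was_produced_by", triples)[0]
makers = objects_of(production, "carried_out_by", triples)
# makers -> ["person:a-painter"]
```

The point of the second form is that the relations themselves carry meaning, so any system can follow them; the first form only means something to software written against that particular database.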

Parallel to this ongoing issue in computer science, cultural heritage institutions adopted computer database systems at a relatively early stage (the first conference on Cultural Heritage and Computing was in 1968) and realised that computers and networks could be used to unite knowledge across different institutions and create a totality that was much greater than the sum of its parts. One of the main benefits of a cultural heritage sector is (or should be) to maintain knowledge about human history which is interconnected – not separate and fragmented – which explains the constant investment of money into this problem.

However, databases are designed to solve discrete problems and, unlike knowledge representation approaches, do not incorporate semantics. Different museums, archives and galleries invested in new computer technology and software. It gradually became clear that even if they used the same software, the content, structure, conventions and terminologies would be different. Over a period of 30 years, without reference to the debate on knowledge representation in computer science, many cultural heritage professionals, often not subject experts, decided that the only way to integrate cultural heritage data systems was to create standard sets of vocabularies to which all museums adhered. However, computer scientists, whether they followed the ontological road or deliberately concentrated on reasoning with epistemological data without an interest in wider networks of connected knowledge, already knew that epistemological integration was practically impossible. There is ample evidence for this in project archives such as the EU project database CORDIS.

The vocabulary approach was flawed because it again relied on artificial concepts. Over the last 30 years large amounts of public investment into many different systems and technologies (all based on databases or database mindsets) using artificial modelling and Western vocabulary integration has provided no great innovation. The SKOS standard explicitly says:

The fundamental element of the SKOS vocabulary is the concept. Concepts are the units of thought …—ideas, meanings, or (categories of) objects and events—which underlie many knowledge organisation systems… As such, concepts exist in the mind as abstract entities which are independent of the terms used to label them.

Vocabulary, however useful, and whatever the schema being used, is not a sustainable mechanism for data integration and reuse in the humanities. There is no feasible way to orchestrate an integration of ‘concepts of the mind’ across a complex sector dealing with historical information, because no such agreement exists within the community of subject experts. The greater the attempt to standardise cultural heritage terminology, the greater the level of reduction and simplification, which undermines any meaningful long-term use and therefore sustainability. The experience of standardised vocabulary in libraries, where bibliographic information is part of a commercial publishing ecosystem, has had a devastating effect on the development of knowledge aids across publication material. There is constant criticism of the usefulness of these schemas to library users: they force libraries to provide user interfaces that have no way of navigating the books they catalogue through meaningful relations and instead rely on basic metadata, something that could easily be added by the users (readers) of the library themselves. Yet attempts continue in the rest of the Cultural Heritage sector to follow this model.
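The SKOS definition quoted above makes the problem concrete: concepts are identified independently of their labels, so two vocabularies can only be connected by human-asserted mapping statements, and SKOS itself offers only approximate mapping properties (`skos:closeMatch`, `skos:narrowMatch`, and so on). A minimal sketch, with hypothetical institutions and concept identifiers:

```python
# Two institutions' SKOS-style concepts. The concepts are abstract
# "units of thought" identified by URI, independent of their labels,
# so matching on label strings cannot establish equivalence of meaning.
museum_a = {"vocabA:paintings": {"skos:prefLabel": "Paintings"}}
museum_b = {"vocabB:oil-painting": {"skos:prefLabel": "Oil painting"}}

# Integration requires explicit, curated mapping assertions. SKOS
# deliberately provides only approximate options, which discard the
# distinctions each institution's terminology was designed to carry.
mappings = [
    ("vocabA:paintings", "skos:narrowMatch", "vocabB:oil-painting"),
]
# Every pair of vocabularies in the sector would need its own set of
# such mappings, maintained indefinitely as both vocabularies evolve.
```

The combinatorial cost is the crux: mapping n vocabularies pairwise needs on the order of n² curated mapping sets, each of which reduces or approximates meaning.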

In today’s use of Linked Data, the word ‘ontology’ is used widely, but most computer ontologies are not ontologies (as originally defined) at all, despite the conflation by computer scientists. Under this qualification they are schemas that use specific epistemological constructions without any empirically validated base. In other words, these schemas are artificial constructs just as database models are, but using semantic structures. Knowledge of the real world is not confined to pure empirical evidence, and appearance and experience can be both concealing and revealing, requiring a combination of matter and mind; but this means that more care and caution is required in representing this type of information in a computer ontology, which, of course, has its own specific limitations in supporting reuse and sharing. However, connections between the ontological and the epistemological can be supported by ontologies like the CIDOC CRM.

In the cultural heritage world the same mindset of the knowledge engineers who didn’t want to go to the efforts of building a domain ontology is apparent from computer scientists assuming a leading role in the construction of cultural heritage information systems. For example, at the JISC Discovery Summit in 2013 the computer scientist on the expert panel stated this:

…developers are impatient and just want to get access to the data and do interesting things, and on the other side of the equation we have curators reasonably concerned about how that data is going to be used or misinterpreted or used incorrectly. I think that this is actually a difficult area because the conceptual reference models are generally more popular with the curators than with the developers […It is not] clear to me … how we solve the problem of engaging the people who want to do the development…through that mechanism, but nonetheless as this great experiment that we are living through at the moment with opening up data and seeing what can be done […] unfolds, if we find that the risks are starting to become too great and the value is so poor because the data is being misused or used incorrectly or inappropriately, if that risk is a risk to society in general and not just to the curators…then we are going to have to find those kind of solutions.

JISC, 2013

This is 10 years after computer scientists had already understood the risk, and after numerous projects that realised it. The statement underlines the difference in agenda between computer science and the other domains in which computer scientists assert reasoning and AI experiments, or simply play around with data under the premise that this will produce a magical transformation of data with existing flaws in both content and structure. The computer scientists (the original knowledge engineers) were right: the construction of a domain ontology based on mature scientific concepts and empirical evidence, particularly in the humanities and cultural heritage, is a massive undertaking. However, while computer scientists have not been willing to make such a collaborative commitment, not helped by academic project culture, the Cultural Heritage sector itself has delivered this work through the CIDOC CRM!

The CIDOC Conceptual Reference Model (CRM)

The CIDOC CRM, driven with unprecedented commitment and motivation by Dr Martin Doerr and others on the CRM Special Interest Group, is now a mature and comprehensive empirical domain ontology developed within ICOM (the International Council of Museums). It is one of the most important and significant pieces of work within cultural heritage and beyond, supporting the progressive use of computers in the humanities. While its uptake increases year on year, and it is used in many cultural heritage institutions both large and small, its full significance is not understood by organisational leaders or, ironically, by computer scientists, despite its adherence to the original principles of Knowledge Representation. It provides the opportunity to address the significant issues with existing documentation, including issues of diversity and inclusiveness.

The whole point of the CIDOC CRM is not to assert a particular epistemological position – but to provide an ontological framework for supporting, evaluating and integrating different ‘knowledges’. The computer industry and computer science have sought to impose a particular way of doing computing, asserting universal programming languages that are far from universally beneficial or neutral and that reflect a particular dominant cultural and economic position (not least in serving the limitations of the QWERTY keyboard). This computing paradigm also determines data forms and structures that could be used by anyone, but which were in reality designed for business, not for human history and society. However, domain and community experts can take back control.

It should now be unacceptable for computer scientists to work in the cultural heritage domain without referencing an empirical domain ontology. Its avoidance, often because computer scientists (and CH professionals) now don’t want to take the time to understand the ontology, means that new projects can only end in fragmentation, inadequate reductions and obsolescence. By not employing an ontology constructed according to the principles that computer scientists themselves set down for proper domain knowledge representation in consultation with domain experts, scarce resources are not being applied effectively to represent and relate the full diversity and richness of cultural heritage. We are still trying to build a publication and search system for legacy data, rather than delivering systems that work with people.

ResearchSpace has grown out of the CIDOC CRM. By adhering to the principles of knowledge representation, incorporating both ontological rigour and domain expertise, systems can now be created that provide greater intellectual value combined with effective and efficient operation – the two aspects are mutually reinforcing. The cultural heritage sector has achieved, out of necessity, what computer scientists have not: a workable and high-quality domain model of reality. However, while cultural heritage professionals have always worked on the basis of empirical evidence and conceptual references, their experience of computer systems has been based on a limited top-down model. ResearchSpace and the CIDOC CRM enable a new approach in which professionals design systems according to their own systems of knowledge – which is what knowledge representation is meant to address.
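The CRM's event-centric approach can be sketched briefly. The class and property codes below (E8 Acquisition, P22/P23/P24, P4, P7) are from the CIDOC CRM; the identifiers and data are hypothetical, and plain Python tuples stand in for RDF triples. Instead of each museum holding its own flat provenance field, the acquisition is modelled as an event that multiple datasets can describe and extend:

```python
# Event-centric, CRM-style statements: the acquisition event is the hub
# that connects the object, the parties, the time and the place.
graph = [
    ("acq:1902-07", "rdf:type", "crm:E8_Acquisition"),
    ("acq:1902-07", "crm:P24_transferred_title_of", "obj:vase-17"),
    ("acq:1902-07", "crm:P23_transferred_title_from", "person:dealer-x"),
    ("acq:1902-07", "crm:P22_transferred_title_to", "org:museum-a"),
    ("acq:1902-07", "crm:P4_has_time-span", "timespan:1902"),
]

# A second dataset can add its own perspective on the same event
# without either side rewriting the other's records.
graph.append(("acq:1902-07", "crm:P7_took_place_at", "place:paris"))

# Queries then work across both contributions.
events_at_paris = [s for s, p, o in graph
                   if p == "crm:P7_took_place_at" and o == "place:paris"]
```

The design choice worth noting is that neither contributor had to agree a shared vocabulary of object terms; they only had to agree that they were describing the same event.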

In summary

  1. The process of selection is inherent in both textual narratives and data modelling and both have subjective influences.
  2. Data is not transparent because it lacks semantics and context and is infrastructurally reductive, which itself can harbour bias and lack of diversity. It mostly resides in the background, hidden from critique.
  3. Text is not transparent because it is hard to make systematic comparisons with different interpretations using different rhetorical devices, even if it is linked to empirical evidence.
  4. Databases use artificial models and vocabularies making them difficult to integrate without further reduction.
  5. Text is detached from, and doesn’t fully represent, the investigative process hiding the process behind the thinking.
  6. Data is detached from, and doesn’t represent the interpretive process, and therefore generally lacks sustainability beyond organisational function. Organisational data systems based on function are detached from each other.
  7. Text is a dominant and accessible form of communication, but it is hard to create any accessible provenance of knowledge and to understand transformation and change.
  8. Data is less accessible but increasingly underpins many digital communication infrastructures. It has no significant history built into it reducing its general social value.

Suggested solutions

  1. Peg data to an ontology framework not a vocabulary framework.
  2. Provide formal semantics in data which can be read by computers and humans.
  3. Provide the ability for data to include interpretation and explanation aligning it with investigation and empirical evidence.
  4. Use contexts as an integrating method not vocabularies.
  5. Use contextual integration as a way of analysing different vantage points with the help of computers to evaluate similarity and difference.
  6. Incorporate space and time concepts to support the analysis of social transformation over time.
  7. Link this contextual data to textual narratives.
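Points 4 and 5 above can be sketched in code: records from different datasets are connected through shared space-time contexts rather than through reconciled terminology. The datasets, place and period identifiers below are hypothetical illustrations:

```python
# Two datasets use different, unreconciled terms ("amphora" vs
# "storage jar"), but both anchor their records to contexts of
# place and period. Integration happens through those contexts,
# not by forcing the vocabularies into a common standard.
dataset_a = [{"object": "amphora",
              "place": "place:knossos", "period": "period:LM-I"}]
dataset_b = [{"object": "storage jar",
              "place": "place:knossos", "period": "period:LM-I"}]

def same_context(r1, r2):
    # Records are connected if they share the same place and period.
    return r1["place"] == r2["place"] and r1["period"] == r2["period"]

connected = [(a["object"], b["object"])
             for a in dataset_a for b in dataset_b if same_context(a, b)]
# connected -> [("amphora", "storage jar")]
```

Each institution's terms survive intact, and the computer can still surface the records side by side for a human to evaluate similarity and difference – the analytical use described in point 5.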


Randall Davis, Howard Shrobe, Peter Szolovits (1993), What Is a Knowledge Representation? AI Magazine, 14(1).

Schreiber, G., Wielinga, B.J., & Breuker, J. (1993), KADS: A Principled Approach to Knowledge-Based System Development.
