Linking Legacy Cultural Heritage Data is not Enough for Representing ‘Knowledges’
Dominic Oldman
November 16, 2021
In 2009 the Andrew W. Mellon Foundation proposed to cultural heritage and academic organisations a concept called a ‘shared infrastructure’ to support digital projects. After funding various digital projects the Foundation realised that a large proportion of project time was spent setting up completely new digital infrastructure, which meant less time could be spent on the research and the subject matter of the project. At the end of these projects there was no means by which researchers could practically share their digital methods or findings in a meaningful way, or build on them with content from other projects.

Charles Bachman’s integrated database.
In any event, the form and structure of databases meant that data was narrow, applying the inappropriate language of data processing to human history and material culture. It excluded the possibility of wider structured narratives and instead reinforced and protected an established Western categorisation perspective. Digital projects maintained fragmentation and prevented the creation of a wider provenance of knowledge either within or across organisations. It became clear that these limitations, along with other factors, shut out knowledge from wider sources and therefore excluded other knowledges.
The shared infrastructure idea, which came from a collaboration within the Foundation, was presented to Museum Directors by Angelica Rudenstine, Alison Gilchrest and Ken Hamma. It was not about creating homogenised projects but about establishing a rich underlying platform on which different projects could develop their research independently, according to their own research questions and without deterministic technology, while still sharing intellectual digital outputs effectively and supporting community knowledge building. Rather than storing references, it would support the methods of research and everyday knowledge processes by addressing and interconnecting investigation, interpretation and presentation. Within a contextual framework supporting the full cycle and ongoing nature of knowledge generation, it would tackle the problem of throwaway digital project systems.
Sustainability depended on building quality and meaning into data, not on the information reduction and one-size-fits-all mentality fostered by traditional database systems. The Mellon Foundation recommended the use of Linked Data, but to do this effectively a project was needed to research and understand how RDF (Resource Description Framework – the language of Linked Data) could be used to support intellectual freedom, contextualisation and provenance, as well as knowledge integration and collaboration. The resulting project was ResearchSpace.
Linked Data is a flexible structured data format. It is used in cultural heritage (CH) and the humanities, but only a small proportion of people understand what it is and how it works, despite its presence on the fringes of many organisations for over a decade. While Linked Data holds great promise, CH and humanities Linked Data projects end up producing disappointing results. The lack of expertise and understanding means that these projects are guided by technology companies with no knowledge of CH processes or their knowledge systems. Domain knowledge is crucial to meaningful and effective Linked Data (or rather Semantic Web) implementations because its real benefits lie beyond the traditional database and its pigeonhole mindset. These benefits relate directly to representing knowledge processes, and the information that they produce, dynamically – the way people actually work and generate information – and not as a static reference product. A Linked Data project, or more accurately a Semantic Data project, is primarily not a technology project.

A ResearchSpace knowledge map (authoring tool) of relations and influences related to Dox Thrash.
As such, many institutions end up with results that fall well short of the hype and are left with a different version of the same old database and data – just with additional links. Approached as an IT solution, often using proprietary software and designed from an IT perspective, Linked Data becomes the complete opposite of what Semantic Web principles are about: domains constructing their own systems of knowledge representation for the Web to create an effective digital communication strategy with computer readable data. If Linked Data is applied to a particular problem as a business solution it will ultimately fail, and this is part of the reason why Linked Data and the Semantic Web have not made inroads in a sector where they should be having significant impact.
This means that the wider context (the information that provides the broader historical significance and relevance of, for example, an historical object) is not included in the data. These omissions are reinforced by the reductive structure and form of databases, which were designed for commercial data processing and not for the multi-layered and multi-causal nature of human history.
Institutional collection data is based on a documentation perspective derived from a Western modernist tradition which lacks, by design, appropriate contextualisation. It strips out important societal information in the name of an unsubstantiated neutrality. This is not acceptable in other mediums, like textual and visual narrative, but data is granted an exemption, presumably because of its perceived lack of value, one-dimensional characteristics, and throwaway quality, despite its ubiquity and its underpinning of digital information services and social media. At its inception it was never designed for public engagement, yet these secondary indexes have been promoted to primary public information sources online, divorced from the subject experts who had previously supplied at least some of the conceptual references (context). Simply connecting this data using Linked Data (linking the pigeonholes) doesn’t come close to addressing these omissions.

ResearchSpace contextual search based on Dox Thrash.
The real power of Linked Data, with the appropriate scientific and semantic frameworks, is to empower subject and community experts: to transfer their knowledge into a Web environment with the ability to expand structured data and create meaningful knowledge bases that represent wider sources of knowledge and different vantage points. It provides the richer structures and forms missing from databases, which are necessary to transform how data is authored and presented, and allows data narratives (stories in data) that are inclusive but can be empirically validated.
The power of Linked Data is not in the technical implementation of perfunctory data connections by technology companies that use the rhetoric of Tim Berners-Lee but deliver something very different. This is evident after more than 20 years of the Web of Data without any significant progress, or progressiveness, in the diversity of institutional data. It is in the ability of subject and community experts (in the widest sense) to take control of the modelling of data from database administrators and programmers and to incorporate a wider scope of meaningful inter-dependencies. This needs the development of appropriate user tools built with that design vision.
This objective, running contrary to the agenda of technology companies, has been a far harder challenge and has required the efforts of many different people from different backgrounds over many years to create the essential framework for meaningful representation of knowledge that can be shared, reused, and continually built upon by the community. However, as soon as domain experts create innovation based on their own knowledge systems, technologists undermine it through a combination of their position of technical power, a failure to provide user tools, a lack of understanding and motivation, and, increasingly, the protection of their commercial imperatives and funding. The challenge is to change the current environment and mindsets so that decades of this effort are not undermined.
CH organisations are complicated environments, but they are underpinned by fundamental knowledge processes which, if aligned, could provide a significant ‘intellectual infrastructure’ on which to base a digital strategy relevant to more audiences. CH organisations are not resourced to invest in expensive and hard to maintain custom database integrations. In any event these expensive IT integrations perform relatively simplistic data processing and, applied to complex historical data patterns, would not work anyway. Instead, Cultural Heritage organisations have purchased a mixed bag of proprietary information systems which have resulted in greater fragmentation and impeded organisation-wide innovation, despite the availability of world leading experts and unique historical information. Linked Data, encapsulated within the appropriate knowledge representation framework, allows organisations to effectively and efficiently reunite organisational processes separated by function-based database systems and approaches: an engine for wider engagement and collaboration.

User designed observation system in ResearchSpace.
CH Linked Data has mostly been used as a flexible medium for publishing and linking existing documentation data and information. This misses the full potential of the Semantic Web, which provides an environment for expanding and representing new sources and categories of knowledge. These publishing solutions can never address the back office issues which maintain old processes, systems and standards. This back office problem has forced work into detached digital projects, avoiding the need to reform internal systems and undermining any sustainability, which can only come from a full alignment of processes. While new projects address diversity and inclusion using different methods, collection online systems stand out as a regressive medium storing and perpetuating historical bias.
For example, the Provisional Semantics project, part of the TaNC (Towards a National Collection) programme, has identified the gap between the many museum projects reviewing the provenance of their collection objects and their collection online systems. This gap illustrates the way that collection systems have become entrenched: while their user interfaces are periodically refreshed, their content remains fixed, except for the addition of some more text. The structures don’t support it. Suggestions that museums simply find other online databases which present data missing from their own online offering and link their legacy data to these (for example, academic projects addressing social or environmental issues) ignore both the temporality of these projects (an inherent problem of academic projects) and the existence of significant internal institutional knowledge, often published in obscure articles, that already addresses these issues but is missing from online data.
ResearchSpace is designed to tackle these problems. It provides a dynamic semantic environment that enables organisations to represent new knowledge free from the constraints of databases, without the need for expensive software re-coding and data modelling. Significance and relevance change over time, and the ability to respond to this through everyday knowledge processes, linking existing back office knowledge processes directly to public engagement and collaboration, creates a responsive organisation within an effective and efficient system. In practice a new piece of information, even if it represents a completely new category or source of knowledge, can be added to the system, extending the knowledge base and immediately becoming part of a fully provenanced, structured, searchable and computer readable object, with semantic correspondence with other data. It can express its provenance using links to empirical evidence as well as express levels of uncertainty using a formal language. This is in contrast to existing systems in which new historical information of great significance has to be added as a paragraph in a free text field.
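As a purely illustrative sketch, such a new statement might be expressed as CIDOC CRM based Linked Data that is typed, linked to its supporting evidence, and immediately queryable. The identifiers, the object and the source document below are hypothetical, and ResearchSpace’s own modelling may differ in detail:

@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix crm:    <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix object: <http://example.org/id/object/> .
@prefix person: <http://example.org/id/person/> .
@prefix doc:    <http://example.org/id/document/> .

# A newly recorded production event for an existing object (hypothetical data),
# typed against the ontology rather than appended as free text.
object:print_42 crm:P108i_was_produced_by object:print_42_production .
object:print_42_production a crm:E12_Production ;
    rdfs:label "Production of print 42" ;
    crm:P14_carried_out_by person:dox_thrash .

# The new assertion is linked to the empirical evidence that supports it.
object:print_42_production crm:P70i_is_documented_in doc:conservation_report_1998 .
doc:conservation_report_1998 a crm:E31_Document ;
    rdfs:label "Conservation report, 1998 (hypothetical source)" .

Degrees of belief or uncertainty could be layered onto such statements with an argumentation extension such as CRMinf, rather than being buried in free text.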
For more historical background about the issues discussed see The root of the problem – Ontology, Epistemology and the problem with Vocabulary.
A note on Linked Data
Linked Data works by assigning Web addresses, the ones that we commonly use for Web pages, to structured data. Structured data in databases consists of values that have no semantics, only data types (strings, numbers, etc.) that technically validate them. In the world of Linked Data, information can also be typed against an entity or class of thing. For example, a person will have a web address and be identified explicitly as a person. An event may also have a web address and be typed as an event! The link that connects the two will also have a Web address, which is associated with a semantic relation. For example:
<http://www.researchspace.org/id/person/joebloggs> (joe bloggs)
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> (is of the type)
<http://www.cidoc-crm.org/cidoc-crm/E21_Person> (Person)
Using prefixes, these addresses can be shortened. For example, http://www.researchspace.org/id/person/ can be shortened to person: and so on. This then becomes:
person:joebloggs rdf:type crm:E21_Person .
and then perhaps
event:christmas_lecture rdf:type crm:E5_Event .
person:joebloggs crm:P11i_participated_in event:christmas_lecture .
Friendly labels then provide a less technical view: Joe Bloggs participated in Christmas lecture.
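Put together as a small, self-contained Turtle sketch (the joebloggs and christmas_lecture identifiers above are of course made up, and the event namespace is assumed here), the prefix declarations, types, relation and friendly labels look like this:

@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix crm:    <http://www.cidoc-crm.org/cidoc-crm/> .
@prefix person: <http://www.researchspace.org/id/person/> .
@prefix event:  <http://www.researchspace.org/id/event/> .

# The entities are explicitly typed against the CIDOC CRM ontology.
person:joebloggs rdf:type crm:E21_Person ;
    rdfs:label "Joe Bloggs" .

event:christmas_lecture rdf:type crm:E5_Event ;
    rdfs:label "Christmas lecture" .

# The relation itself is also a Web-addressable, semantic property.
person:joebloggs crm:P11i_participated_in event:christmas_lecture .

Any Linked Data tool can then render the rdfs:label values to produce the human-readable statement above, while the underlying triples remain computer readable.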
In a database there are no such semantics. We infer meaning through experience and training, helped by a user interface on which the data depends for interpretation. Meaning is implicit, and because the data depends on this interpretation it is necessarily simplistic; the data itself cannot be reused and shared because the inferences are often dependent on local conventions. Software layers provide the logic and interpretation, which means that applications are inflexible and expensive to change. Their development scope is limited to particular functions despite the reality of cross-functional processes.
We all use institutional applications where the burden of serving them is not matched by their benefit to our evolving work, and therefore most of our knowledge is spread across documents and spreadsheets, the spreadsheet being by far the most used structured data tool. In the example above, we are explicit about what things are and about the logic and semantics of their associations. This means that there is less dependency on software, and the data is ‘computer readable’. It can evolve and grow, unlike the traditional database. Theoretically this means more flexible, intelligent, richer and cheaper applications – yet Linked Data consultancy, trapped within a niche bubble, is still expensive, leading to too many funded pilots but very few permanent institutional platforms.
The problem outlined above is apparent across different CH groups and parts of the sector. In the Linked Data world, technologists have simply taken the old standards and vocabularies used in existing database systems and converted them into Linked Data schemas, thereby transferring the problem of integration into a supposedly integrating environment. Vocabularies are important epistemological tools but are not a mechanism for data integration, because they are naturally aligned with different vantage points. This is why an ICOM (International Council of Museums) documentation group has spent the last 20 years developing an integrating ontology. The CIDOC CRM, developed by the CRM Special Interest Group, is an ontology, which means that it is based on “being”: the nature of reality as it scientifically exists, based on empirical evidence, as a universal way of integrating projects regardless of their epistemological objectives and vocabulary needs. Most ontologies used in Linked Data are in fact vocabulary based (they are really schemas) and therefore do not solve the problem of sharing and reuse.