Case Studies
STAR – Semantic Technologies for Archaeological Resources
STAR is an AHRC-funded project based in the Department of Archaeology at the University of Glamorgan. The project is a collaboration between the University of Glamorgan, English Heritage and the Royal School of Library and Information Science, Denmark. It runs from January 2007 to January 2010.
STAR aims to develop ways of widening access to the large body of archaeological material currently available on the Internet. Such materials range from structured databases and repositories of digitised materials to ‘grey literature’ (that is, unpublished reports documenting archaeological projects and investigations). The project is investigating the potential of semantic terminology tools for widening and improving access to these resources, exploring the possibilities of combining a high-level core ontology with domain thesauri and natural language processing techniques.
STAR builds on the outcomes of FACET (http://www.comp.glam.ac.uk/~FACET with links to a web demonstrator), a recently-completed EPSRC-funded research project that investigated the automatic expansion of faceted search queries, using the semantic relationships inherent in a thesaurus, in this case the Getty Art and Architecture Thesaurus (AAT).
The primary outcome of this project will be the design and implementation of a demonstrator search system and an evaluation of this system, with a view to its wider application in archaeological (and other) domains.
The Project
Context and Aims
Despite the advances made in web search engine technologies in recent years, they remain inefficient as a means of access to resources in that they may only link or enable searches of datasets in the most straightforward ways. They are thus unsuitable for the conceptual or subject-based searches that would be of positive value to academic research or serious public inquiry. As any user of a typical search engine may testify, significant differences in results stem from trivial variations in search statements and from related but differing conceptualizations of a research inquiry.
In addition, scholarly cross-domain research often involves multi-concept expressions of a research question or information need. Conventional tools do not facilitate the necessary generalisation of the search statement when an exhaustive search is required. Mapping between lay and specialist vocabularies is also problematic.
There is therefore a need for tools which will assist in the formulation and refining of the search process via the application of a controlled vocabulary of concepts.
The archaeological sector is already experienced in employing Knowledge Organisation Systems (KOS - such as thesauri) to assist semantic interoperability. However, such tools are often not fully integrated into searching and indexing systems and online practice has tended to replicate traditional print environments. The full potential of knowledge resources in online environments has yet to be explored.
The CIDOC Conceptual Reference Model (CRM) is emerging as a potential standard, high level ontology within cultural heritage. It is envisaged as ‘semantic glue’ mediating between and binding different sources and types of information. English Heritage have extended the CRM to reflect the processes and events involved in archaeological excavation and analysis. This has particular relevance for archaeological cross-domain research.
It is essential for retrieval systems to cope with faceted queries (i.e. allowing the dynamic synthesis of terms in indexing or searches). With Boolean searching it is necessary to articulate the same combination of terms at the same level of specificity as the indexer or author. Semantic expansion, assisted by facet structure, can reduce the recall problems caused by missing terms and partial matches. Systems such as FLAMENCO (FLexible information Access using MEtadata in Novel COmbinations), which uses hierarchical faceted metadata in a manner that enables users to modify queries, allow users flexible navigation through large information spaces.
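The idea of semantic expansion described above can be illustrated with a minimal sketch. It assumes a toy thesaurus held as a dictionary of narrower-term links; the terms, the facet name and the function names are all hypothetical, and the simple depth-limited traversal shown here is not the FACET algorithm, only an illustration of how expansion can recover documents indexed at a different level of specificity from the query.

```python
from collections import deque

# Hypothetical toy thesaurus: each term maps to its narrower terms.
# The terms are illustrative only and are not taken from a real thesaurus.
NARROWER = {
    "structure": ["building", "wall"],
    "building": ["house", "barn"],
    "wall": [],
    "house": [],
    "barn": [],
}


def expand_term(term, narrower, max_depth=2):
    """Return the term followed by its narrower terms, up to max_depth levels down."""
    expanded = [term]
    seen = {term}
    queue = deque([(term, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend below the depth limit
        for child in narrower.get(current, []):
            if child not in seen:
                seen.add(child)
                expanded.append(child)
                queue.append((child, depth + 1))
    return expanded


def expand_query(facets, narrower, max_depth=2):
    """Expand each facet of the query independently."""
    return {
        facet: expand_term(term, narrower, max_depth)
        for facet, term in facets.items()
    }


def matches(doc_terms, expanded_query):
    """A document matches if every facet has at least one of its expanded terms."""
    return all(
        any(term in doc_terms for term in terms)
        for terms in expanded_query.values()
    )
```

With this sketch, a document indexed only with "barn" is missed by an unexpanded query for "structure" (max_depth=0) but is retrieved once the query is expanded two levels down the hierarchy.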
In order to extract information from free text documents such as ‘grey literature’, the project will make full use of the range of tools that have resulted from recent advances in language engineering. Such tools include lexical databases such as WordNet, part-of-speech taggers, toolkits such as GATE and large corpora with context data for statistical methods. A combination of CRM, GATE and WordNet has already been applied in an ontology-driven extraction of artists’ biographies from Web documents (see: Alani H., Kim S., Millard D., Weal M., Hall W., Lewis P., Shadbolt N. 2003. Automatic Ontology-based Knowledge Extraction and Tailored Biography Generation from the Web. IEEE Intelligent Systems, 18(1), 14-21). Research at Glamorgan has investigated the integration of linguistic techniques with KOS; syntactic parsing and part-of-speech tagging help avoid ambiguities (e.g. where homographs occur) when mapping to controlled terminology.
Digital Archaeology has made full use of the web for many years to disseminate data and reports. For example, the HEIRNET portal (http://www.britarch.ac.uk/HEIRNET/) offers a Z39.50 search across a wide range of heritage information via a map-based interface, and OASIS (Online AccesS to the Index of archaeological excavations; http://ads.ahds.ac.uk/project/oasis/) makes possible online recording of fieldwork reports from the growing body of archaeological investigations. (The OASIS project is a collaboration between the Archaeology Data Service, English Heritage and the Archaeological Investigations Project, funded by the Research Support Libraries Programme.) This unified index contributes to the ADS online catalogue, ArchSearch. The ADS is building an online virtual library of ‘grey literature’, directly linked from the index. These are impressive, operational technical achievements. However, to date only a basic, fielded Boolean search is possible. No terminological or expansion tools are currently available to address the needs detailed above.
The research demonstrator that STAR is building will use, as its key materials, ‘grey literature’ and data drawn from the Roman (and some Iron Age) excavation reports from the English Heritage Raunds Project and other UK excavations. The demonstrator will be evaluated by collaborators and archaeological users and the evaluation process will also consider the cost/benefits of the detailed work in mapping the Raunds datasets to the CRM and the types of applications and scenarios where this could be most fruitful.
The Raunds project focuses on a large area in Northamptonshire, and is concerned mainly with Roman (and some Iron Age) materials. It is a highly complex, long-running project and, as English Heritage has strict standards in its commission process and the handling of data, it is ideal as a testing ground for STAR, in that there is an overall consistency in the data. In archival digital records, data is attached to quite small entities (‘contexts’), while ‘grey’ reports (and full publication) tend to speak in more general terminology, sometimes relating to larger entities (groups, phases). Site records have differing degrees of granularity - going from the site scale data held by geophysics databases, through interpretive, derived information about defined areas or features, down to individual contextual data. Contexts might be artefacts such as Wall, Floor, Flagstone, Posthole, Pit, etc. with identifiers and dimensions. For each individually defined context there are free text (interpretation) entries (e.g. “This context layer exhibited a degree of burning, with charcoal inclusions and was heavily compacted, suggesting it was a floor layer within the room defined by associated walls of the beam slots w,x,y,z”).
The Demonstrator will be evaluated by a range of users, following a task-based evaluation methodology (see: Nielsen M. L. 2004. Task-based evaluation of associative thesaurus in real-life environment. Proceedings ASIST 2004 Annual Meeting.). The aim will be to reflect real-life search situations and needs as far as possible. This also builds on evaluation experience in the FACET project, where rich details of the user’s interaction with the system were recorded via transcripts of think-aloud sessions, screen capture videos and log files of interactions.
The project will be represented at various computing and archaeological events, such as the annual CAA conferences. A presentation was given in November 2007 at the Second International Seminar on Subject Access to Information, organised by the International Relations Group of the Finnish Research Library Association.
Project outcomes will be published in academic conferences and journals in both digital archaeology and computer/information science.
Initial Progress
Mappings have been made from the CRM to three different database formats, where the data has been extracted to RDF and the mapping expressed as an RDF relationship. The data extraction process involved selecting key data relevant for STAR purposes from the following archaeological datasets:
- Raunds Roman Analytical Database (RRAD)
- Raunds Prehistoric Database (RPRE)
- York Archaeological Trust (YAT) Integrated Archaeological Database (IADB)
The approach taken for the exercise was to extract modular parts of the larger data model from the RRAD, RPRE and IADB databases via SQL queries, and store the data retrieved in a series of RDF files. This allowed data instances to be selectively combined later as required.
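The extraction approach described above (one SQL query per modular part of the data model, with the results serialised as RDF) might be sketched as follows. This is an illustration only: the table name, column names, the example.org namespace and the predicate names are all assumptions, not the actual RRAD, RPRE or IADB schemas or the project's CRM mapping. The output is written as N-Triples, one statement per line, so that separately extracted files can simply be concatenated when data instances need to be combined.

```python
import sqlite3

# Placeholder namespace for illustration; not the project's real URIs.
BASE = "http://example.org/star/"


def escape_literal(value):
    """Escape a value for use inside an N-Triples string literal."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(conn, query, subject_prefix, predicates):
    """Run one extraction query and return N-Triples lines.

    The first column of each row is assumed to be the record identifier;
    the remaining columns are paired, in order, with the given predicates.
    """
    lines = []
    for row in conn.execute(query):
        subject = f"<{BASE}{subject_prefix}/{row[0]}>"
        for predicate, value in zip(predicates, row[1:]):
            if value is None:
                continue  # no triple for missing data
            lines.append(
                f'{subject} <{BASE}{predicate}> "{escape_literal(value)}" .'
            )
    return lines


def demo():
    """Build a tiny in-memory database (hypothetical schema) and extract it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contexts (id INTEGER PRIMARY KEY, type TEXT, note TEXT)"
    )
    conn.executemany(
        "INSERT INTO contexts VALUES (?, ?, ?)",
        [(1, "Posthole", None), (2, "Floor", 'Compacted layer with "charcoal"')],
    )
    triples = rows_to_ntriples(
        conn,
        "SELECT id, type, note FROM contexts ORDER BY id",
        "context",
        ["contextType", "note"],
    )
    conn.close()
    return triples
```

In practice each query's output would be written to its own file, giving the series of modular RDF files the text describes.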
Thesaurus data was received from English Heritage National Monuments Record Centre, as CSV format files. It was converted into the standard SKOS RDF format for use in the project. STAR has also contributed terminology services to the DelosDLMS prototype next-generation Digital Library management system, built on the OSIRIS middleware environment (ETH Zurich and University of Basel). The service works with thesauri or related Knowledge Organization Systems (KOS) represented in SKOS format (Binding et al. 2007).
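The CSV-to-SKOS conversion mentioned above can be illustrated with a short sketch. The column names (id, label, broader_id), the sample rows and the example.org namespace are assumptions made for the example; the layout of the files actually supplied by English Heritage is not described here. Only the SKOS namespace and the skos:Concept, skos:prefLabel and skos:broader terms come from the SKOS vocabulary itself. The output is Turtle rather than RDF/XML, purely to keep the example short.

```python
import csv
import io

SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."
# Placeholder namespace for concept URIs; an assumption for this sketch.
BASE = "http://example.org/thesaurus/"

# Hypothetical sample data: the second term has the first as its broader term.
SAMPLE_CSV = "id,label,broader_id\n100,MONUMENT,\n101,BARROW,100\n"


def csv_to_skos_turtle(csv_text):
    """Convert thesaurus rows (id, label, broader_id) to SKOS in Turtle."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = [SKOS_PREFIX, ""]
    for row in reader:
        # Escape characters that are special inside a Turtle string literal.
        label = row["label"].replace("\\", "\\\\").replace('"', '\\"')
        statements = ["a skos:Concept", f'skos:prefLabel "{label}"@en']
        if row["broader_id"]:
            # Top-level terms have an empty broader_id and get no skos:broader.
            statements.append(f"skos:broader <{BASE}{row['broader_id']}>")
        out.append(f"<{BASE}{row['id']}> " + " ;\n    ".join(statements) + " .")
    return "\n".join(out)
```

Calling csv_to_skos_turtle(SAMPLE_CSV) yields two skos:Concept descriptions, with the narrower term linked to its parent through skos:broader. A result of this kind could then be checked with a SKOS validation service.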
Publications and Recommended Reading
- Alani H., Kim S., Millard D., Weal M., Hall W., Lewis P., Shadbolt N. 2003. Automatic Ontology-based Knowledge Extraction and Tailored Biography Generation from the Web. IEEE Intelligent Systems, 18(1), 14-21.
- Bates M. 2002. The Cascade of Interactions in the Digital Library Interface. Information Processing and Management 38, 381-400.
- Binding C., Tudhope D. 2004. KOS at your Service: Programmatic Access to Knowledge Organisation Systems. Journal of Digital Information, 4(4). http://journals.tdl.org/jodi/article/view/jodi-124/109
- Binding C., Brettlecker G., Catarci T., Christodoulakis S., Crecelius T., Gioldasis N., Jetter H-C., Kacimi M., Milano D., Ranaldi P., Reiterer H., Santucci G., Schek H-G., Schuldt H., Tudhope D., Weikum G. 2007. DelosDLMS: Infrastructure and Services for Future Digital Library Systems, 2nd DELOS Conference, Pisa. http://www.delos.info/index.php?option=com_content&task=view&id=602&Itemid=334 (pdf)
- Blocks D., Cunliffe D., Tudhope D. 2006. A reference model for user-system interaction in thesaurus-based searching. Journal of the American Society for Information Science and Technology, 57(12), 1655-1665, Wiley.
- Cripps P., Greenhalgh A., Fellows D., May K., Robinson D. 2004. Ontological Modelling of the work of the Centre for Archaeology. CIDOC CRM Technical Paper. http://cidoc.ics.forth.gr/technical_papers.html
- DELOS FP6 Network of Excellence on Digital Libraries. http://www.delos.info
- Doerr M., Hunter J., Lagoze C. 2003. Towards a Core Ontology for Information Integration. Journal of Digital Information, 4 (1), http://jodi.ecs.soton.ac.uk/Articles/v04/i01/Doerr/
- Hearst, Elliott, English, Sinha, Swearingen, and Yee (2002). Finding the Flow in Web Site Search. Communications of the ACM, 45 (9).
- Järvelin K., Kekäläinen J., Niemi T. 2001. Expansion tool: concept-based query expansion and construction. Information Retrieval, 4, 231-255.
- Nielsen M. L. 2004. Task-based evaluation of associative thesaurus in real-life environment. Proceedings ASIST 2004 Annual Meeting.
- Tudhope D., Binding C., Blocks D., Cunliffe D. 2006. Query expansion via conceptual distance in thesaurus indexed collections. Journal of Documentation, 62 (4), 509-533. Emerald.
- Tudhope D., Nielsen M. 2006. Introduction to Special Issue on Knowledge Organization Systems and Services. New Review of Hypermedia and Multimedia, 12(1), 3-9. Taylor & Francis
- Tudhope D., Binding C. 2006. Towards Terminology Services: experiences with a pilot web service thesaurus browser. ASIST Bulletin, 32(5), 6-9, June/July. Available online at http://www.asist.org/Bulletin/Jun-06/tudhope_binding.html
Materials, Tools and Methods
Project URL
http://hypermedia.research.glam.ac.uk/kos/star/
Source Material Used
‘Grey literature’; data drawn from excavations carried out by English Heritage and others.
Resource Created
Research Demonstrator (to be completed by end of project)
Tools used
- W3C RDF validation service (http://www.w3.org/RDF/Validator)
- W3C SKOS validation service (http://www.w3.org/2004/02/skos/validation)
- Drive RDF parser (http://www.driverdf.org)
- GraphViz graph visualisation software (http://www.graphviz.org)
- Altova SemanticWorks (http://www.altova.com/products/semanticworks/semantic_web_rdf_owl_editor.html)
- Altova XMLSpy (http://www.altova.com/products/xmlspy/xml_editor.html)
- JSLint – JavaScript validation tool (http://www.jslint.com/)
- Microsoft Visual Studio 2005 (http://msdn.microsoft.com/vstudio/)
- Protege (http://protege.stanford.edu/)
Standards used
CIDOC Conceptual Reference Model (CRM); SKOS (RDF/XML Representation of Thesauri); XML Web Services.
Subject Domain
Archaeology; computer science; library and information science.
Method Categories
Data Analysis; Data Publishing and Dissemination; Data Structuring and Enhancement.
Staff and Advisors
Principal Staff
- Douglas Tudhope, School of Computing, University of Glamorgan
Other staff members
- Ceri Binding (Research Fellow), School of Computing, University of Glamorgan
- Andreas Vlachidis (Research Student), School of Computing, University of Glamorgan
Collaborations and External Expertise
- Keith May, English Heritage
- Sarah May, English Heritage
- The Archaeology Data Service (and OASIS project)
- Marianne Lykke Nielsen, Royal School of Library and Information Science, Denmark
- Traugott Koch, Max Planck Institute, Berlin
- Martin Doerr, FORTH, Greece
- Anders Ardo, University of Lund, Sweden
AHDS Methods Taxonomy Terms
This item has been catalogued using a discipline and methods taxonomy.
Disciplines
- Archaeology
- Ancient History
- History
Methods
- Data Analysis - Content analysis
- Data Analysis - Data mining
- Data Analysis - Record linkages
- Data Analysis - Searching/querying
- Data Capture - Usage of existing digital data
- Data publishing and dissemination - Cataloguing / indexing
- Data publishing and dissemination - Searching/querying
- Data publishing and dissemination - Textual resource sharing