Funding for the Methods Network ended March 31st 2008. The website will be preserved in its current state.

Using Large-Scale XML Corpora in Language and Literature Workshop Report

Report by Lou Burnard

Introduction

Since its first release in 1994/5, the British National Corpus (BNC) has become a key resource for researchers, learners, and teachers in English language teaching, linguistics, Natural Language Processing, lexicography, cultural studies, and many related fields. It remains amongst the best known and most frequently accessed resources of its type worldwide. In March 2007, a new edition of the corpus was released in XML format. The decision to convert the corpus into XML was based on a number of factors: XML is increasingly the standard for online text creation and publication; tools for processing XML resources are ubiquitous; other linguistic resources comparable to the BNC are increasingly created using XML. Converting the BNC into XML thus improves its usability by making it possible for users to access it with their own tools, drawn from a wide range of new sources, and to integrate it with other resources.
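As a rough illustration of what accessing an XML corpus "with their own tools" can look like, the sketch below uses only Python's standard library to read a small word-tagged fragment and tally word classes. The fragment is invented for illustration, and the element and attribute names (`s`, `w`, `c`, `hw`, `pos`, `c5`) are assumptions about the markup rather than a statement of the BNC XML Edition's schema; the helper names `pos_counts` and `headwords` are likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample sentence. Element and attribute names are assumptions
# made for this sketch, not taken from the workshop report.
SAMPLE = """
<s n="1">
  <w c5="AT0" hw="the" pos="ART">The </w>
  <w c5="NN1" hw="corpus" pos="SUBST">corpus </w>
  <w c5="VBZ" hw="be" pos="VERB">is </w>
  <w c5="AJ0" hw="large" pos="ADJ">large</w>
  <c c5="PUN">.</c>
</s>
"""


def pos_counts(xml_text):
    """Count how often each value of the 'pos' attribute occurs on <w> elements."""
    root = ET.fromstring(xml_text)
    return Counter(w.get("pos") for w in root.iter("w"))


def headwords(xml_text):
    """Return the 'hw' (headword) attribute of each <w> element, in document order."""
    root = ET.fromstring(xml_text)
    return [w.get("hw") for w in root.iter("w")]


if __name__ == "__main__":
    print(pos_counts(SAMPLE))
    print(headwords(SAMPLE))
```

Because the markup is plain XML, the same few lines would work unchanged with any other standards-compliant parser, which is the kind of tool independence the conversion was meant to provide.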

Despite its wide take-up on the internet, XML remains less well understood by researchers and resource users from non-technical backgrounds, who may therefore find it difficult to locate or make use of existing guidance on the opportunities that XML offers.

This one-day workshop therefore aimed to introduce the technologies needed to unlock the potential uses of large-scale XML-encoded language corpora, with a particular focus on the BNC XML Edition. The workshop was aimed at two distinct groups of researchers. The first group consisted of language or literature specialists who were aware of the potential for corpus-based methods in language pedagogy or literary research and wanted to apply them either with their own corpus material or with the BNC in its new format. The second group was made up of technical specialists who were aware of the demand for corpus resources and wanted to gain practical experience of using XML for corpus creation, development, and usage. Through the workshop we hoped to stimulate dialogue between the two groups and to promote a shared understanding of common goals.

Read the report...

AHDS Methods Taxonomy Terms

This item has been catalogued using a discipline and methods taxonomy. Learn more here.

Disciplines

  • English Literature and Languages
  • European Literature and Languages
  • Linguistics
  • History

Methods

  • Data Analysis - Collating
  • Data Analysis - Collocating
  • Data Analysis - Concording/Indexing
  • Data Analysis - Content analysis
  • Data Capture - Usage of existing digital data
  • Data publishing and dissemination - Textual resource sharing
  • Data Structuring and enhancement - Markup/text encoding - descriptive - conceptual
  • Data Structuring and enhancement - Markup/text encoding - descriptive - document structure
  • Data Structuring and enhancement - Markup/text encoding - descriptive - linguistic structure
  • Data Structuring and enhancement - Markup/text encoding - descriptive - nominal
  • Data Structuring and enhancement - Markup/text encoding - presentational
  • Data Structuring and enhancement - Markup/text encoding - referential
  • Strategy and project management - Usability analysis