Funding for the Methods Network ended March 31st 2008. The website will be preserved in its current state.

The Sheffield Corpus of Chinese

The initial feasibility study, 'Chinese Texts in Electronic Form for Linguistic Analysis', was based on a limited number of Chinese text samples from the Song (960-1279), Ming (1368-1644) and Qing (1644-1911) dynasties: a piece of formal prose from the Southern Song period, a work of martial arts fiction from the Ming dynasty, and a work of general fiction from the Qing dynasty.

The School of East Asian Studies and the Humanities Research Institute at the University of Sheffield have collaborated on the continuing development of the Sheffield Corpus of Chinese (SCC). The feasibility study, which ran from 7 December 2003 to 8 September 2004, was funded by the British Academy. Despite limited funding, the project continues to make progress.

The Project

The objectives achieved at the conclusion of the feasibility study were as follows:

  • The Sheffield Corpus of Chinese was established.
  • At the end of the feasibility study the SCC, the first diachronic corpus of Chinese of its kind, was in rudimentary form and contained three texts representing three historical periods: a piece of formal prose from the Southern Song period, a work of martial arts fiction from the Ming dynasty, and a work of general fiction from the Qing dynasty. Substantial progress has since been made, and the SCC now contains 41 fully marked-up texts.
  • The text material used for the feasibility study amounted to c. 18,000 words. The corpus now has c. 420,000 words.
  • All the texts are segmented and every character is marked up (tagged) for part of speech.
  • Whole chapters, rather than uniformly sized samples, have been used for the corpus, because we believe that a corpus made up of whole documents is open to a wider range of linguistic studies than a collection of short samples.
  • Parallel English translations (c. 74,000 words) were added to the three initial texts to broaden the accessibility of the corpus and to facilitate contrastive study of English and Chinese. If adequate funding is forthcoming, English translations of other texts in the corpus will be added.
  • The website (URL: http://www.sheffield.ac.uk/scc) presents the database and the texts. It allows the texts to be read in a conventional fashion, or searched and analysed character by character, by word category, and so on.
  • The integral full-text search and analysis system enables linguistic analysis to be carried out and can, for example, generate search tables and statistical data.
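
As a rough illustration of character-based searching with a period filter, the Python sketch below builds a simple keyword-in-context listing. It is not the SCC's own search system: the `Text` record, the `concordance` function, the `width` parameter and the sample strings are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Text:
    """A hypothetical corpus record; the field names are assumptions."""
    title: str
    period: str
    genre: str
    body: str


def concordance(texts, keyword, period=None, width=4):
    """Return (title, left context, keyword, right context) for every hit.

    If `period` is given, only texts from that period are searched.
    `width` is the number of characters of context kept on each side.
    """
    hits = []
    for text in texts:
        if period is not None and text.period != period:
            continue
        start = text.body.find(keyword)
        while start != -1:
            end = start + len(keyword)
            left = text.body[max(0, start - width):start]
            right = text.body[end:end + width]
            hits.append((text.title, left, keyword, right))
            # Continue searching from the next character position.
            start = text.body.find(keyword, start + 1)
    return hits


# Invented sample data, not taken from the corpus.
SAMPLE_TEXTS = [
    Text("sample A", "Song", "prose", "天下太平人行天下"),
    Text("sample B", "Ming", "fiction", "人行之"),
]

if __name__ == "__main__":
    for hit in concordance(SAMPLE_TEXTS, "天下"):
        print(hit)
```

Filtering on genre could be added in the same way as the `period` check.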

The long-term aim of the Sheffield Corpus of Chinese (SCC) is to provide an extensive digital resource of fully marked-up Chinese texts from the tenth century BC to 1911, arranged in seven sub-periods and 16 genres. The corpus, with its integral analysis software, will be a valuable research facility for scholars engaged in diachronic studies of the development and varieties of the Chinese language, and it will help to widen the scope of earlier investigations in Chinese historical linguistics. Studies of the historical syntax of Chinese often lack a thorough diachronic investigation across data from different periods of Chinese history, which limits what they can show about how the language changed. The main reason for this omission is the lack of readily available and suitable corpora of historical Chinese texts, let alone corpora of fully marked-up texts. As the number of texts increases, the SCC will address this general lack of corpora for a wide range of historical linguistic studies.

Primary Outcomes

The feasibility study of the 'Chinese Texts in Electronic Form for Linguistic Analysis' project established the initial form of the SCC, which continues to develop. All the texts in the SCC were segmented and part-of-speech tagged using a mark-up scheme developed in XML. At the completion of the feasibility study, the SCC had a tag set of 21 word classes with 49 categories, together with a full-text retrieval and search system that could locate user-specified keywords and produce frequency tables for them, both character by character and by word category. Parallel English translations were also added. The application of XML to Chinese is still at an early stage, so the establishment of the SCC has made a significant contribution to applying this technology to the language. As the SCC continues to develop, it will address the lack of diachronic corpora with fully marked-up texts and will promote and facilitate a wide range of diachronic, linguistic and other studies.
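
To make the idea of frequency tables over XML mark-up more concrete, here is a minimal Python sketch. The element and attribute names (`text`, `s`, `w`, `pos`, `period`, `genre`), the tag values and the sample sentences are assumptions made for illustration; the SCC's actual mark-up scheme and tag set are not reproduced here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical mark-up: each <w> is one segmented word carrying a
# part-of-speech tag in its "pos" attribute.
SAMPLE = """<text period="Ming" genre="fiction">
  <s><w pos="n">天下</w><w pos="a">太平</w></s>
  <s><w pos="n">人</w><w pos="v">行</w><w pos="n">天下</w></s>
</text>"""


def category_frequencies(root):
    """Count words per part-of-speech category."""
    return Counter(w.get("pos") for w in root.iter("w"))


def character_frequencies(root):
    """Count individual characters across all tagged words."""
    return Counter(ch for w in root.iter("w") for ch in (w.text or ""))


def keyword_frequency(root, keyword, pos=None):
    """Count occurrences of a word, optionally restricted to one category."""
    return sum(
        1
        for w in root.iter("w")
        if w.text == keyword and (pos is None or w.get("pos") == pos)
    )


root = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    print(category_frequencies(root).most_common())
    print(character_frequencies(root).most_common())
    print(keyword_frequency(root, "天下", pos="n"))
```

Keeping the part-of-speech tag as an attribute on each word element means the same tree can serve both the character-based and the category-based counts.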

Publications/Further Reading

Hu, X. L., Williamson, N. and McLaughlin, J. (2005) 'Sheffield Corpus of Chinese for Diachronic Linguistic Study', Literary and Linguistic Computing 20:3, 281-293.

Hu, X. L. and McLaughlin, J. (under review) 'Syntactic positions of prepositional phrases in the history of the Chinese language: using the developing Sheffield Corpus of Chinese for diachronic linguistic study'.

Tools and Methods

Tools

XML (eXtensible Markup Language), XSLT (eXtensible Stylesheet Language Transformations); SAXON (http://www.saxonica.com/); MySQL (http://www.mysql.com/)

Method Categories

Data Structuring and Enhancement; Data Analysis; Data Publishing and Dissemination.

Specific Methods

Automated grammatical mark-up of Chinese texts from a cumulative in-house lexicographical list; rule-based analysis and correction of the mark-up according to grammatical context; search and analysis of the corpus by character, grammatical context, subcorpus period and genre.
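
The two-stage approach described above (lexicon-driven mark-up followed by context-sensitive correction) can be sketched in Python as follows. This is only an illustration under stated assumptions: the source does not say which matching algorithm the project uses, so greedy forward maximum matching is substituted here, and the `LEXICON` entries, the tag names and the single correction rule for 之 are hypothetical.

```python
# Hypothetical lexicon mapping each known word to a default tag.
LEXICON = {
    "天下": "n",
    "太平": "a",
    "人": "n",
    "行": "v",
    "之": "part",  # default reading: structural particle
}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment_and_tag(text):
    """Segment by greedy forward maximum matching and assign default tags."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in LEXICON:
                tokens.append((candidate, LEXICON[candidate]))
                i += length
                break
        else:
            # Unknown character: keep it as a single token flagged for review.
            tokens.append((text[i], "unk"))
            i += 1
    return tokens


def apply_context_rules(tokens):
    """Illustrative rule: retag 之 as a pronoun when it directly follows a verb."""
    corrected = list(tokens)
    for k in range(1, len(corrected)):
        word, _tag = corrected[k]
        if word == "之" and corrected[k - 1][1] == "v":
            corrected[k] = (word, "pron")
    return corrected


if __name__ == "__main__":
    print(apply_context_rules(segment_and_tag("人行之")))
    print(apply_context_rules(segment_and_tag("天下之人")))
```

Separating the default tagging from the correction pass keeps the lexicon simple and lets each contextual rule be checked on its own.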

Metadata Standards

The texts are stored in Unicode UTF-8 format. Metadata on the texts is currently stored in the project's own internal format. The metadata will be expressed in a standardised way (probably Dublin Core) as the corpus is developed.
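
If Dublin Core is adopted, a record for one text might be serialised along the lines of the Python sketch below. The record contents, the `metadata` wrapper element and the `to_dublin_core` function are hypothetical; only the namespace URI and the fifteen element names come from the Dublin Core Metadata Element Set. The project's internal metadata format is not shown.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def to_dublin_core(record):
    """Serialise a dict of Dublin Core fields to UTF-8 encoded XML bytes."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for field, value in record.items():
        if field not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {field}")
        element = ET.SubElement(root, f"{{{DC_NS}}}{field}")
        element.text = value
    return ET.tostring(root, encoding="utf-8")


# Invented example record, not real corpus metadata.
EXAMPLE_RECORD = {
    "title": "Sample text (hypothetical)",
    "language": "zh",
    "type": "Text",
    "format": "text/xml",
    "coverage": "Ming dynasty",
}

if __name__ == "__main__":
    print(to_dublin_core(EXAMPLE_RECORD).decode("utf-8"))
```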

Project Website

http://www.sheffield.ac.uk/scc

http://www.hrionline.ac.uk/scc

Staff and Advisors

Principal Staff Member

  • Dr. Xiaoling Hu, School of East Asian Studies, University of Sheffield

Other Staff Members

  • Nigel Williamson, Humanities Research Institute, University of Sheffield
  • Jamie McLaughlin, Humanities Research Institute, University of Sheffield

AHDS Methods Taxonomy Terms

This item has been catalogued using a discipline and methods taxonomy.

Disciplines

  • Non-European Literature and Languages
  • Linguistics

Methods

  • Data Analysis - Collating
  • Data Analysis - Collocating
  • Data Analysis - Concording/Indexing
  • Data Analysis - Content analysis
  • Data Structuring and enhancement - Markup/text encoding - descriptive - conceptual
  • Data Structuring and enhancement - Markup/text encoding - descriptive - document structure
  • Data Structuring and enhancement - Markup/text encoding - descriptive - linguistic structure
  • Data Structuring and enhancement - Markup/text encoding - descriptive - nominal
  • Data Structuring and enhancement - Markup/text encoding - presentational
  • Data Structuring and enhancement - Markup/text encoding - referential