Monguor 蒙古儿,土 Data Collection


The content of this page was developed from the research of Dr. Arienne Dwyer,
Dr. Wang Xianzhen, Dr. Limusishiden (Li Dechun) and Ms. Lu Wanfang.


The documentation of Monguor was mostly carried out by native speakers, under the direction of Dr. Arienne Dwyer of the University of Kansas. The advantage of this method is that a very large amount of raw data could be collected; the drawback is that this material then had to be linguistically analyzed.

How the Project was Set Up

In 2000, Dr. Kevin Stuart, Dr. Wang Xianzhen, and Dr. Dechun Li collaborated with Dr. Dwyer to set up local data-collection teams. These were established in the three geographically distinct Monguor areas: Minhe, Huzhu and Tongren.

These regions were chosen because of the varying influence of Tibetan and Mandarin on their respective dialects. A fourth base was established at Xining for data processing.

Each area's team was provided with

Learn more about choosing hardware

Local native-speaker researchers were consulted about the most prominent genres of textual material in their regions (e.g. love songs, conversation, weddings), and urged to collect two high-quality samples of each.


Local researchers were trained in

  1. audio-visual recording techniques
  2. transcription
  3. data delivery procedures

Transcription was done using Transcriber. This was chosen for its intuitive user interface, which allowed the data collectors, who had no previous formal computer or linguistic training, to transcribe their data quickly and easily.

However, Dr. Dwyer's team later found TasX to be a more useful annotation tool, because of its open-source code and XML file formats. They have developed XSL stylesheets that transform their Transcriber files into TasX files, and are currently working on a simplified user interface for use by non-academic language documenters.
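The conversion described above can be illustrated with a minimal sketch. The project itself used XSL stylesheets; this Python version is only an illustration of the same idea. It assumes the `Turn` and `Sync` element names and the `startTime`, `endTime` and `time` attributes of Transcriber's XML files, and it writes a hypothetical simplified `tier`/`event` layout that is NOT the actual TasX schema.

```python
import xml.etree.ElementTree as ET

# A small Transcriber-style sample (assumed structure, invented content).
SAMPLE_TRS = """<Trans>
  <Episode>
    <Section type="report" startTime="0" endTime="7.5">
      <Turn startTime="0" endTime="7.5" speaker="spk1">
        <Sync time="0"/>
        first utterance
        <Sync time="3.2"/>
        second utterance
      </Turn>
    </Section>
  </Episode>
</Trans>"""


def extract_segments(trs_xml):
    """Return (start, end, text) tuples for each Sync-delimited segment."""
    root = ET.fromstring(trs_xml)
    segments = []
    for turn in root.iter("Turn"):
        turn_end = float(turn.get("endTime"))
        syncs = turn.findall("Sync")
        for i, sync in enumerate(syncs):
            start = float(sync.get("time"))
            # A segment ends at the next Sync, or at the end of the turn.
            if i + 1 < len(syncs):
                end = float(syncs[i + 1].get("time"))
            else:
                end = turn_end
            # The transcribed text follows the empty Sync element as its tail.
            text = (sync.tail or "").strip()
            segments.append((start, end, text))
    return segments


def to_tier_xml(segments, tier_name="transcription"):
    """Serialize segments to a hypothetical tier/event XML layout."""
    tier = ET.Element("tier", {"name": tier_name})
    for start, end, text in segments:
        event = ET.SubElement(tier, "event", {"start": str(start), "end": str(end)})
        event.text = text
    return ET.tostring(tier, encoding="unicode")


if __name__ == "__main__":
    print(to_tier_xml(extract_segments(SAMPLE_TRS)))
```

Keeping the time stamps on every segment is the point of the exercise: whatever the target format, each transcribed stretch stays linked to its position in the audio.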

The data providers were asked to annotate the audio recordings in two versions. One was to be a transcription of the material in a variety of pinyin adapted to accommodate Monguor phonology; the other was to be either a Mandarin or Tibetan translation, according to the speaker's preference.

Keyman was used to facilitate entry of nonstandard characters.


Dr. Dwyer wanted to collect metadata in the IMDI format. However, this version of metadata is complex and requires a special metadata editor. She soon realized that it was not viable for her data providers to learn yet another piece of software in order to input metadata into a database. Most providers had full-time jobs unrelated to the project, and in addition the editor was incompatible with the available Chinese operating systems. It was decided, therefore, to give providers paper forms to fill in with all the required metadata information.

[Image: metadata form]

This data was later entered by team leaders into the IMDI Metadata Editor.

Note that EMELD uses and recommends the Dublin Core metadata structure. This is the structure used by OLAC. However, for some purposes -- e.g. when a great deal of detail is needed for describing a particular resource -- the IMDI metadata set might be preferable. What metadata set is used depends on the needs of the researcher.
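As a rough illustration of what a Dublin Core description of a single recording might look like, the sketch below builds a small XML record in Python. The `dc` namespace URI and the fifteen element names are those of the Dublin Core element set; the `record` wrapper element and all field values are hypothetical and are not taken from the Monguor project.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements of the simple Dublin Core element set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def build_dc_record(fields):
    """Build a record from (element_name, value) pairs and return it as XML text."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(record, encoding="unicode")


# Invented values, for illustration only.
example = build_dc_record([
    ("title", "Wedding song recording (hypothetical example)"),
    ("language", "Monguor"),
    ("type", "Sound"),
])

if __name__ == "__main__":
    print(example)
```

A flat record like this is quick to fill in, which is its appeal; a more detailed set such as IMDI adds structure that a list of name-value pairs cannot express.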

Data cycle


  1. Local researchers make Audio-Visual recordings in Minhe, Huzhu and Tongren.
  2. Metadata is concomitantly recorded on paper forms.

Digitization and Archiving

  1. Approximately every six weeks, the Field Coordinator, Dr. Wang, makes the six-hour trip from the provincial capital Xining to each of the field sites and collects the original recordings and metadata. He then delivers these to Dr. Li and Ms. Lu in Xining for digitization.
  2. This digitization team transfers the audio-visual data to a CD or DVD. The audio track is split from video recordings so that local transcribing teams can open it in Transcriber; transcribing combined audio-visual files would have required learning and using yet another piece of software.
  3. The Field Coordinator then archives the original recordings in Xining and brings one copy back to the field sites for the local researchers to transcribe. This entails another six-hour trip for him.
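The audio-splitting step in this list can be sketched as follows. The page does not say which tool the digitization team used; this example assumes the `ffmpeg` command-line program is installed, and the file and directory names are hypothetical. The `-vn` option drops the video stream and `-acodec pcm_s16le` writes uncompressed 16-bit PCM audio.

```python
import subprocess
from pathlib import Path


def audio_extract_command(video_path, out_dir):
    """Build an ffmpeg command that writes the audio track of a video as WAV."""
    video = Path(video_path)
    wav = Path(out_dir) / (video.stem + ".wav")
    return [
        "ffmpeg",
        "-i", str(video),        # input video recording
        "-vn",                   # discard the video stream
        "-acodec", "pcm_s16le",  # uncompressed 16-bit PCM audio
        str(wav),
    ]


def extract_audio(video_path, out_dir):
    """Run the extraction; raises CalledProcessError if ffmpeg fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(audio_extract_command(video_path, out_dir), check=True)


if __name__ == "__main__":
    # Hypothetical file name; only prints the command, does not run it.
    print(" ".join(audio_extract_command("session01.avi", "audio")))
```

Separating command construction from execution makes it easy to check what would be run before processing a whole batch of recordings.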

Local annotation and digitization of metadata

  1. Local researchers do a simple Pinyin annotation, and a free translation into either Mandarin or Tibetan.
  2. These transcriptions are carried by the Field Coordinator back to Xining where the metadata is entered into a computer.

Data transfer to Kansas

  1. Annotations, but not media files, are emailed to the Kansas team.
  2. Copies of archived CDs and DVDs are carried annually to Kansas.


In November 2002, the first annotations began arriving at the University of Kansas from China. Chinese translations early on in the project, and Tibetan translations all through it, were being entered using non-Unicode fonts in Microsoft Word. The Kansas team thus had to

  1. convert these translations into Unicode-compliant XML
  2. manually correct any problems with this conversion
  3. time-align the translations to their audio files (in Transcriber)
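The first two steps above can be sketched in Python under a strong assumption: that the legacy text is available as bytes in a known code page. The page does not name the fonts or encodings involved; `gbk` is used here only as a plausible stand-in for the Chinese material, and text typed in a font-specific encoding (as may have been the case for Tibetan) would likely need a custom character mapping instead. The `translation` element name is hypothetical.

```python
from xml.sax.saxutils import escape


def legacy_to_xml(raw_bytes, source_encoding="gbk"):
    """Decode legacy-encoded bytes and wrap the text in a small XML document.

    Returns (xml_text, needs_review). needs_review is True when some bytes
    could not be decoded and were replaced with U+FFFD, so that a person
    can correct those passages by hand.
    """
    text = raw_bytes.decode(source_encoding, errors="replace")
    needs_review = "\ufffd" in text
    xml_text = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        # escape() protects &, < and > so the result stays well-formed.
        f"<translation>{escape(text)}</translation>\n"
    )
    return xml_text, needs_review


if __name__ == "__main__":
    sample = "土族 & 汉语".encode("gbk")  # simulated legacy input
    xml_text, needs_review = legacy_to_xml(sample)
    print(xml_text, needs_review)
```

Flagging undecodable passages rather than failing outright mirrors the workflow described here, in which automatic conversion is followed by manual correction.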

A standardized manual orthography was also established at this stage, based on an existing orthography originally designed for Huzhu Monguor.

Future plans

Dr. Dwyer believes that the use of hand-held PCs might make metadata collection easier in future. The chore of filling in a form for each day's recording, while easier than entering it into an Access database, was understandably of little interest to the local researchers and was often simply forgotten. A hand-held device could be carried with the recording equipment, and metadata could then be easily entered at the same time as the data was being recorded.

Follow the path of the Monguor data

  1. Get Started: Summary of the Monguor conversion
  2. Convert Data: Conversion page (Classroom)
  3. Create a Lexicon: FIELD tool (Workroom)
  4. Present Data: Stylesheets page (Classroom)

E-MELD School of Best Practices: Monguor: Data Collection