In this presentation, the individual requirements are set apart as numbered
statements to facilitate discussion. Similarly, the consequent features are set
out as subordinate statements, each bearing an identification letter, as in:
1. A requirement from the linguist's point of view
a. A consequent feature of the implementation
b. Another feature of the implementation
The first requirement deals with the need for longevity of access far into
the future. This aspect of language documentation and description is covered in
detail in Bird and Simons (2002); only a few key points are
noted here:
1. Language resources, especially those describing endangered languages, need
to be accessible by any interested party long into the future.
a. The archival form of electronically encoded resources should not be in a
proprietary, binary format, since the need for proprietary software limits the
audience who can access the resources and such formats are likely to become
obsolete and inaccessible within a few years.
b. The archival form of electronically encoded resources should not be an
interactive Web application, since upgrades to hardware and system software
typically cause these to cease to function within a few years.
c. The archival form of electronically encoded textual resources should be
based on clear text formats that can be read with any text editor and many
other tools.
d. The complexity of annotated texts is such that XML is the format of choice
for meeting these requirements.
Microsoft Word documents provide an example of a proprietary, binary format
that is not acceptable for long-term preservation of information. Plain text
documents formatted with line breaks and spaces are an example of a format that
meets requirements a through c; so are tab- or comma-delimited
representations of spreadsheets or data tables. But most language resources
have a more complex structure involving hierarchy and cross-reference, so a
more sophisticated representation is needed. Markup based on the XML standard
meets all the above requirements and is now supported by such a wide variety of
tools (both open and proprietary) that it has become the clear choice for
archival formats. But what should the nature of the markup vocabulary be?
2. Linguists need to be able to do more than just read language resources in
display format; they also need to be able to manipulate the content by
selectively accessing individual items of information.
a. The archival form of electronically encoded resources should not follow a
strategy of presentational markup; that is, the markup vocabulary should not be
one that simply identifies what the information will look like when displayed.
b. The archival form of electronically encoded language resources should
follow a strategy of descriptive markup; that is, the markup vocabulary should
identify what the individual pieces of information are from a linguistic point
of view.
c. The markup vocabulary for a particular language resource should identify
all of the elements of information that are contained within it, not just some
of them.
d. Users still need a presentational display of the resource; this should be
accomplished by applying a stylesheet to the descriptively marked up resource.
HTML markup, when applied to language resources, is an example of
presentational markup. Though it does have the features of longevity needed for
an archival format, it does not offer linguists the ability to do automated
processing of a linguistic nature, such as to answer the query "What are
the part-of-speech categories used in tagging this text?" For this purpose
a markup vocabulary that specifically identifies the linguistic significance of
each piece of information is needed. But simply having a markup vocabulary is
not enough; each marked-up resource also has a grammar that defines how the
individual markup elements may combine to form a valid resource.
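
The contrast between descriptive and presentational markup can be made concrete with a small sketch. The element names (word, form), the pos attribute, and the sample words below are illustrative assumptions, not a vocabulary proposed in this paper. With descriptive markup the part-of-speech query has a direct answer; with presentational markup a program can only collect everything set in italics, whatever it happens to mean.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptive markup: each piece of information is named
# for what it is linguistically.
DESCRIPTIVE = (
    '<text>'
    '<word pos="N"><form>dogs</form></word>'
    '<word pos="V"><form>bark</form></word>'
    '<word pos="ADV"><form>loudly</form></word>'
    '</text>'
)

# Hypothetical presentational markup of the same text: bold for forms,
# italics for tags and also for an unrelated editorial remark.
PRESENTATIONAL = (
    '<p><b>dogs</b> <i>N</i> <b>bark</b> <i>V</i> '
    '<b>loudly</b> <i>ADV</i> <i>(elicited)</i></p>'
)


def pos_categories(xml_text):
    """Return the sorted set of part-of-speech tags used in the text."""
    root = ET.fromstring(xml_text)
    return sorted({w.get("pos") for w in root.iter("word") if w.get("pos")})


def italic_strings(html_text):
    """Return every italicized string; nothing says which ones are tags."""
    root = ET.fromstring(html_text)
    return [i.text for i in root.iter("i")]


print(pos_categories(DESCRIPTIVE))
print(italic_strings(PRESENTATIONAL))
```

The second function cannot separate the tags from the editorial remark without guessing, which is the limitation of presentational markup described above.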
3. The linguist creating a language resource needs the markup of the
resource to be consistent with his or her plan for its content and structure.
a. Best practice requires the use of a markup schema (such as an XML DTD or an
XML schema) to validate a given resource as conforming to its plan for the
content and structure.
b. A single markup schema that sanctions all common practices in structuring
the content of a particular kind of resource will be too permissive to
constrain any single resource to the specific plan of its creator. (See
Simons 1998 for a discussion of this point with respect to the TEI DTD for
print dictionaries.)
c. There is enough convergence of practice that it will be possible to develop
one or more specific markup schemas that can be recommended for widespread use
while being adequately constraining.
d. There will always be plans for content and structure that are unique enough
to require that a unique markup schema be devised for the resource. (This is
one of the conclusions of the 2001 EMELD workshop.)
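
The point about permissive schemas can be illustrated with a rough sketch. It is not a DTD or XML Schema validator: the content models are written as regular expressions over child element names, and the element names (entry, headword, pos, sense, definition, example, note) are hypothetical. In practice one would validate against a real DTD or schema with a validating parser.

```python
import re
import xml.etree.ElementTree as ET

# Each content model is a regular expression over the child element names,
# where every name is followed by a single space.
PERMISSIVE = {
    "entry": r"((headword|pos|sense|note) )*",
    "sense": r"((definition|example|note) )*",
}
STRICT = {
    "entry": r"headword pos (sense )+",
    "sense": r"definition (example )*",
}

# The creator planned for every entry to carry a part of speech,
# but this entry lacks one.
ENTRY = (
    "<entry><headword>dog</headword>"
    "<sense><definition>a domestic canine</definition></sense></entry>"
)


def violations(element, schema):
    """Return the names of elements whose children break their content model."""
    problems = []
    model = schema.get(element.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in element)
        if re.fullmatch(model, children) is None:
            problems.append(element.tag)
    for child in element:
        problems.extend(violations(child, schema))
    return problems


entry = ET.fromstring(ENTRY)
print(violations(entry, PERMISSIVE))
print(violations(entry, STRICT))
```

The permissive model accepts the entry even though it departs from the creator's plan, while the stricter model flags it.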
These consequences of requirement 3 thus mean that there will be multiple
markup schemas, even in the context of best practice. In order to achieve
interoperability of resources when there are multiple markup schemas, we will
need to introduce a meta-level in our approach to markup:
4. Linguists need to be able to query and otherwise manipulate multiple
language resources in a single operation, even though they may individually
have different markup vocabularies and schemas.
a. As a foundation for interoperability, there must be a shared ontology
(Langendoen and others 2002) for the kinds of information that are marked up in
language resources.
b. As the bridge to interoperability, each resource must have a metaschema
that formally documents how the elements and attributes of its markup schema
map onto the concepts of the common ontology.
c. The metaschema must be separate from the language resource (rather than
being an integral part of it) so that multiple resources can share the same
metaschema.
d. It must be possible for a third party to create a metaschema for a resource
that lacks one without changing the resource itself. (This implies that the
linkage from metaschema to schema to resource is specified through metadata.)
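
As a rough illustration of the metaschema idea, the sketch below queries two resources with different markup vocabularies through one ontology concept. Everything in it is an assumption made for the example: the concept name PartOfSpeech, the dictionary layout of a metaschema, and both sample vocabularies. A real metaschema would be a formal document tied to the shared ontology and would also need to map tag values, not just element names.

```python
import xml.etree.ElementTree as ET

# Two hypothetical resources with different markup vocabularies.
RESOURCE_A = '<text><w cat="N">dogs</w><w cat="V">bark</w></text>'
RESOURCE_B = (
    "<doc>"
    "<token><form>cats</form><pos>N</pos></token>"
    "<token><form>quietly</form><pos>ADV</pos></token>"
    "</doc>"
)

# Hypothetical metaschemas, held apart from the resources themselves.
# Each names the element that carries a concept and says whether the
# value lives in an attribute or in a child element.
METASCHEMA_A = {"unit": "w", "concepts": {"PartOfSpeech": ("attribute", "cat")}}
METASCHEMA_B = {"unit": "token", "concepts": {"PartOfSpeech": ("child", "pos")}}


def values_of(concept, resource_xml, metaschema):
    """Collect the values of one ontology concept from one resource."""
    kind, name = metaschema["concepts"][concept]
    found = []
    for unit in ET.fromstring(resource_xml).iter(metaschema["unit"]):
        value = unit.get(name) if kind == "attribute" else unit.findtext(name)
        if value is not None:
            found.append(value)
    return found


def query_all(concept, resources):
    """Run one query over several (resource, metaschema) pairs."""
    result = set()
    for resource_xml, metaschema in resources:
        result.update(values_of(concept, resource_xml, metaschema))
    return sorted(result)


print(query_all("PartOfSpeech", [(RESOURCE_A, METASCHEMA_A),
                                 (RESOURCE_B, METASCHEMA_B)]))
```

Because the mappings are held outside the resources, a third party could supply METASCHEMA_B without touching RESOURCE_B, and several resources that share a vocabulary could share one metaschema.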
Finally, it is not enough that electronically encoded resources are created.
They must also be found and used by others long into the future. This implies a
final set of consequences having to do with archiving.
5. Linguists, educators, speakers of the language, and any other interested
citizens of the world need to be able to find and use electronically encoded
language resources long into the future.
a. Electronically encoded language resources (with associated schemas and
metaschemas) must be deposited into archives that can guarantee their long-term
preservation and access.
b. In order to make it possible for potential users of the resource to
discover that the resource exists, a metadata description of it needs to be
written and published in a searchable catalog of worldwide language resources.
c. In order to help potential users of the resource judge the relevance of the
resource, the metadata description needs to include information such as
identification of the specific language and a characterization of what type of
resource it is from a linguistic point of view.
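
The role of the metadata description in discovery can be sketched as follows. The record fields, the placeholder language codes, the type labels, and the archive names are all hypothetical; this is not an actual catalog format, only an indication of why the language and the linguistic type of a resource need to be recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceDescription:
    """A hypothetical metadata record for one archived resource."""
    title: str
    language_code: str    # placeholder codes, not real identifiers
    linguistic_type: str  # what kind of resource it is, linguistically
    archive: str


CATALOG = [
    ResourceDescription("Field lexicon", "xxa", "lexicon", "Archive One"),
    ResourceDescription("Narrative texts", "xxa", "primary_text", "Archive Two"),
    ResourceDescription("Sketch grammar", "xxb", "language_description", "Archive One"),
]


def search(catalog, language_code=None, linguistic_type=None):
    """Return the records that match every criterion that was supplied."""
    return [
        record for record in catalog
        if (language_code is None or record.language_code == language_code)
        and (linguistic_type is None or record.linguistic_type == linguistic_type)
    ]


for record in search(CATALOG, language_code="xxa"):
    print(record.title, "held by", record.archive)
```

A search restricted by both language and type, such as search(CATALOG, "xxa", "lexicon"), narrows the results to the resources a user is likely to find relevant.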
The Open Language Archives Community is already in place with an
infrastructure that meets these needs, and EMELD will build on this
infrastructure.