
Non-mainstream Languages and Speech Recognition: Some Challenges

KRISTIN PRECODA
SRI International

Abstract:
Most languages of the world have not been the focus of a speech recognition development effort, and the choice of technical approaches best suited to a language can be substantially impacted by the cultural context surrounding it. As the technologist and the teacher of or expert in a non-mainstream language and its culture are typically not the same person, issues that are self-evident to one may come as a surprise to the other. The goal of this paper is to add one plank to the bridge between these two areas of expertise by highlighting some aspects of non-mainstream linguistic contexts that pose challenges to the usual model of speech recognition system development and by suggesting alternative ways to meet these challenges.

KEYWORDS

Automatic Speech Recognition (ASR), Speech Technology, Less Commonly Taught Languages (LCTL), Computer-assisted Language Learning (CALL)

INTRODUCTION

Non-mainstream languages, or less commonly taught languages (LCTLs), include most of the languages of the world. Some languages, such as Thai or Indonesian, are spoken by very large populations but are not often taught, for historical reasons. Political events or natural disasters may bring others to prominence; for instance, some competence in languages such as Serbian or Somali may be needed by relief workers, peacekeepers, and diplomats. Revitalization or maintenance efforts may be undertaken for endangered languages such as Hawaiian or Lakota. In all of these situations, speech-enabled software can be especially useful for learners since practice opportunities with fluent speakers, movies, broadcasts, and other materials may be quite limited. It would be desirable to have speech recognition technology available in LCTLs with capabilities similar to those available to learners of English, Spanish, and other “major” languages so that learners may practice limited dialogues, learn vocabulary through oral repetition, receive feedback on their pronunciation, and so on.

However, speech recognition research and technology have largely developed in an environment focused on a very small number of languages, and assumptions that hold for those languages have strongly influenced speech recognition methods and approaches. One of the most important assumptions is that various kinds of resources are available in large quantities at relatively low cost. These resources take a number of different forms, including speakers, specifically speakers of an age demographic matched to the learners; speech recordings of suitable quality and diversity; word lists with phonemic transcriptions, in electronic form; texts varied in topic and genre, also in electronic form; linguistic descriptions with accounts of dialectal variation; and so on. Many of these resources are not currently available, nor are they inexpensive to create, in most languages of the world. Developing speech recognition technology for an LCTL can therefore pose real challenges, and the differences among languages and their contexts mean that the particular constellations of challenges to be met are nearly as varied as the languages themselves. In this paper, we discuss a few issues recently encountered in work with Lakota, an American Indian language spoken in South Dakota by an estimated 6,000 speakers (Grimes & Grimes, 2002), and Pashto, an Indo-Iranian language spoken primarily in Afghanistan and Pakistan by around 23 million speakers (Tegey & Robson, 1996). Examples will be given of practical repercussions of these issues.

One challenge is the number of different speakers required in order to build a speaker-independent recognizer. Individuals vary in vocal tract size and shape and in speech style. Thus, for a speech recognizer to be able to operate effectively for any given language learner, it must be trained on data from a large speaker population. For some non-mainstream languages, it can be difficult or impossible to record what would generally be considered a sufficient number of speakers.

A second issue is the phonological diversity of acoustic data required. Most state-of-the-art speech recognizers use statistical models representing “phonemes in context” or “triphones,” that is, a model typically represents a given phoneme with a particular left neighbor phoneme and a particular right neighbor phoneme. In languages with any but the very smallest phoneme inventories, the number of attested triphones can be very large. In addition, many examples of each triphone must be collected because triphones do not capture all the contextual factors influencing a phonetic realization, and the models must also include allophones induced by context beyond the triphone. This large number of phonetic units, coupled with speaker variability, leads to a need for thousands of lexically different utterances, potentially representing an expensive data collection effort.

Third, an electronic word list with phonemic transcriptions is needed to supply the vocabulary which can be recognized. Creating such a list can be a substantial project for any language. However, in “major” languages such lists may exist already or be obtainable from conventional dictionaries which use some standard transcription. Lists of vocabulary may be obtained from electronic texts, and the phonological inventory may be well described and understood. In non-mainstream languages, fewer resources may be available to support the creation of word lists.

A fourth source of challenges is the status of literacy and orthographies. Many languages simply have low literacy rates. In some languages, one or more writing systems may exist but not be standardized, either across writers or even within a single writer's usage. A consistent written representation must be used for transcribing acoustic data and may be used for collecting speech recordings, so it is necessary to develop such a representation and to train people working in the speech recognition development effort to use it. The effects of nonstandardized orthographies and low literacy can be wide-ranging and greatly increase the difficulty of gathering and preparing acoustic data for use in training a speech recognition system.

An important characteristic of language education applications is that the tasks they need to accomplish and the conditions under which they must work differ from those of more general-purpose speech recognizers. The specificity of their goals can be exploited to build a recognizer with acceptable performance in a language-learning context, as will be emphasized repeatedly in the discussion below.

Of overarching importance are creativity and flexibility in meeting technical requirements with cultural sensitivity.

REQUIREMENT: LARGE NUMBER OF SPEAKERS

Technological Desiderata

A speaker-independent speech recognizer is one that can recognize many different voices without having to be trained for each individual speaker; this is often a desired capability for language-learning software. Acoustic models, or statistical models of each sound unit, are built using many occurrences of each sound unit in many different phonetic environments. As individuals vary in vocal tract size and shape and in speech style, the parameters of statistical acoustic models for a speaker-independent recognizer must be estimated with speech data from many speakers, typically tens or hundreds. This speech data should be, to the extent possible, representative of the population and conditions of the eventual application. For example, if the software is to be used by children of a certain age group, the speech data should focus on or at least include children of that age group to maximize recognition accuracy because the acoustics of voices change with the age of the speakers. If the speech-recognition-enabled software will be used in a quiet room with a close-talking microphone, the speech recordings should ideally be made in similar acoustic conditions with similar microphones and channel characteristics.

For simple speech recognition, the acoustic models should match and therefore recognize speech both from native or near-native speakers and from learners with a variety of more or less nonnative accents. Some speech recognizers also provide pronunciation feedback, which requires a comparison between the learner's speech and some reference, typically the target accent being taught. In one ideal scenario, two sets of acoustic models are used: one for best recognition performance, trained with natively and nonnatively accented speech, and one for pronunciation feedback, trained only with speech of the target accent to be learned.


Real-world Limitations

Some languages unfortunately simply do not have many speakers. Approximately half the world's languages, or several thousand languages, have fewer than about 10,000 speakers (Graff, 2001), and the number who are practically available and are willing to be recorded may be very small indeed. In many endangered languages, speakers available for recording may not be of the desired demographic. For instance, the most fluent speakers may be elderly while the target learners may be children, and this acoustic mismatch will harm recognition performance. There may be cultural restrictions on who is considered a suitable subject for recording, such as a belief that elders are necessarily “better” speakers, or restrictions that permit working only with someone of the same gender to make the recordings. Some communities feel that the possible benefits of the technology might be outweighed by the possible harm if the technologist is not a member of the community.

A sufficient number of speakers to model the target accent to be taught may not be available, rendering pronunciation feedback difficult. It is also possible that there is no accent that speakers agree is “standard.” Calling one dialect “standard” may imply that other dialects are less prestigious and may be socially unacceptable to some or all groups of speakers; alternatively, speakers of several dialects may each regard their own as the standard.

In some languages in which the only remaining speakers are elderly, poor dentition may mean that an entire series of phonemes is distorted in the speech of the available speakers.

Approaches

There are both technological and nontechnological approaches to the problem of a shortage of speakers. We will first review some nontechnological approaches which may be used in tandem with the simplest techniques for estimating the parameters of acoustic models and then mention some more complex technological solutions.

In mainstream languages, native speakers are usually readily available; this may or may not be the case in non-mainstream languages. In point of fact, however, speakers generating acoustic data do not need to be native, fluent, or even competent speakers of the language as long as they can be prompted with content to be repeated or read aloud: they only need to have good pronunciation skills. For example, there are relatively few teenagers or young adults who consider themselves fluent in Lakota, but more who have native-like pronunciation and who can thus provide suitable acoustic data. If the recognizer is to be used for teaching purposes, it is often worthwhile to collect nonnatively accented speech as well since this speech may increase recognition accuracy for learners. As noted above, models trained with speech with a nonnative accent or nontarget native accent will not be appropriate for providing pronunciation feedback.

“Found,” rather than elicited, data may increase the number and diversity of speakers available. Found data includes broadcasts, oral study materials, and recordings that may have been collected for other purposes (as permitted by copyright). One difficulty with using found data is that the recording conditions may not match those in which the recognizer is to be used and may not yield recognition performance as good as matched data would. Segments with background music or overlapping speakers are generally not very useful. In addition, if a few speakers contribute most of the speech, as is often the case in broadcasts, this may impact recognition performance for other speakers. Found data must thus be used carefully.

Finally, good recognition performance in a clearly defined educational application can be achieved without building a more general-purpose recognizer. To achieve this performance, the expected learner input can be carefully directed and managed so that the speech recognition grammars can be maximally acoustically distinct to avoid having to distinguish between confusable phrases. For example, in many tasks learners are likely to produce a limited set of responses, and recognition will be more successful if questions can be phrased so that the possible responses are not highly confusable. In other words, the speech recognizer's tasks can be designed to achieve acceptable recognition performance, even given weaker acoustic models than desired.
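
To make this concrete, one simple way to screen a candidate response set for confusability is to compare the phoneme strings of the expected responses pairwise. The sketch below is hypothetical and uses plain Levenshtein distance as a crude proxy for acoustic confusability; real systems would use acoustically informed measures.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                        # deletion
                          d[i][j - 1] + 1,                        # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return d[m][n]

def confusable_pairs(pronunciations, min_dist=2):
    """Flag response pairs whose pronunciations differ by < min_dist phonemes."""
    items = list(pronunciations.items())
    return [(r1, r2)
            for i, (r1, p1) in enumerate(items)
            for r2, p2 in items[i + 1:]
            if edit_distance(p1, p2) < min_dist]
```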

More technological solutions include techniques developed recently for recognizing a language using acoustic models built with data wholly or partly from other languages (Schultz & Waibel, 2001). One idea is to build cross-lingual acoustic units, pooling data from several languages with acoustically very similar phones. These units may be adapted with a relatively small amount of data from the target language or used as is, depending on their performance in a target-language evaluation. These models are not ideal for pronunciation feedback.
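
As a minimal illustration of the pooling idea (a sketch, not Schultz and Waibel's actual procedure), training segments from several languages can be grouped under a shared phone label so that one model is estimated from all of them. The languages, file names, and segment format below are invented.

```python
from collections import defaultdict

# (language, phone label, audio file, start sec, end sec) -- illustrative only
segments = [
    ("spanish",     "p", "spa_utt1.wav", 0.12, 0.19),
    ("indonesian",  "p", "ind_utt7.wav", 0.40, 0.46),
    ("target_lang", "p", "tgt_utt2.wav", 0.05, 0.11),
]

pooled = defaultdict(list)
for lang, phone, wav, start, end in segments:
    pooled[phone].append((wav, start, end))
# One cross-lingual model per phone label is then trained from pooled[phone],
# optionally adapted afterward with a small amount of target-language data.
```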

Speech from very young speakers of a language can be simulated by applying vocal tract length mapping to speech collected from adults speaking the desired language (Claes, Dologlou, ten Bosch, & Van Compernolle, 1997). While the recognition performance resulting from the application of this technique is not as high as with acoustic models trained on children's speech, there is a substantial reduction in the recognition error rate relative to using adult models to recognize children's speech.
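
A common formulation of vocal tract length normalization, shown here only as a sketch and not necessarily the transform of Claes et al., applies a piecewise-linear warp to the frequency axis before features are computed:

```python
import numpy as np

def vtln_warp(freqs_hz, alpha, f_cut=4800.0, f_max=8000.0):
    """Piecewise-linear frequency warp: scale by alpha below f_cut, then
    interpolate linearly so that f_max maps to itself. alpha > 1 stretches
    the axis and alpha < 1 compresses it; which direction compensates for
    a shorter (e.g., child) vocal tract depends on where the warp is applied."""
    f = np.asarray(freqs_hz, dtype=float)
    return np.where(
        f <= f_cut,
        alpha * f,
        alpha * f_cut + (f_max - alpha * f_cut) * (f - f_cut) / (f_max - f_cut))
```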

Another approach when few speakers are available is to gather larger amounts of speech from each to train acoustic models. When learners use the software for the first time, they can be prompted to speak a moderate number of known sentences. Speaker adaptation and transformation methods can then be applied to maximize the match between the learner and the acoustic models, resulting in reasonable recognition accuracy on limited tasks (Anastasakos, 1997).
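
The adaptation methods cited are beyond a short example, but even per-speaker feature normalization, a much simpler relative of them, conveys the idea of transforming a new learner's speech toward the space of the trained models. The sketch assumes features arrive as a (frames x coefficients) array.

```python
import numpy as np

def speaker_normalize(features):
    """Cepstral mean and variance normalization over one speaker's
    enrollment utterances: subtract the per-coefficient mean and divide
    by the standard deviation, reducing speaker and channel mismatch."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-8  # avoid division by zero
    return (features - mu) / sigma
```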

Notes from the Field

The five Lakota reservations in South Dakota have some dialect variation, and it would be desirable to record speakers from all areas fairly for best recognition performance. At the same time, personal and extended family relationships are of great importance, and the best way to find speakers interested in and willing to make recordings is through such connections. These connections, however, lead to a sampling of speech communities which is biased in ways that have no technological justification.

Physical conditions can also place constraints on speaker availability. Because few paved roads exist in some areas of the Lakota reservations, travel becomes more difficult in the winter, and it is harder to meet with speakers. There can thus be a correlation between the size of the corpus of speech recordings and the timing of the project. A similar relation obtained in another language when we were advised that data collection could proceed only in the winter because of cultural and subsistence activities in the summer.

REQUIREMENT: PHONOLOGICAL DIVERSITY IN ACOUSTIC DATA

Technological Desiderata

Most state-of-the-art speech recognizers use statistical models representing “phonemes in context” or “triphones,” that is, a model typically represents a given phoneme with a particular left neighbor phoneme and a particular right neighbor phoneme. In most languages, the number of attested triphones is quite large. In addition, data for various allophones induced by context beyond the left and right neighbor phonemes must be collected as well.
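
As an illustration of how quickly triphone types accumulate, the sketch below extracts context triples from phonemically transcribed utterances and tallies their coverage; the boundary marker and the toy corpus are invented.

```python
from collections import Counter

def triphones(phonemes, boundary="#"):
    """Yield (left, center, right) triples from a phoneme sequence,
    padding utterance edges with a boundary symbol."""
    padded = [boundary] + list(phonemes) + [boundary]
    for i in range(1, len(padded) - 1):
        yield (padded[i - 1], padded[i], padded[i + 1])

corpus = [["l", "a", "k", "o", "t", "a"], ["t", "a", "k", "u"]]
counts = Counter(t for utt in corpus for t in triphones(utt))
print(len(counts), "distinct triphones;", counts.most_common(3))
```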

To obtain a database of sufficient phonological diversity in a language such as English in which data collection and processing is relatively inexpensive, a common approach is simply to record as many speakers as possible producing as varied speech as possible. To minimize hesitations, self-repairs, stuttering, and other disfluencies, speakers are often presented with written prompts to read aloud. Ideally, any prompt is read by only one speaker so that no sequence of phonemes is unduly represented to the exclusion of others and the acoustic models are not inappropriately biased. The prompts can be created specifically for this use or taken from written sources chosen for coverage of a large vocabulary. Alternatively, unscripted speech may be collected and transcribed, typically at much greater expense, both because of the transcription time and the need to identify and reliably label nonlexical phenomena and disfluencies. All vocalizations must be transcribed or marked since alignments of transcribed utterances to an audio waveform are typically performed automatically and will be incorrect if part of the voice in the waveform is not reflected in the transcription. In some cases, a portion of recordings may be left untranscribed and unsupervised training of the acoustic models may be performed, as long as all lexical items in the recordings are in the recognizer's phonetic lexicon.

Real-world Limitations

Collecting thousands of lexically diverse sentence texts in electronic form for use as prompts can be a substantial effort. For some languages, including many endangered languages, what materials exist may document noncontemporary usage. If electronic texts are already available, they must be reviewed for a number of characteristics. Among these characteristics is unambiguity of pronunciation; for instance, in an English text, it must be clarified whether “X” will be read as “the tenth,” the name of the letter x, “times,” “by,” and so on. The prompts should be of an easily managed length, contain foreign loanwords only to the extent desired, and not be too heavily skewed in their usage of proper nouns, as newspaper articles or literature will tend to be. Writing prompts specifically for acoustic data collection is time-consuming and expensive, though usually possible in principle. In either case, a high degree of target language competence is needed to work with the texts and assure that they are grammatical and also linguistically and culturally appropriate.

Written prompts are not always effective because the speaker population may not be able to read aloud or may not be comfortable doing so for any of several reasons. If unscripted speech is requested, speakers may feel self-conscious and may not use as diverse a vocabulary as desired. In addition, as noted above, unscripted speech must be carefully monitored or reviewed, as must scripted speech. Even quite fluent speakers are apt to produce false starts, hesitations, slips of the tongue, and other phenomena which render transcription more difficult.

Found speech, if it is from a limited number of sources, may cover a narrow range of topics or genres and overrepresent vocabulary items which are less frequent in everyday usage. This overrepresentation may result in acoustic models which fit the particular allophonic variations in those words very well and in other words not as well; recognition performance could even be harmed relative to a distribution of training words more representative of the vocabulary to be used in the software.

Approaches

One possible approach is to design or select prompt sentences for eliciting speech very carefully, rather than presenting found text without regard to its triphone coverage, so that the sentences collect triphones efficiently. A set of sentences can be chosen for the greatest number of lexical items, the smallest number of repeated lexical items, or for the broadest or flattest distribution of triphones included. If found speech recordings are available, they may be supplemented with targeted elicited data for broader or flatter coverage by analyzing the coverage of the found data. Alternatively, some acoustic data may be discarded or downweighted if certain units are found to be undesirably overrepresented relative to others and relative to their expected usage by learners studying with the software.
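
One way to collect triphones efficiently is a greedy selection over candidate prompt sentences, repeatedly keeping the sentence that adds the most uncovered triphone types. This hypothetical sketch reuses the triphones() helper from the earlier example and treats each candidate as a phoneme sequence.

```python
def select_prompts(candidates, budget):
    """Greedily choose up to `budget` sentences maximizing new triphone types."""
    covered, chosen = set(), []
    pool = list(candidates)
    while pool and len(chosen) < budget:
        best = max(pool, key=lambda s: len(set(triphones(s)) - covered))
        gain = set(triphones(best)) - covered
        if not gain:  # no sentence adds anything new
            break
        covered |= gain
        chosen.append(best)
        pool.remove(best)
    return chosen
```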

In a language-learning application, the vocabulary learners are expected to draw upon is often fairly well known and is further limited when learners are responding to stimuli presented by the software. If the vocabulary that must be recognized is a known set, the required triphones are also known, and the data collection can be targeted to perform well on those words. This approach may result in poorer recognition accuracy on words not in the known vocabulary, but the trade-off may be acceptable for what becomes a somewhat specialized speech recognizer. In the extreme case, eliciting data matched to the expected responses of the language-learning tasks will yield reasonable performance on those expected responses, though not necessarily on responses to future revisions of the tasks or to new tasks.

An orthogonal issue is the visual or aural presentation of prompt sentences. If the speaker population to be recorded is not able to read aloud comfortably, sentences may be prerecorded by a small number of speakers and then a larger pool may repeat prompts presented aurally. Found speech may also serve as prompts for oral repetition. One of many alternative procedures is to assign one or more topics to each speaker to elicit unscripted speech with a broad vocabulary coverage.

Notes from the Field

To collect acoustic data in Lakota, we modified sentences from dictionaries and texts, of which there are a modest number available, though few are truly contemporary. This procedure yielded 2,000-3,000 sentences relatively easily but became increasingly difficult to use beyond that number as plausibly contemporary, short, politically neutral sentences were exhausted. In asking two speakers to model the sentences for oral repetition by others, it became clear that syntactic and lexical differences among the Lakota speech communities were greater than expected; what was an ordinary sentence to some speakers was not acceptable to others. In addition, many of the sentences as originally written were now perceived as archaic or obsolete and were rewritten during data collection. Fluent speakers of Lakota are a scarce resource, and their time had to be focused only on the most critical tasks, not on every task where their help would have been desirable.

The situation in Lakota is slightly complicated by the existence of grammatical correlates of speaker gender. Female and male speakers use different particles for certain functions, for example, to mark questions. The affected prompts to be repeated therefore had to be tracked for presentation only to speakers of the appropriate gender.
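
In practice this tracking can be as simple as tagging each prompt with the speaker gender it presupposes and filtering at presentation time; the record format below is invented for illustration.

```python
# None means the prompt is usable by speakers of any gender.
prompts = [
    {"text": "question ending in a female-speech particle", "gender": "female"},
    {"text": "question ending in a male-speech particle",   "gender": "male"},
    {"text": "gender-neutral statement",                    "gender": None},
]

def prompts_for(speaker_gender):
    """Return only the prompts appropriate for this speaker."""
    return [p for p in prompts if p["gender"] in (None, speaker_gender)]
```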

There are three Indian-run radio stations with some broadcasts in Lakota which could provide some found speech. As broadcasts have rather different acoustic channel characteristics from a close-talking microphone, these data are not ideal for direct use in training acoustic models for a language-learning application. The speech style is also very fast, with some fast-speech phonological phenomena which do not occur in more formal speech and might not be expected to be used by learners. However, this speech can be segmented into short sentences or phrases and presented as models for repetition by speakers who are able to “undo” the fast-speech phenomena.


REQUIREMENT: LEXICON OF PHONETIC FORMS

Technological Desiderata

A speech recognizer can only recognize words it already knows exist. A phone-based recognizer must have phonetic transcriptions for each lexical item, and a lexicon is a convenient place to map between orthographic and phonetic representations; it needs to contain only a phonetic transcription for each item, plus the orthographic form if the two differ. A lexicon does not require glosses or other nonacoustic information. Inflected forms should be included since speech recognizers typically handle them as different words.
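
A minimal lexicon of this kind might look like the following sketch, with inflected forms as separate entries and alternate pronunciations listed per entry; the English words and ARPABET-like notation are used purely for illustration.

```python
# orthography -> list of pronunciations, each a list of phoneme symbols
LEXICON = {
    "walk":   [["w", "ao", "k"]],
    "walks":  [["w", "ao", "k", "s"]],   # inflected form: its own entry
    "walked": [["w", "ao", "k", "t"]],
    "either": [["iy", "dh", "er"],       # dialectal variants of one word
               ["ay", "dh", "er"]],
}
```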

Real-world Limitations

In languages with available electronic text and with word boundaries explicitly marked, a large electronic word list is easy to create. The word list will usually need to be reviewed for misspellings and other errors. However, in some languages word boundaries are not explicitly marked in text, as is the case in Japanese, or the text may not be consistent in whether clitics are written as separate words, as in Lakota and Pashto. For many languages, there is little or no electronic text from which to draw lexical items, or little electronic text containing vocabulary items that language learners are expected to use in the course of study. Morphological variants may need to be added to those found in source texts so that the lexicon covers all the inflected forms desired for each root. For some languages, the orthography does not reflect the phonological structure of the words, and for very few of these languages are lexica with phonetic transcriptions easily available.

In languages which do have one or more phonetic lexica, transcriptions may vary in style and convention and may need to be mapped to the single transcription standard chosen for use in developing the speech recognizer. Defining all the conventions required for consistent phonetic transcription is often not a straightforward task. If several transcription styles are in use, the correspondences between them frequently do not allow a completely automatic mapping from one to another. For example, a transcription of North American English that includes a specific symbol for an alveolar flap cannot be easily mapped into another that uses only /t/ and /d/, nor vice versa if the transcription with only /t/ and /d/ does not explicitly indicate stress.
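
The flap example can be made concrete: mapping a transcription style that distinguishes the flap into one with only /t/ and /d/ requires a decision no symbol table can make, as in this hypothetical sketch (using the ARPABET-style flap symbol "dx").

```python
# One-symbol mapping table between two transcription conventions.
SYMBOL_MAP = {"t": "t", "d": "d", "dx": None}  # flap: underlying /t/ or /d/?

def map_symbol(sym):
    out = SYMBOL_MAP.get(sym, sym)
    if out is None:
        raise ValueError(
            f"{sym!r} has no unique image; lexical knowledge or human "
            "review is needed")
    return out
```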

Approaches

The lexicon can use words found in transcriptions of speech recordings as a base, and lexical items can be added as needed. Language learners, especially learners at less advanced levels, are often expected to use only a rather limited vocabulary, and this vocabulary may sometimes be taken from teaching materials which accompany the software or are used in tandem with it. This approach will serve the purposes of matching the vocabulary to the skill level of the learner, as well as to the style or topic areas of interest. For example, learners studying a language for use in business settings may not be familiar with quite the same vocabulary as those studying the same language for leisure travel. Inflected variants can be entered by hand or, depending on the language, generated automatically with a small amount of programming.

In many cases an automatic mapping can be found between orthographic and phonetic representations, using some form of letter-to-sound rules. This automatic mapping can greatly reduce the human effort required, even if the mapping is not completely correct, by changing the human's task from creating transcriptions to editing and verifying them. However, it can also introduce certain kinds of errors that a human may overlook in reviewing proposed transcriptions, so automatic mappings must be treated with care.
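
Letter-to-sound rules can be as simple as an ordered list of grapheme-to-phoneme substitutions tried longest-first. The toy rules below are invented; real systems need many context-sensitive rules or trained models, with a human pass over the output as recommended above.

```python
# (grapheme, phoneme) pairs, digraphs listed before single letters
RULES = [("sh", "S"), ("ch", "C"), ("a", "a"), ("k", "k"), ("u", "u")]

def letters_to_sounds(word):
    phones, i = [], 0
    while i < len(word):
        for graph, phone in RULES:
            if word.startswith(graph, i):
                phones.append(phone)
                i += len(graph)
                break
        else:                       # no rule matched: pass the letter
            phones.append(word[i])  # through, implicitly flagging it for review
            i += 1
    return phones

print(letters_to_sounds("shaku"))  # ['S', 'a', 'k', 'u']
```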

Notes from the Field

In Pashto, the creation of a lexicon was rendered considerably more onerous by the absence of a standardized written form of the language, as discussed in the next section. A suitable phonemic inventory first had to be derived, requiring difficult decisions about numerous vowels described in the literature as being “of uncertain phonemic status” (Elfenbein, 1997). Since speakers from northern dialect areas have several consonant mergers relative to speakers from southern areas, and some southern speakers have fewer vowels than northern speakers, it was difficult for any individual speaker to create reliable dialect-independent transcriptions alone. Handling clitics and the words they attach to consistently as separate or single lexical items also posed challenges to native speakers working on the project.

REQUIREMENT: ORTHOGRAPHIC AND WRITING CONVENTIONS

Technological Desiderata

A standardized, conventional written representation of a language is taken for granted in many circles. So unquestioned is the presupposition of such a representation that its many ramifications are not even noticed until no representation is available, and the representation is revealed to be a cornerstone of automatic speech and language processing techniques whose removal may cause the edifice to crumble. A written representation is necessary during the building of a speech recognizer and for use internally within an application calling upon that technology. It should be noted that this representation is independent of the design decision made by an educator in choosing whether to present a written form of the language to learners or to expose them only to nontextual representations.

Perhaps the most important use of a written representation is in transcribing speech preparatory to estimating the parameters of statistical acoustic models, or training the models. An acoustic model of a phoneme or triphone is trained from audio segments which are known to contain the phoneme or triphone. Thus, audio files must have corresponding files indicating the content of the audio, and this content must be either in phonetic transcription or directly mappable to a phonetic transcription.

A written representation is also needed for naming acoustic models and for the phonetic lexicon, and it can be an efficient way of eliciting speech for training acoustic models and thus avoiding or minimizing the need for subsequent transcription. In addition, rather large quantities of text are often used to train statistical language models, which assign probabilities to sequences of words according to their likelihood of occurrence and thereby greatly increase recognition accuracy relative to a system that does not make use of word sequence constraints. Such language models are commonly used in large-vocabulary speech recognizers, which allow relatively free, dictation-like input, as might be required for applications targeted to more advanced learners or for practice of oral expressive skills in open tasks.
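
A statistical language model in its simplest form is just relative-frequency estimation over word pairs. The unsmoothed bigram sketch below is far cruder than models used in real large-vocabulary recognizers, but it shows where the word-sequence probabilities come from.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Maximum-likelihood bigram probabilities P(w2 | w1), no smoothing."""
    bigrams, contexts = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]   # sentence-boundary markers
        contexts.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    return {bg: n / contexts[bg[0]] for bg, n in bigrams.items()}

lm = train_bigram_lm([["hello", "world"], ["hello", "there"]])
print(lm[("<s>", "hello")])  # 1.0: every training sentence begins with "hello"
```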

The written representation must reflect phonemic distinctions which a phone-based recognizer is expected to make. In many or perhaps all languages, not all words have the same phonemic representation in all dialects (e.g., in English, “either” can begin with the vowel of “eager” or of “eider,” and “caught” may or may not be homophonous with “cot”). A phonemic inventory must be chosen and decisions made about which pronunciations are predictable from context and which must be explicitly described.

Ideally, it should be easy to convert the written representation into an electronic form, separate the text according to convention into consistent units (e.g., words), and automatically process it in a variety of ways.

Real-world Limitations

Many languages are primarily or essentially wholly unwritten, are read and written by only a fraction of their speakers, or are written but not in any widely accepted standardized form. In this last case, not only may two identical strings of characters represent different words, but two or more different strings of characters may represent the same word. The former phenomenon of homographs is familiar to speakers of major languages, but the latter is new to most technologists. When a single word corresponds to an unknown set of different strings of characters, any automatic processing can be very problematic.
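
When the set of variant spellings for a word can be enumerated, a normalization table applied before any other processing collapses them onto one canonical key; the variants below are illustrative only.

```python
# variant spelling -> canonical form (entries invented for illustration)
CANONICAL = {
    "pashto": "pashto",
    "pakhto": "pashto",
    "pukhto": "pashto",
}

def normalize(token):
    """Map an attested spelling to its canonical lexical key."""
    return CANONICAL.get(token.lower(), token.lower())
```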

There are languages in which orthographic and writing conventions are highly politicized. The politics of writing are evidenced to a small extent in English, where not observing these conventions can be stigmatized as an indicator of carelessness or poor education. Difficulties can arise when there are several competing conventions.

In some cases a written representation exists and fulfills many but not all of the technological requirements. For example, while English writing conventions are highly standardized, the spelling often does not reflect the phonological structure well. In cases such as this, two representations can essentially be used in parallel: one is the conventional representation, and the other maps the conventional units into phonetic transcriptions.


The goals of a written representation for building a speech recognizer, outlined in the section on technological desiderata immediately above, are not entirely the same goals as those of an orthography of a language, which carries cultural and historical as well as synchronic linguistic information. Therefore, an orthography may not meet all the needs of a speech recognizer.

Approaches

The absence of a consistent, readily available orthography poses many extremely difficult problems. A written representation has been attempted in many languages, whether or not it is widely known to speakers of the language, and a phonemic inventory can be derived through standard phonetic fieldwork methods. The inventory may in fact be more approximate than would satisfy theoretical goals since its purpose is purely practical. Acoustic data can be transcribed using a broad phonetic transcription, a previously designed orthography, both, or any modification of either. The most important goal is consistency, and transcribers may be trained to use whatever means allow them to most easily achieve consistent written representations. It should be noted that becoming explicitly aware of the phonological structure of a language is rarely a trivial task and seems to be extremely difficult for some speakers.

In language learning, there is usually a target accent which learners are attempting to emulate in their own speech and which must be represented. The phonemic inventory chosen may thus not necessarily need to represent as general a form of the language as is usually the case in other speech recognition applications, and determining a suitable phonemic inventory may be somewhat simplified.

The problem of collecting data from speakers without written presentation of prompt sentences is relatively easy to solve. If large numbers of speakers who will be recorded are not familiar either with an existing orthography or with one created for use in the speech technology development effort, they may be asked to repeat after oral prompts. These prompts may be prerecorded by a single speaker or a small number of speakers, according to availability. The prerecorded prompts are not sufficient in themselves for training acoustic models, unfortunately, because they display limited speaker diversity. Aural prompts may also be taken from found recordings, which again often display shortcomings making them less than ideal for training acoustic models.

In many language-learning applications, learner input is sufficiently limited in choices that handwritten recognition grammars may substitute for statistical language models and provide the needed constraints for good recognition performance. Thus, explicit human knowledge of the language and the behavior of target learners of the language may be used instead of large quantities of text.
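
At its simplest, such a handwritten grammar is an enumeration of the utterances a task expects, whether written as a plain set or in a grammar format such as JSGF; the task and phrases below are invented.

```python
# Plain-set version of a tiny task grammar
EXPECTED = {"yes", "no", "i would like coffee", "i would like tea"}

def in_grammar(hypothesis):
    """Accept only hypotheses the task grammar allows."""
    return hypothesis.lower() in EXPECTED

# The same constraint in JSGF, a common grammar format for recognizers:
JSGF = """#JSGF V1.0;
grammar drinks;
public <reply> = yes | no | i would like ( coffee | tea );
"""
```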

Notes from the Field

While Pashto has a written form, the literacy rate among Pashto speakers is low. In addition, as Tegey and Robson (1996) have noted,


individual writers vary widely in spelling and punctuation. Words are frequently spelled differently, not only from one writer to another, but often by the same writer, and even within the same document. Even such matters as spacing between words are not consistent.

Advertising for Pashto speakers to transcribe speech recordings yielded responses from a number of people who explained they were unable to write the language, and that, coupled with the expected variability in writing, led us to begin to teach some transcribers to write directly in a phonemic representation. Teaching an explicit awareness of the phonology was more time consuming and difficult than foreseen and was further complicated by the substantial variability among dialects in inventory (Elfenbein, 1997) and resulting disagreements in native speaker perception. Knowledge of written Pashto, which uses an Arabic-based script and does not indicate or disambiguate most vowels, had both a helpful and a harmful influence.

Later, we asked other transcribers to write in native orthography; there, reconciling the different personal conventions presented a challenge. As one example of an unforeseen problem, after a certain amount of work had been completed, one transcriber was found to have been using a single, less frequent character to represent both itself and another, more frequent character because of an inconvenient mapping of characters to keys. A similar substitution in English, for instance writing “k” for both “k” and “p,” would simply not occur to most English speakers; contemporary English speakers do not usually think of spellings as theirs to adapt to each situation. This substitution gives a small idea of the gulf between standardized and nonstandardized writing traditions. Transcription of Pashto has proven to take at least an order of magnitude more time than would be expected for a language with well established conventions.

In a number of languages, two or more orthographies may compete with each other. For example, Serbo-Croatian may be written in either the Roman or Cyrillic alphabets; both are standardized and equally effective at representing the sounds of the language, and both satisfy the technologist's needs. However, the choice between them carries political meaning of which the technologist must be or soon will be aware. The technologist may be forced into taking a political position without having any intention or desire to do so.

SUMMARY

The historical focus of speech technology development on a small number of languages rich in a variety of linguistic resources has had a strong influence on the technical approaches used. As the vast majority of the world's languages do not present equally large stores of resources, electronic or otherwise, other approaches must be explored for these languages. Fortunately, the requirements of a language-learning application are narrower than those of a more general-purpose speech recognizer, and this fact can be exploited to help meet some of the challenges posed by a non-mainstream linguistic context. Technical requirements can be negotiated to a certain extent to match available resources, though the resulting performance must be monitored.

A very close collaboration between the technologist and language expert or teacher is most helpful when the language is not highly familiar to the technologist. A teacher is likely to have reflected more on the language than a native speaker with no special background and to have adopted some standards for her or his own use, in spelling, phonetic transcription or understanding, target accent and pronunciation, and so on. A language expert or teacher also plays a crucial role in explaining the context in which the language is situated to the technologist. It is probably impossible to separate technological issues from political and cultural considerations. Some of the challenges which may be encountered have been mentioned here, and the time and effort required to meet them should not be underestimated.

When the linguistic context departs from the resource-rich norm of “major” languages in several ways, as is often the case in non-mainstream languages, the difficulties compound. Creativity and compromises may be required to achieve the goal of speech technology that supports the teaching and learning of less commonly taught languages.

REFERENCES

Anastasakos, A. (1997). Speaker normalization methods for speaker independent speech recognition. Unpublished doctoral dissertation, Northeastern University.

Claes, T., Dologlou, I., ten Bosch, L., & Van Compernolle, D. (1997). New transformations of cepstral parameters for automatic vocal tract length normalization in speech recognition. In G. Kokkinakis, N. Fakotakis, & E. Dermatas (Eds.), Proceedings of Eurospeech 97 (pp. 1363-1366). Rhodes, Greece.

Elfenbein, J. (1997). Pashto phonology. In A. S. Kaye (Ed.), Phonologies of Asia and Africa, Vol. 2 (pp. 733-760). Winona Lake, IN: Eisenbrauns.

Graff, D. (2001, October). Talk at the Defense Advanced Research Projects Agency (DARPA) Babylon Workshop, Santa Monica, CA.

Grimes, B. F., & Grimes, J. E. (Eds.) (2002). Ethnologue: Languages of the world, 14th edition. Dallas, TX: SIL International.

Schultz, T., & Waibel, A. (2001). Language-independent and language-adaptive acoustic modelling for speech recognition. Speech Communication, 35, 31-51.

Tegey, H., & Robson, B. (1996). A reference grammar of Pashto. Washington, DC: Center for Applied Linguistics.

ACKNOWLEDGMENTS

We thank Raymond J. DeMallie, Douglas Parks, and Kristin Alten of the American Indian Studies Research Institute, Indiana University, Colleen Richey, and our Lakota and Pashto consultants. This work was partially supported by the U.S. Government under SPAWAR Contract N66001-99-D-8504. The views expressed here do not necessarily reflect those of the Government.


AUTHOR'S BIODATA

Kristin Precoda is Director of the Speech Technology and Research Laboratory at SRI International. She holds a Ph.D. in Electrical Engineering, an M.A. in Linguistics, and an M.S. in Statistics and has worked in several areas of speech science and technology. She is currently leading projects in spoken language translation and in speech technology for language learning (www.eduspeak.com).

AUTHOR'S ADDRESS

Kristin Precoda

Speech Technology and Research Laboratory

SRI International

333 Ravenswood Ave. (EJ101)

Menlo Park, CA 94025

Phone: 650/859-2388

Fax: 650/859-5984

Email: precoda@speech.sri.com
