The 2000 NIST Evaluation Plan for Recognition of Conversational Speech over the Telephone

Version 1.3, 24-Jan-00


Introduction

The 2000 evaluation of conversational speech recognition over the telephone is part of an ongoing series of periodic evaluations conducted by NIST. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of conversational speech recognition. To this end the evaluation was designed to be simple, to focus on core speech technology issues, to be fully supported, and to be accessible.

This year's evaluation is on conversational telephone data in English, Spanish and Mandarin. Word error rate continues to be the primary evaluation metric. New to this evaluation, however, is a phone level recognition component for English, as described below.

The 2000 evaluation will be conducted in March. Data will go out on March 1, with results due back to NIST by March 31. This year the English evaluation will also have a non-competitive diagnostic component for which phone level results are required. The data for this component will be made available to each site shortly after it commits to participation in the evaluation, and output may be returned to NIST as soon as the site has processed this data. A follow-up workshop for participants in this evaluation and the broadcast news evaluation will be held May 16-19 to discuss research findings.

Participation in this evaluation is solicited for all sites that find the task worthy and the evaluation of interest. Sites may choose to participate in the transcription task for any one, any two, or all three of the evaluation languages. For more information, and to register a desire to participate in the evaluation, please contact Dr. Alvin Martin at NIST, alvin.martin@nist.gov. Please note that the commitment deadline for participation is February 15, 2000.


Technical Objective

The Hub-5 evaluation focuses on the task of transcribing conversational speech into text. This task is posed in the context of conversational telephone speech in General American English, in Spanish, and in Mandarin. The evaluation is designed to foster research progress, with the goals of

  1. exploring promising new ideas in the recognition of conversational speech,
  2. developing advanced technology incorporating these ideas, and
  3. measuring the performance of this technology.


The Task

The task is to transcribe conversational speech, which is presented as a set of conversations or parts of conversations collected over the telephone. The speech data is represented as a "4-wire" recording, that is, with two distinct sides, one from each end of the telephone circuit. Each side is recorded and stored as a standard telephone codec signal (8 kHz sampling, 8-bit mu-law encoding).
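As a rough illustration of working with this encoding, the following minimal sketch decodes one conversation side to linear PCM using Python's standard audioop module. It assumes the side has already been extracted as a raw 8-bit mu-law byte stream (the distributed files carry headers that must be handled first), and the file name is hypothetical:

# Minimal sketch: decode one conversation side, assuming a raw
# 8-bit mu-law byte stream (header already stripped).
import audioop  # standard library; deprecated in Python 3.11+

SAMPLE_RATE = 8000  # 8 kHz telephone-codec sampling

with open("en_4289_A.ulaw", "rb") as f:  # hypothetical file name
    mulaw_bytes = f.read()

# Convert 8-bit mu-law samples to 16-bit linear PCM.
pcm = audioop.ulaw2lin(mulaw_bytes, 2)
duration = len(mulaw_bytes) / SAMPLE_RATE  # one byte per sample
print(f"{duration:.2f} seconds of audio on this side")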

The speech data is represented as a sequence of "turns", where each turn is the period of time when one speaker is speaking. Each successive turn results from a reversal of speaking and listening roles for the conversation participants. The transcription task is to produce the correct transcription for each of the specified turns. The beginning and ending times [1] of each of these turns will be supplied as side information to the system under test. This information, stored in a single PEM file [2], will determine the test material.
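For illustration, a PEM record carries the five fields described in footnote [2], one record per turn. The conversation id and times below are hypothetical, echoing the CTM example later in this document:

en_7654 A unknown 11.34 17.70
en_7654 B unknown 1.34 7.20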

As noted, the English evaluation will have a diagnostic component, consisting of individual conversational turns, for which sites are asked to produce transcripts at the phone level, first without and then with advance knowledge of the reference word sequence.

Speech Data

Transcription Conventions

The American Heritage Dictionary (AHD) [3] will serve as the standard reference for word spellings in English. Words that do not occur in the AHD will be spelled using the most common accepted spelling. The official lexicons supplied by the Linguistic Data Consortium will define the standard spellings in Spanish and Mandarin.

Hesitation sounds, referred to as "non-lexemes", will be represented with a leading "%" character. Although these sounds are transcribed in a variety of ways due to highly variable phonetic quality, they are all considered to be functionally equivalent from a linguistic perspective.

Training Data

The following may be used for English language training:

These corpora are available from the LDC. Additional English data may also be used for training, provided that the data are publicly available at the time of reporting results.

In Spanish and Mandarin the following may be used for training:

These corpora are available from the LDC. Additional Spanish and Mandarin data may also be used for training, provided that the data are publicly available at the time of reporting results.

Development Data (the DevSet)

The September 1998 Call_Home English EvalSet and Switchboard-2 Phase-2 EvalSet will serve as the twin English DevSets, each containing 20 conversations. Segment time marks (STM) and corresponding standard normal orthographic representation (SNOR) transcriptions for these data will be provided in standard STM [4] format. The file names for these 40 conversations are as follows:

CALL_HOME ENGLISH

 1. en_4289    6. en_4753   11. en_4902   16. en_6130
 2. en_4290    7. en_4790   12. en_4908   17. en_6165
 3. en_4316    8. en_4802   13. en_4968   18. en_6217
 4. en_4522    9. en_4824   14. en_5020   19. en_6474
 5. en_4726   10. en_4875   15. en_5023   20. en_6880

SWITCHBOARD-2 PHASE-2

 1. sw_20017    6. sw_20663   11. sw_21545   16. sw_22944
 2. sw_20136    7. sw_21004   12. sw_21567   17. sw_23490
 3. sw_20189    8. sw_21060   13. sw_22228   18. sw_23686
 4. sw_20388    9. sw_21498   14. sw_22554   19. sw_24378
 5. sw_20414   10. sw_21538   15. sw_22901   20. sw_24507



In addition, subsets of each of these DevSets are defined for the purpose of facilitating exchange of research results between sites. These consist of the first 30 seconds of speech (within the 5-minute evaluation excerpt) from each conversation side. (The elapsed time will be 30 seconds or more, in order to capture 30 seconds of actual speech. In some cases, the elapsed time may reach the limit of 5 minutes if a speaker mostly listens.) Segment time marks and corresponding transcriptions for these subsets will be provided by NIST in standard STM format.

The 1997 Call_Home Spanish and Mandarin EvalSets (20 conversations in each language) will serve as the DevSets for the respective languages. Segment time marks (STM) and corresponding SNOR transcriptions for these data will be provided in standard STM [5] format.

Evaluation Data (the EvalSets)

Evaluation data is provided separately for each of the three languages. Each EvalSet is described below.

English Evaluation Data:

The English evaluation data will consist of three distinct sections. The first two are similar to the data in previous evaluations, while the third is the data for the non-competitive diagnostic component of the evaluation.

EvalSet-1

The first section will contain 20 conversations from the Call_Home corpus.

EvalSet-2

The second section will contain 20 conversations that were collected for the Switchboard Corpus but not included in the original release. Most of the speakers in these conversations, however, also appear in the released Switchboard Corpus.

EvalSet-3

The third section may contain one or more whole conversations, along with hundreds of separate turns drawn from conversations in the released Switchboard Corpus. These turns will be provided as separate speech files. The total speech duration of this data will be approximately one hour. All of this data has been or will be chosen and phonetically transcribed by ICSI [6].

It should be noted that all of this data is contained in the data available for training. As noted, this part of the English evaluation is included for diagnostic, rather than competitive, purposes.

The English evaluation data will be provided on two separate CD-ROMs. One will contain the 40 conversations of the first two sections in their entirety, but systems will be tested only on the 5-minute excerpt from each conversation chosen and transcribed by the LDC. The other CD-ROM will contain the data for the diagnostic part of the evaluation. This CD-ROM will also contain the reference transcripts for the data included.

Speaker turn segmentation information for all of the EvalSet will be supplied to guide the recognition system. This segmentation information will be supplied in NIST's PEM file format.

While the data on the CD with 40 whole conversations will come from two different sources, the identity of the source is not to be provided to the system under test. The system must either recognize the speech irrespective of the source or must automatically determine the source from examination of the speech signal.

Spanish Evaluation Data:

The Spanish EvalSet will consist of 20 conversations from the Call_Home corpus and will be distributed on a separate evaluation CD-ROM. Speaker turn segmentation information will be supplied to guide the recognition system. This segmentation information will be supplied in NIST's PEM file format.

Mandarin Evaluation Data:

The Mandarin EvalSet will consist of 20 conversations from the Call_Home corpus and will be distributed on a separate evaluation CD-ROM. Speaker turn segmentation information will be supplied to guide the recognition system. This segmentation information will be supplied in NIST's PEM file format.

The Evaluation

Each system will be evaluated by measuring that system's word error rate (WER). (For Mandarin the measure will be character error rate (CER), and all references to words in what follows should be taken as referring to characters.) Each system will also be evaluated in terms of its ability to predict recognition errors. System performance will be evaluated over an ensemble of conversations and parts of conversations, separately for each language. The content of each of these sources is being chosen, to the extent possible, to represent a statistical sampling of conditions of evaluation interest, including sex, geographical distribution, and age. The Switchboard data marked at the phone level is also being chosen to represent a variety of phonetic and prosodic features of interest, but NIST will not publish word error rate results for this data.

The Reference Transcription

The reference transcriptions are intended to be as accurate as possible, but there will necessarily be some ambiguous cases and outright errors. In view of the existing high error rates of automatic recognizers on this type of data, it is not considered cost effective to generate multiple independent human transcriptions of the data or to have a formal adjudication procedure following the evaluation submissions.

The reference transcription for each turn will be limited to a single sequence of words. This word sequence will represent the transcriber's best judgment of what the speaker said.

Word fragments will be represented by an initial part of a word with a hyphen at the end. Correct recognition will consist of either ignoring the fragment or producing a word of which the fragment is an initial part. (For example, a reference fragment "tele-" is correctly matched by the hypothesis word "telephone", or by no word at all.)

The reference transcription will contain no hyphenated words. Each hyphenated word will be separated into its constituent words.

The WER (CER for Mandarin) Metric

Word error rate is defined as the total number of words in error divided by the number of words in the reference transcription. The words in error are of three types: substitution errors, deletion errors, and insertion errors. Identification of these errors results from the process of aligning the words in the reference transcription with the words in the system output transcription. This alignment is performed using NIST's SCLITE software package [7].


Scoring will be performed by aligning the system output transcription with the reference transcription and then computing the word error rate. Alignment will be performed independently for each turn. The system output transcription will be processed to match the form of the reference transcription. Hyphenated words will be separated into their constituent words.
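To make the metric concrete, here is a minimal sketch (ours, not the SCLITE implementation) that aligns a single turn by dynamic programming and computes WER:

# Minimal WER sketch for one turn: Levenshtein alignment of the
# hypothesis against the reference, counting substitutions,
# deletions, and insertions. Illustrative only; official scoring
# uses NIST's SCLITE package.

def wer(ref, hyp):
    """ref, hyp: lists of words. Returns total errors / len(ref)."""
    # d[i][j] = minimum edit cost aligning ref[:i] with hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = i                        # i deletions
    for j in range(1, len(hyp) + 1):
        d[0][j] = j                        # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,             # substitution or match
                          d[i - 1][j] + 1,     # deletion
                          d[i][j - 1] + 1)     # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("i can add as well".split(), "i can at as well as".split()))
# one substitution + one insertion over five reference words -> 0.4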

A few variant spellings of the same word exist in the English transcriptions. These words, with or without hyphens, will be mapped onto a single preferred word spelling without hyphens. The set of all such mappings is:

   Input     Output
   mhm       uhhuh
   mmhm      uhhuh
   mm-hm     uhhuh
   mm-huh    uhhuh
   huh-uh    uhuh

For scoring purposes, all hesitation sounds will be considered to be equivalent. Thus all reference transcription words beginning with "%", the hesitation sound flag, along with the conventional set of hesitation sounds, will be mapped to "%hesitation".

The system output transcriptions should use any of the hesitation sounds (without "%") when a hesitation is hypothesized. The set of English hesitation sounds for the current evaluation is defined to be:

"uh", "um", "eh", "mm", "hm", "ah", "huh", "ha", "er", "oof", "hee", "ach", "eee" and "ew".

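For illustration, the normalization described above could be sketched as follows; the function and variable names are ours, not part of NIST's scoring tools:

# Sketch of the scoring normalization described above.
VARIANT_MAP = {"mhm": "uhhuh", "mmhm": "uhhuh", "mm-hm": "uhhuh",
               "mm-huh": "uhhuh", "huh-uh": "uhuh"}
HESITATIONS = {"uh", "um", "eh", "mm", "hm", "ah", "huh", "ha",
               "er", "oof", "hee", "ach", "eee", "ew"}

def normalize(token):
    token = VARIANT_MAP.get(token, token)
    # Reference non-lexemes carry a leading "%"; system output uses
    # the bare hesitation words. Both map to the same class.
    if token.startswith("%") or token in HESITATIONS:
        return "%hesitation"
    return token

print([normalize(t) for t in ["mm-hm", "%um", "er", "can"]])
# ['uhhuh', '%hesitation', '%hesitation', 'can']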
The sets of hesitation sounds for Spanish and Mandarin may be found at the 1997 evaluation website:

ftp://jaguar.ncsl.nist.gov/evaluations/hub5ne/sept97/datafiles/current_datafiles.htm.

As noted previously, for Mandarin character error rate will be used in place of word error rate. Furthermore, confidence scores, as discussed below, will be applied at the character level. If the system output gives confidences only at the word level, the word level values will be automatically imputed to the characters making up the word.
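As a small illustration of that imputation rule (a sketch; the function name is ours):

def impute_to_characters(word, confidence):
    """Split a word-level token into per-character records, each
    inheriting the word's confidence score."""
    return [(ch, confidence) for ch in word]

print(impute_to_characters("你好", 0.87))
# [('你', 0.87), ('好', 0.87)]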

The Phone Level Transcription

The diagnostic part of the evaluation will use data from the released Switchboard Corpus for which phone level transcriptions were generated previously at ICSI [6], plus some additional data that is being phonetically and prosodically transcribed specifically for this evaluation. Phone level submissions from participating sites will allow analysis after the evaluation of how various types of phonetic and prosodic events may affect recognizer performance.

Systems are asked for two types of output from this data. The first is output from ordinary one-best word level recognition that includes phones and their time marks in addition to the standard word level output. The second is phone level output resulting from forced recognition using the reference word transcripts provided with the data. The phone level output may be in the format suggested in the Submission of Results section below, but may be in an alternative documented format if that is more convenient for the participating site.

For this part of the evaluation, sites should use a system as similar as reasonably possible to the one used for the rest of the English evaluation, although some system differences are acceptable. Since there will be only a limited amount of data for each speaker in this part of the evaluation, sites may, for example, choose to forgo the speaker normalization or adaptation used elsewhere in the evaluation.

As noted previously, this part of the evaluation is intended for diagnostic and not competitive purposes. Participating sites should send to NIST, along with their statement of intention to participate in the English evaluation, a full definition of their phone sets to be used in recognition. NIST will then send them the single CD-ROM with the data for this part of the evaluation and the corresponding reference transcripts. Sites are then encouraged to return to NIST the two types of output transcriptions for this data as soon as possible.

The Confidence Measure

Along with each word output by a system, a confidence score is also required. This confidence score is the system's estimate of the probability (in the range [0,1]) that the word (or character for Mandarin) is correct. While this might be merely a constant probability, independent of the input, certain applications and operating conditions may derive significant benefit from a more informative estimate that is sensitive to the input signal. This benefit will be evaluated by computing a normalized cross entropy (NCE) measure consisting of the mutual information (cross entropy) between the correctness of the system's output word and the confidence score output for it, normalized by maximum cross entropy:

NCE = ( H_max + Σ_{correct w} log2 p(w) + Σ_{incorrect w} log2 (1 − p(w)) ) / H_max

where

  n     = the total number of words output by the system,
  n_c   = the number of those output words that are correct,
  p_c   = n_c / n, the average probability that an output word is correct,
  H_max = −n_c · log2(p_c) − (n − n_c) · log2(1 − p_c), the maximum cross entropy, and
  p(w)  = the confidence score the system assigned to output word w.
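A minimal sketch of this computation (illustrative only; the names are ours):

from math import log2

def nce(scores, correct):
    """scores: confidence p(w) for each output word, in (0, 1);
    correct: matching booleans, True where the word was correct."""
    n = len(scores)
    n_c = sum(correct)
    p_c = n_c / n
    # Maximum cross entropy, achieved by the constant score p_c.
    h_max = -(n_c * log2(p_c) + (n - n_c) * log2(1 - p_c))
    h_sys = sum(log2(p) if ok else log2(1 - p)
                for p, ok in zip(scores, correct))
    return (h_max + h_sys) / h_max

# Scores that track correctness come out above zero; the constant
# score p_c gives exactly zero; misleading scores go negative.
print(nce([0.9, 0.8, 0.2], [True, True, False]))   # about  0.71
print(nce([0.2, 0.3, 0.9], [True, True, False]))   # about -1.68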
In addition, as in the past, NIST will use the likelihood scores along with the hypothesized words to create a DET (Detection Error Tradeoff) type curve for each set of results submitted. This will be a plot of the probability of rejecting correctly recognized words (misses) versus the probability of accepting incorrectly recognized words (false alarms) as the likelihood score is varied and used as a threshold for determining whether hypothesized words should be included.

For the phone level output, sites may provide confidence scores for each phone. Sites may, if they prefer, provide phone level likelihood scores that are not intended as probabilities and are not constrained to the range [0,1], but which do order the degree of certainty of correct phone recognition. Meaningful scores at the phone level may assist in the diagnostic analysis to be made of these submissions.

Submission of Results

Initial results on the phonetically transcribed data must be submitted to NIST by 7:00 PM EST on March 15, 2000. All other results must be submitted by 7:00 PM EST on March 31, 2000. Sites should submit results using the following steps:

    1. system output file creation,
    2. directory structure creation,
    3. system documentation, including execution times, and system output inclusion, and
    4. transmission of the results to NIST.


Step 1: System output file creation

The time-marked hypothesis tokens for each test will be placed in a single file, called "<TEST_SET>.ctm", where <TEST_SET> is the base name of the associated PEM file. (For the diagnostic task, there will be a word-level CTM file as well as a phone-level CTM file.) The CTM (Conversation Time-Mark) file format is a concatenation of time marks for each hypothesized token in each side of a conversation. Each hypothesized token must have a conversation id, channel identifier [A | B], start time, duration, case-insensitive text, and a confidence score. The start time must be in seconds and relative to the beginning of the waveform file. The conversation ids for this evaluation will be of the form:

CONV_ID ::= <SWB2_ID> | <CALLHOME_ID> | <SWITCHBOARD_TYPE_ID>

where,

SWB2_ID ::= sw_DDDDD (where DDDDD is a five digit conversation code)
CALLHOME_ID ::= en_DDDD (where DDDD is a four digit conversation code)
SWITCHBOARD_TYPE_ID ::= swDDDD (where DDDD is a four digit conversation code)

For the Mandarin evaluation, sites that choose to supply confidence scores at the character level must create a separate CTM record for each character. Otherwise, confidence scores for multi-character words will be imputed to all characters.

The file must be sorted by the contents of the first three columns: the first and the second in ASCII order, the third in numeric order. The UNIX sort command: "sort +0 -1 +1 -2 +2nb -3" will sort the words into appropriate order.
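(With modern implementations of sort, which use the '-k' key syntax in place of the obsolete '+pos' notation, the equivalent invocation should be "sort -k1,1 -k2,2 -k3,3nb".)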

Lines beginning with ';;' are considered comments and are ignored. Blank lines are also ignored.

Included below is an example:

;;
;; Comments follow ';;'
;;
;; Blank lines are ignored

;;
en_7654 A 11.34 0.2 YES -6.763
en_7654 A 12.00 0.34 YOU -12.384530
en_7654 A 13.30 0.5 CAN 2.806418
en_7654 A 17.50 0.2 AS 0.537922
:
en_7654 B 1.34 0.2 I -6.763
en_7654 B 2.00 0.34 CAN -12.384530
en_7654 B 3.40 0.5 ADD 2.806418
en_7654 B 7.00 0.2 AS 0.537922
:

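For sites writing their own tooling, a minimal reader for this format might look like the following sketch (illustrative only; the ':' lines in the example above are elision marks, not CTM records, and are skipped here):

# Sketch of reading CTM records; fields follow the format
# description above, but this parser is not an official tool.

def read_ctm(path):
    """Yield (conv_id, channel, start, duration, token, confidence)."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";;"):
                continue  # blank lines and ';;' comments are ignored
            parts = line.split()
            if len(parts) < 6:
                continue  # e.g. the ':' elision marks in the example
            conv, chan, start, dur, token, conf = parts[:6]
            yield conv, chan, float(start), float(dur), token, float(conf)

for record in read_ctm("test_set.ctm"):  # hypothetical file name
    print(record)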

Step 2: Directory Structure Creation

Create a directory identifying your site ('SITE'), for example 'bbn' or 'sri', which will serve as the root directory for all your submissions.

You should place all of your recognition test results in this directory. When scored results are sent back to you and subsequently published, this directory name will be used to identify your organization.

For each test system, create a sub-directory under your 'SITE' directory identifying the system's name or key attribute. The sub-directory name is to consist of a free-form system identification string 'SYSID' chosen by you. Place all files pertaining to the tests run using a particular system in the same SYSID directory.

The following is the BNF directory structure format for Hub-5 hypothesis recognition results:

<SITE>/<SYSID>/<FILES>

where

SITE ::= bbn | dragon | ibm | sri | . . .
SYSID ::= (short system description ID, preferably <= 8 characters)
FILES ::=

sys-desc.txt :: system description, including a reference to a paper if applicable

<TEST_SET>.ctm :: file containing the time-marked hypothesis words

where

TEST_SET ::= base name of the corresponding PEM file.

Step 3: System Documentation, including execution times, and System Output Inclusion

For each test you run and for each system evaluated, a brief description of the system (the algorithms) used to produce the results must be provided along with the results. (It is permissible for a single site to submit multiple systems for evaluation. In this case, however, the submitting site must identify one system as the "primary" system prior to performing the evaluation.)

The format for the system description is as follows:

SITE/SYSTEM NAME
TEST DESIGNATION

      1. Primary Test System Description:
      2. Acoustic Training:
      3. Grammar Training:
      4. Recognition Lexicon Description:
      5. Differences for Each Contrastive Test: (if any contrastive tests were run)
      6. New Conditions for This Evaluation:
      7. Execution Time: Sites must report the CPU execution time that was required to process the test data, as if the test were run on a single CPU. Sites must also describe the CPU and the amount of memory used.
      8. References:

Step 4: Test Results Submission Protocol

Once you have structured all of your recognition results according to the above format, you can then submit them to NIST. Test sites may submit results using either e-mail or anonymous ftp. Continental US sites may use either method, but, due to international e-mail file size restrictions, international sites must use the ftp method. The following instructions assume that you are using the UNIX operating system. If you do not have access to UNIX utilities or ftp, please contact NIST to make alternate arrangements.

E-mail method:

First change directory to the directory immediately above the <SITE> directory. Next, type the following:

tar -cvf - ./<SITE> | compress | uuencode <SITE>-<SUBM_ID>.tar.Z | \
mail -s "March 2000 Hub-5 test results <SITE>-<SUBM_ID>" \
alvin.martin@nist.gov

where

<SITE> is the name of the directory created in Step 2 to identify your site.

<SUBM_ID> is the submission number (e.g. your first submission would be numbered '1', your second, '2', etc.)

Ftp method:

First change directory to the directory immediately above the <SITE> directory. Next, type the following command.

tar -cvf - ./<SITE> | compress > <SITE>-<SUBM_ID>.tar.Z

where

<SITE> is the name of the directory created in Step 2 to identify your site. <SUBM_ID> is the submission number (e.g. your first submission would be numbered '1', your second, '2', etc.)

This command creates a single file containing all of your results. Next, ftp to jaguar.ncsl.nist.gov giving the username 'anonymous' and your e-mail address as the password. After you are logged in, issue the following set of commands (the prompt will be 'ftp'):
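A typical anonymous-ftp upload sequence looks like the following (the 'incoming' target directory is an assumption; confirm the correct directory with NIST):

ftp> binary
ftp> cd incoming
ftp> put <SITE>-<SUBM_ID>.tar.Z
ftp> quit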

You've now submitted your recognition results to NIST. The last thing you need to do is send an e-mail message to Alvin Martin at 'alvin.martin@nist.gov' notifying NIST of your submission. Please include the name of your submission file in the message.

Note:

If you choose to submit your results in multiple shipments, please submit ONLY one set of results for a given test system/condition unless you have made other arrangements with NIST; otherwise, NIST will programmatically ignore duplicate files.

Schedule

Commitment Deadline     February 15, 2000
EvalSet Release         March 1, 2000
Phonetic Results Due    March 15, 2000, 7:00 PM EST (midnight GMT)
Other Results Due       March 31, 2000, 7:00 PM EST (midnight GMT)
Results Release         April 14, 2000
Workshop                May 16-19, 2000
                        University of Maryland University College
                        College Park, Maryland


[1] These turn time marks will be specified in seconds (to the nearest millisecond) and will completely encompass the turn. Thus alternate turns will overlap if the speakers talk over each other.

[2] The PEM ("partitioned evaluation map") file format is given in the SCLITE documentation available through NIST's web page (http://www.nist.gov/speech/software.htm). Each record contains 5 fields: <filename>, <channel ("A" or "B")>, <speaker ("unknown")>, <begin time>, and <end time>.

[3] The American Heritage Dictionary of the English Language, Book and CD-ROM. Published October 1994 by Houghton Mifflin. ISBN 0395711460.

[4], [5] STM stands for "segment time marked". The STM file identifies time intervals along with the transcription for those intervals. At the time this document was prepared, the STM file format is documented in NIST's SCLITE scoring software distribution, available via NIST's web page (http://www.nist.gov/speech/software.htm).

[6] See http://www.icsi.berkeley.edu/real/stp/index.html.

[7] SCLITE software is available via NIST's web page (http://www.nist.gov/speech/software.htm).