Overview


While state-of-the-art diarization systems perform remarkably well for some domains (e.g., conversational telephone speech such as CallHome), as was discovered at the 2017 JSALT Summer Workshop at CMU, this success does not transfer to more challenging corpora such as child language recordings, clinical interviews, speech in reverberant environments, web video, and “speech in the wild” (e.g., recordings from wearables in an outdoor or restaurant setting). In particular, current approaches:

The goals of the inaugural DIHARD evaluation include:



Task


The goal of the challenge is to automatically detect and label all speaker segments in each audio recording. Pauses of <= 200 ms by a speaker are not considered segmentation breaks and should be bridged into a single continuous segment. Vocal noises other than breaths (e.g., laughter, cough, sneeze, and lip smack) are considered speech for the purpose of this evaluation; all other sounds are considered non-speech. Because system performance is strongly influenced by the quality of the speech segmentation used, two tracks will be supported:

  • Track 1 -- diarization using a provided reference speech segmentation (reference SAD)
  • Track 2 -- diarization from scratch, beginning directly from the audio

Systems submitted to the former track should use the provided reference speech segmentation for each file, which will allow for evaluation of the diarization component in isolation from the SAD component. Systems submitted to the latter track will work directly from the audio. All researchers are strongly encouraged to submit results to at least the first track.
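
As an illustration of the <= 200 ms bridging rule above, the following is a minimal Python sketch; the (speaker, onset, offset) tuple representation and the helper name are assumptions of this example, not part of the challenge tooling.

    # Bridge same-speaker pauses of <= 200 ms into a single continuous segment.
    # Segments are (speaker, onset, offset) tuples in seconds -- an assumed
    # representation for this sketch only.
    MAX_PAUSE = 0.200  # bridging threshold in seconds

    def merge_segments(segments, max_pause=MAX_PAUSE):
        """Return segments with short same-speaker pauses bridged."""
        merged = []
        for spk, onset, offset in sorted(segments, key=lambda s: (s[0], s[1])):
            if merged and merged[-1][0] == spk and onset - merged[-1][2] <= max_pause:
                # Pause is short enough: extend the previous segment.
                merged[-1] = (spk, merged[-1][1], max(offset, merged[-1][2]))
            else:
                merged.append((spk, onset, offset))
        return merged

    # Example: the 100 ms pause between the two "A" segments is bridged.
    print(merge_segments([("A", 0.00, 1.50), ("A", 1.60, 2.40), ("B", 2.50, 4.00)]))
    # -> [('A', 0.0, 2.4), ('B', 2.5, 4.0)]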



Scoring


System output will be scored by comparison to the human reference segmentation, with performance evaluated by two metrics: diarization error rate and mutual information.


1. Diarization error rate

Diarization error rate (DER), introduced for the NIST Rich Transcription Spring 2003 Evaluation (RT-03S), is the total percentage of reference speaker time that is not correctly attributed to a speaker, where “correctly attributed” is defined in terms of an optimal one-to-one mapping between the reference and system speakers. More concretely, DER is defined as:


$$\textrm{DER} = \frac{\textrm{FA} + \textrm{MISS} + \textrm{ERROR}}{\textrm{TOTAL}}$$

where

  • FA -- total system speaker time not attributed to a reference speaker (false alarm speech)
  • MISS -- total reference speaker time not attributed to a system speaker (missed speech)
  • ERROR -- total reference speaker time attributed to the wrong speaker (speaker error)
  • TOTAL -- total reference speaker time; that is, the sum of the durations of all reference speaker segments

Contrary to practice in the NIST evaluations, NO forgiveness collar will be applied to the reference segments prior to scoring, and overlapping speech WILL be evaluated. For more details, please consult section 6 of the RT-09 evaluation plan and the source of the NIST md-eval scoring tool, available as part of the Speech Recognition Scoring Toolkit (SCTK). For DIHARD, we will be using version 22 of md-eval.
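
As a worked example of the DER formula (the durations below are invented purely for illustration):

    # Toy DER computation; all durations are in seconds and invented for this example.
    fa = 2.0       # system speaker time not attributed to a reference speaker
    miss = 3.0     # reference speaker time not attributed to a system speaker
    error = 5.0    # reference speaker time attributed to the wrong speaker
    total = 100.0  # total reference speaker time

    der = (fa + miss + error) / total
    print(f"DER = {100 * der:.2f}%")  # DER = 10.00%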


2. Mutual information

We also approach system evaluation from the standpoint of clustering evaluation, where both the reference and system segmentations are viewed as assignments of labels to frames of speech and a system's score is the mutual information in bits between its labeling and the reference labeling. More concretely, each segmentation will be converted to a sequence of 10 ms frames, each of which is assigned a single label corresponding to one of the following classes:

  • non-speech
  • one class per speaker appearing in the segmentation (speaker 1, speaker 2, ...)

where the sets of speakers are assumed disjoint for any pair of files. The contingency matrix between the reference and system labelings is then built, and from it the mutual information is computed according to:


$$\textrm{MI} = \sum_{i=1}^{R}\sum_{j=1}^{S}\frac{n_{ij}}{N}\log_2{\frac{n_{ij}N}{r_is_j}}$$

where

  • $R$ -- number of classes in the reference labeling
  • $S$ -- number of classes in the system labeling
  • $n_{ij}$ -- number of frames assigned to reference class $i$ and system class $j$
  • $r_i$ -- total number of frames assigned to reference class $i$
  • $s_j$ -- total number of frames assigned to system class $j$
  • $N$ -- total number of frames
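
The following is a minimal NumPy sketch of this computation for two equal-length frame labelings (the toy label sequences and function name are invented for illustration; the official implementation is the dscore tool described below):

    import numpy as np

    def mutual_information(ref_labels, sys_labels):
        """Mutual information in bits between two equal-length frame labelings."""
        ref_labels = np.asarray(ref_labels)
        sys_labels = np.asarray(sys_labels)
        n_total = ref_labels.size
        mi = 0.0
        for i in np.unique(ref_labels):
            for j in np.unique(sys_labels):
                n_ij = np.sum((ref_labels == i) & (sys_labels == j))
                if n_ij == 0:
                    continue  # empty contingency cells contribute nothing
                r_i = np.sum(ref_labels == i)
                s_j = np.sum(sys_labels == j)
                mi += (n_ij / n_total) * np.log2(n_ij * n_total / (r_i * s_j))
        return mi

    # Toy example: ten 10 ms frames labeled "ns" (non-speech) or by speaker.
    ref = ["ns", "ns", "A", "A", "A", "B", "B", "B", "B", "ns"]
    hyp = ["ns", "A",  "A", "A", "A", "B", "B", "A", "B", "ns"]
    print(f"MI = {mutual_information(ref, hyp):.3f} bits")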

3. Scoring regions

The scoring region for each recording is the entirety of the recording; that is, for a recording of duration 405.37 seconds, the scoring region will be [0, 405.37]. These regions are provided to the scoring tool via un-partitioned evaluation map (UEM) files, which are plain-text files containing one scoring region per line, each line consisting of four space-delimited fields:

  • File ID -- file name; basename of the recording minus extension (e.g., “rec1_a”)
  • Channel ID -- channel (1-indexed) that the scoring region is on
  • Onset -- onset of scoring region in seconds from beginning of recording
  • Offset -- offset of scoring region in seconds from beginning of recording
For instance:

    CMU_20020319-1400_d01_NONE 1 125.000000 727.090000
    CMU_20020320-1500_d01_NONE 1 111.700000 615.330000
    ICSI_20010208-1430_d05_NONE 1 97.440000 697.290000

UEM files are provided for both the dev and eval partitions.
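
A minimal sketch of reading a UEM file into per-recording scoring regions (the function name and return structure are assumptions of this example, not part of the official tooling):

    from collections import defaultdict

    def load_uem(path):
        """Parse a UEM file into {file_id: [(channel, onset, offset), ...]}."""
        regions = defaultdict(list)
        with open(path) as f:
            for line in f:
                fields = line.split()
                if not fields:
                    continue  # skip blank lines
                file_id, channel, onset, offset = fields
                regions[file_id].append((int(channel), float(onset), float(offset)))
        return dict(regions)

    # e.g., load_uem("dev.uem")["CMU_20020319-1400_d01_NONE"] -> [(1, 125.0, 727.09)]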

4. Scoring tool

The official scoring tool is maintained as a GitHub repo: https://github.com/nryant/dscore. For results comparable to those obtained during the challenge, please check out the repo at v1.01.

To score a set of system output RTTMs sys1.rttm, sys2.rttm, ... against corresponding reference RTTMs ref1.rttm, ref2.rttm, ... using the un-partitioned evaluation map (UEM) dev.uem, the command line would be:

    $ python score.py -u dev.uem -r ref1.rttm ref2.rttm ... -s sys1.rttm sys2.rttm ...
The overall and per-file results for DER and MI (and many other metrics) will be printed to STDOUT as a table. For additional details about scoring tool usage, please consult the documentation in the GitHub repo.
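
To drive the scorer from a script rather than the shell, one option is to invoke the same command line via subprocess; the sketch below assumes the invocation shown above, with placeholder file names.

    import subprocess

    # Placeholder file names; substitute your own UEM and reference/system RTTMs.
    cmd = ["python", "score.py",
           "-u", "dev.uem",
           "-r", "ref1.rttm", "ref2.rttm",
           "-s", "sys1.rttm", "sys2.rttm"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(result.stdout)  # the metrics table printed by the scoring tool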