Overview


While state-of-the-art diarization systems perform remarkably well for some domains (e.g., conversational telephone speech such as CallHome), as was discovered at the 2017 JSALT Summer Workshop at CMU, this success does not transfer to more challenging corpora such as child language recordings, clinical interviews, speech in reverberant environments, web video, and “speech in the wild” (e.g., recordings from wearables in an outdoor or restaurant setting). In particular, current approaches:

The goals of the inaugural DIHARD evaluation include:



Task


The goal of the challenge is to automatically detect and label all speaker segments in each audio recording. Pauses of <= 200 ms by a speaker are not considered segmentation breaks and should be bridged into a single continuous segment. Vocal noises other than breaths (e.g., laughter, cough, sneeze, and lip smack) are considered speech for the purpose of this evaluation; all other sounds are considered non-speech. Because system performance is strongly influenced by the quality of the speech segmentation used, two tracks will be supported:

  • Track 1 -- diarization using a provided reference speech segmentation (reference SAD)
  • Track 2 -- diarization from scratch, beginning directly from the audio

Systems submitted to the former track should use the provided reference speech segmentation for each file, which will allow for evaluation of the diarization component in isolation from the SAD component. Systems submitted to the latter track will work directly from the audio. All researchers are strongly encouraged to submit results to at least the first track.
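
As an illustration of the <= 200 ms bridging rule above, the following is a minimal Python sketch; the (speaker, onset, offset) tuple representation and the helper name are assumptions of this example, not part of the challenge tooling.

    # Bridge same-speaker pauses of <= 200 ms into a single continuous segment.
    # Segments are (speaker, onset, offset) tuples in seconds -- an assumed
    # representation for this sketch only.
    MAX_PAUSE = 0.200  # bridging threshold in seconds

    def merge_segments(segments, max_pause=MAX_PAUSE):
        """Return segments with short same-speaker pauses bridged."""
        merged = []
        for spk, onset, offset in sorted(segments, key=lambda s: (s[0], s[1])):
            if merged and merged[-1][0] == spk and onset - merged[-1][2] <= max_pause:
                # Pause is short enough: extend the previous segment.
                merged[-1] = (spk, merged[-1][1], max(offset, merged[-1][2]))
            else:
                merged.append((spk, onset, offset))
        return merged

    # Example: the 100 ms pause between the two "A" segments is bridged.
    print(merge_segments([("A", 0.00, 1.50), ("A", 1.60, 2.40), ("B", 2.50, 4.00)]))
    # -> [('A', 0.0, 2.4), ('B', 2.5, 4.0)]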



Scoring


System output will be scored by comparison to the human reference segmentation, with performance evaluated by two metrics: diarization error rate and mutual information.


1. Diarization error rate

Diarization error rate (DER), introduced for the NIST Rich Transcription Spring 2003 Evaluation (RT-03S), is the total percentage of reference speaker time that is not correctly attributed to a speaker, where “correctly attributed” is defined in terms of an optimal one-to-one mapping between the reference and system speakers. More concretely, DER is defined as:


$$\textrm{DER} = \frac{\textrm{FA} + \textrm{MISS} + \textrm{ERROR}}{\textrm{TOTAL}}$$

where

  • FA -- total system speaker time not attributed to a reference speaker (false alarm speech)
  • MISS -- total reference speaker time not attributed to a system speaker (missed speech)
  • ERROR -- total reference speaker time attributed to the wrong speaker (speaker error)
  • TOTAL -- total reference speaker time; that is, the sum of the durations of all reference speaker segments

Contrary to practice in the NIST evaluations, NO forgiveness collar will be applied to the reference segments prior to scoring, and overlapping speech WILL be evaluated. For more details, please consult section 6 of the RT-09 evaluation plan and the source of the NIST md-eval scoring tool, available as part of the Speech Recognition Scoring Toolkit (SCTK). For DIHARD, we will be using version 22 of md-eval.
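
As a worked example of the DER formula (the durations below are invented purely for illustration):

    # Toy DER computation; all durations are in seconds and invented for this example.
    fa = 2.0       # system speaker time not attributed to a reference speaker
    miss = 3.0     # reference speaker time not attributed to a system speaker
    error = 5.0    # reference speaker time attributed to the wrong speaker
    total = 100.0  # total reference speaker time

    der = (fa + miss + error) / total
    print(f"DER = {100 * der:.2f}%")  # DER = 10.00%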


2. Mutual information

We also approach system evaluation from the standpoint of clustering evaluation, where both the reference and system segmentations are viewed as assignments of labels to frames of speech and a system's score is the mutual information in bits between its labeling and the reference labeling. More concretely, each segmentation will be converted to a sequence of 10 ms frames, each of which is assigned a single label corresponding to one of the following classes:

  • non-speech
  • one class per speaker appearing in the segmentation (speaker 1, speaker 2, ...)

where the sets of speakers are assumed disjoint for any pair of files. The contingency matrix between the reference and system labelings is then built, and from it the mutual information is computed according to:


$$\textrm{MI} = \sum_{i=1}^{R}\sum_{j=1}^{S}\frac{n_{ij}}{N}\log_2{\frac{n_{ij}N}{r_is_j}}$$

where

  • $R$ -- number of classes in the reference labeling
  • $S$ -- number of classes in the system labeling
  • $n_{ij}$ -- number of frames assigned to reference class $i$ and system class $j$
  • $r_i$ -- total number of frames assigned to reference class $i$
  • $s_j$ -- total number of frames assigned to system class $j$
  • $N$ -- total number of frames
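
The following is a minimal NumPy sketch of this computation for two equal-length frame labelings (the toy label sequences and function name are invented for illustration; the official implementation is the dscore tool described below):

    import numpy as np

    def mutual_information(ref_labels, sys_labels):
        """Mutual information in bits between two equal-length frame labelings."""
        ref_labels = np.asarray(ref_labels)
        sys_labels = np.asarray(sys_labels)
        n_total = ref_labels.size
        mi = 0.0
        for i in np.unique(ref_labels):
            for j in np.unique(sys_labels):
                n_ij = np.sum((ref_labels == i) & (sys_labels == j))
                if n_ij == 0:
                    continue  # empty contingency cells contribute nothing
                r_i = np.sum(ref_labels == i)
                s_j = np.sum(sys_labels == j)
                mi += (n_ij / n_total) * np.log2(n_ij * n_total / (r_i * s_j))
        return mi

    # Toy example: ten 10 ms frames labeled "ns" (non-speech) or by speaker.
    ref = ["ns", "ns", "A", "A", "A", "B", "B", "B", "B", "ns"]
    hyp = ["ns", "A",  "A", "A", "A", "B", "B", "A", "B", "ns"]
    print(f"MI = {mutual_information(ref, hyp):.3f} bits")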

3. Scoring regions

The scoring region for each recording is the entirety of the recording; that is, for a recording of duration 405.37 seconds, the scoring region will be [0, 405.37]. These regions are provided to the scoring tool via un-partitioned evaluation map (UEM) files, which are plain-text files containing one scoring region per line, each line consisting of four space-delimited fields:

  • File ID -- file name; basename of the recording minus extension (e.g., “rec1_a”)
  • Channel ID -- channel (1-indexed) that the scoring region is on
  • Onset -- onset of scoring region in seconds from beginning of recording
  • Offset -- offset of scoring region in seconds from beginning of recording
For instance:

    CMU_20020319-1400_d01_NONE 1 125.000000 727.090000
    CMU_20020320-1500_d01_NONE 1 111.700000 615.330000
    ICSI_20010208-1430_d05_NONE 1 97.440000 697.290000

UEM files are provided for both the dev and eval partitions.
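
A minimal sketch of reading a UEM file into per-recording scoring regions (the function name and return structure are assumptions of this example, not part of the official tooling):

    from collections import defaultdict

    def load_uem(path):
        """Parse a UEM file into {file_id: [(channel, onset, offset), ...]}."""
        regions = defaultdict(list)
        with open(path) as f:
            for line in f:
                fields = line.split()
                if not fields:
                    continue  # skip blank lines
                file_id, channel, onset, offset = fields
                regions[file_id].append((int(channel), float(onset), float(offset)))
        return dict(regions)

    # e.g., load_uem("dev.uem")["CMU_20020319-1400_d01_NONE"] -> [(1, 125.0, 727.09)]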

4. Scoring tool

The official scoring tool is maintained as a GitHub repo: https://github.com/nryant/dscore. For results comparable to those obtained during the challenge, please check out the repo at v1.01.

To score a set of system output RTTMs sys1.rttm, sys2.rttm, ... against corresponding reference RTTMs ref1.rttm, ref2.rttm, ... using the un-partitioned evaluation map (UEM) dev.uem, the command line would be:

    $ python score.py -u dev.uem -r ref1.rttm ref2.rttm ... -s sys1.rttm sys2.rttm ...
The overall and per-file results for DER and MI (and many other metrics) will be printed to STDOUT as a table. For additional details about scoring tool usage, please consult the documentation in the GitHub repo.
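
To drive the scorer from a script rather than the shell, one option is to invoke the same command line via subprocess; the sketch below assumes the invocation shown above, with placeholder file names.

    import subprocess

    # Placeholder file names; substitute your own UEM and reference/system RTTMs.
    cmd = ["python", "score.py",
           "-u", "dev.uem",
           "-r", "ref1.rttm", "ref2.rttm",
           "-s", "sys1.rttm", "sys2.rttm"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(result.stdout)  # the metrics table printed by the scoring tool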