Data

This first iteration of the DIHARD challenge draws data from multiple novel sources, including previously unexposed data and data originally developed for purposes other than diarization. For some sources, annotations have been converted from existing formats; for others, they have been created anew.

1. Training data

DIHARD participants may use any data to train their system, whether publicly available or not, with the exception of the following previously released LDC corpora, from which portions of the evaluation set are drawn:

Portions of Mixer 6 have previously been excerpted for use in the NIST SRE10 and SRE12 evaluation sets, which therefore also may not be used.

All training data should be thoroughly documented in the system description submitted at the end of the challenge. Please also see the list of suggested training corpora under Data Resources for Training below.


2. Development data

The development set is available from LDC as LDC2019S09 and LDC2019S10. Speech samples are distributed alongside the reference diarization and speech segmentation and may be used for any purpose, including system development and training. The samples consist of approximately 19 hours of 5-10 minute chunks drawn from the following domains:

All samples are distributed as 16 kHz, mono-channel FLAC files.
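As a quick sanity check on downloaded data, the snippet below is a minimal sketch of loading one sample and verifying the stated format. It assumes the Python soundfile package; the file name is purely illustrative.

    import soundfile as sf

    # Load one sample and verify the advertised 16 kHz, single-channel format.
    # "DH_0001.flac" is a hypothetical file name; substitute a real path.
    audio, sample_rate = sf.read("DH_0001.flac")
    assert sample_rate == 16000, "expected 16 kHz audio"
    assert audio.ndim == 1, "expected a single (mono) channel"
    print(f"Loaded {len(audio) / sample_rate:.1f} seconds of audio")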


3. Evaluation data

The evaluation set is available from LDC as LDC2019S12 and LDC2019S13. It consists of approximately 21 hours of 5-10 minute speech samples drawn from the same domains and sources as the development set, with the following exceptions:

The domain from which each sample is drawn was not provided during the evaluation period.


4. Segmentation

Where transcription existed and forced alignment was feasible, initial segment boundaries were produced by refining the human-marked boundaries with forced alignment: turn-initial and turn-final silence was trimmed, and turns were split on pauses greater than 200 ms in duration. For a given speaker, a pause is defined as any interval in which that speaker is not producing a vocalization; this includes breaths, but not coughs, laughs, or lipsmacks. Non-speech vocal noises that could not be accurately assigned to a speaker during annotation have been omitted. Ideally, this segmentation was then checked and corrected by human annotators using a tool equipped with a spectrogram display. Where forced alignment was not possible, manually assigned segment boundaries were used. The reference speech activity detection (SAD) segmentation was then derived from the diarization speaker-segment boundaries by merging overlapping segments and removing speaker identification.
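The final merging step is mechanical; the sketch below illustrates one way to collapse per-speaker turns into speech-activity segments. The function and variable names are illustrative and are not part of any released DIHARD tooling.

    # Collapse (onset, offset, speaker) diarization turns into merged
    # (onset, offset) speech-activity segments by dropping speaker labels
    # and merging overlapping intervals.
    def turns_to_sad(turns):
        intervals = sorted((onset, offset) for onset, offset, _speaker in turns)
        merged = []
        for onset, offset in intervals:
            if merged and onset <= merged[-1][1]:  # overlaps the previous segment
                merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
            else:
                merged.append((onset, offset))
        return merged

    print(turns_to_sad([(0.5, 2.0, "A"), (1.5, 3.0, "B"), (5.0, 6.0, "A")]))
    # [(0.5, 3.0), (5.0, 6.0)]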

Because this was an unfunded pilot project, created under time pressure by volunteers, the full three-step workflow (transcription, alignment, checking and correction by human annotators) could not be implemented for all sources. The situation for each source is as follows:


5. File formats

For each recording, speech segmentation is provided via an HTK label file listing one segment per line, each line consisting of three space-delimited fields:

  • segment onset – onset of the segment, in seconds from the beginning of the recording

  • segment offset – offset of the segment, in seconds from the beginning of the recording

  • segment label – always "speech"

For example:


    0.10  1.41  speech
    1.98  3.44  speech
    5.0   7.52  speech
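Such files are straightforward to parse; the sketch below (Python, with a hypothetical file name) reads a label file into a list of (onset, offset, label) tuples.

    # Read an HTK-style label file: one segment per line, with onset and
    # offset in seconds followed by the label "speech".
    def read_label_file(path):
        segments = []
        with open(path) as f:
            for line in f:
                if not line.strip():
                    continue
                onset, offset, label = line.split()
                segments.append((float(onset), float(offset), label))
        return segments

    for onset, offset, label in read_label_file("DH_0001.lab"):
        print(f"{label}: {onset:.2f}-{offset:.2f} s")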

Following prior NIST RT evaluations, diarization for recordings is provided using Rich Transcription Time Marked (RTTM) files. RTTM files are space-delimited text files containing one turn per line, each line consisting of ten fields:

  • Type – segment type; always "SPEAKER"

  • File ID – basename of the recording, minus extension

  • Channel ID – channel (1-indexed) on which the turn occurs; always "1"

  • Turn Onset – onset of the turn, in seconds from the beginning of the recording

  • Turn Duration – duration of the turn, in seconds

  • Orthography Field – always "<NA>"

  • Speaker Type – always "<NA>"

  • Speaker Name – name of the speaker; unique within the scope of each file

  • Confidence Score – system confidence that the information is correct; always "<NA>"

  • Signal Lookahead Time – always "<NA>"

For instance:

    SPEAKER CMU_20020319-1400_d01_NONE 1 130.430000 2.350 <NA> <NA> juliet <NA> <NA>
    SPEAKER CMU_20020319-1400_d01_NONE 1 157.610000 3.060 <NA> <NA> tbc <NA> <NA>
    SPEAKER CMU_20020319-1400_d01_NONE 1 130.490000 0.450 <NA> <NA> chek <NA> <NA>
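The sketch below reads such a file, keeping only the fields most systems need (file ID, speaker name, onset, duration). Field positions follow the ten-column layout above; the file name is hypothetical.

    # Read SPEAKER turns from an RTTM file into
    # (file id, speaker, onset, duration) tuples.
    def read_rttm(path):
        turns = []
        with open(path) as f:
            for line in f:
                fields = line.split()
                if not fields or fields[0] != "SPEAKER":
                    continue
                file_id, onset, duration, speaker = fields[1], fields[3], fields[4], fields[7]
                turns.append((file_id, speaker, float(onset), float(duration)))
        return turns

    for file_id, speaker, onset, duration in read_rttm("ref.rttm"):
        print(f"{file_id} {speaker}: {onset:.3f}-{onset + duration:.3f} s")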

Data Resources for Training

This is a non-exhaustive list of publicly available corpora that may be suitable for system training.


Corpora containing meeting speech

LDC corpora

  • ICSI Meeting Speech (LDC2004S02)

  • ICSI Meeting Transcripts (LDC2004T04)

  • ISL Meeting Speech Part 1 (LDC2004S05)

  • ISL Meeting Transcripts Part 1 (LDC2004T10)

  • NIST Meeting Pilot Corpus Speech (LDC2004S09)

  • NIST Meeting Pilot Corpus Transcripts and Metadata (LDC2004T13)

  • 2004 Spring NIST Rich Transcription (RT-04S) Development Data (LDC2007S11)

  • 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data (LDC2007S12)

  • 2006 NIST Spoken Term Detection Development Set (LDC2011S02)

  • 2006 NIST Spoken Term Detection Evaluation Set (LDC2011S03)

  • 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set (LDC2011S06)

Non-LDC corpora



Conversational telephone speech (CTS) corpora

LDC corpora

  • CALLHOME Mandarin Chinese Speech (LDC96S34)

  • CALLHOME Spanish Speech (LDC96S35)

  • CALLHOME Japanese Speech (LDC96S37)

  • CALLHOME Mandarin Chinese Transcripts (LDC96T16)

  • CALLHOME Spanish Transcripts (LDC96T17)

  • CALLHOME Japanese Transcripts (LDC96T18)

  • CALLHOME American English Speech (LDC97S42)

  • CALLHOME German Speech (LDC97S43)

  • CALLHOME Egyptian Arabic Speech (LDC97S45)

  • CALLHOME American English Transcripts (LDC97T14)

  • CALLHOME German Transcripts (LDC97T15)

  • CALLHOME Egyptian Arabic Transcripts (LDC97T19)

  • CALLHOME Egyptian Arabic Speech Supplement (LDC2002S37)

  • CALLHOME Egyptian Arabic Transcripts Supplement (LDC2002T38)

  • Switchboard-1 Release 2 (LDC97S62)

  • Fisher English Training Speech Part 1 Speech (LDC2004S13)

  • Fisher English Training Speech Part 1 Transcripts (LDC2004T19)

  • Arabic CTS Levantine Fisher Training Data Set 3, Speech (LDC2005S07)

  • Fisher English Training Part 2, Speech (LDC2005S13)

  • Arabic CTS Levantine Fisher Training Data Set 3, Transcripts (LDC2005T03)

  • Fisher English Training Part 2, Transcripts (LDC2005T19)

  • Fisher Levantine Arabic Conversational Telephone Speech (LDC2007S02)

  • Fisher Levantine Arabic Conversational Telephone Speech, Transcripts (LDC2007T04)

  • Fisher Spanish Speech (LDC2010S01)

  • Fisher Spanish - Transcripts (LDC2010T04)


Other corpora

LDC corpora

  • Speech in Noisy Environments (SPINE) Training Audio (LDC2000S87)

  • Speech in Noisy Environments (SPINE) Evaluation Audio (LDC2000S96)

  • Speech in Noisy Environments (SPINE) Training Transcripts (LDC2000T49)

  • Speech in Noisy Environments (SPINE) Evaluation Transcripts (LDC2000T54)

  • Speech in Noisy Environments (SPINE2) Part 1 Audio (LDC2001S04)

  • Speech in Noisy Environments (SPINE2) Part 2 Audio (LDC2001S06)

  • Speech in Noisy Environments (SPINE2) Part 3 Audio (LDC2001S08)

  • Speech in Noisy Environments (SPINE2) Part 1 Transcripts (LDC2001T05)

  • Speech in Noisy Environments (SPINE2) Part 2 Transcripts (LDC2001T07)

  • Speech in Noisy Environments (SPINE2) Part 3 Transcripts (LDC2001T09)

  • Santa Barbara Corpus of Spoken American English Part I (LDC2000S85)

  • Santa Barbara Corpus of Spoken American English Part II (LDC2003S06)

  • Santa Barbara Corpus of Spoken American English Part III (LDC2004S10)

  • Santa Barbara Corpus of Spoken American English Part IV (LDC2005S25)

  • HAVIC Pilot Transcription (LDC2016V01)

Non-LDC corpora






System Descriptions

Proper interpretation of the evaluation results requires thorough documentation of each system. Consequently, at the end of the evaluation researchers must submit a full description of their system with sufficient detail for a fellow researcher to understand the approach and data/computational requirements. An acceptable system description should include the following information:

  • Abstract

  • Data resources

  • Detailed description of algorithm

  • Hardware requirements

Section 1: Abstract

A short (a few sentences) high-level description of the system.


Section 2: Data resources

This section should describe the data used for training, including both volumes and sources. For LDC or ELRA corpora, catalog IDs should be supplied. For other publicly available corpora (e.g., AMI), a link should be provided. If a corpus that is not publicly available is used, it should be described in sufficient detail to convey its composition. If the system is composed of multiple components and different components are trained with different resources, describe which resources were used for which components.


Section 3: Detailed description of algorithm

Each component of the system should be described in sufficient detail that another researcher could reimplement it. You may be brief about, or omit entirely, descriptions of standard components (e.g., there is no need to list the equations underlying an LSTM or GRU). If hyperparameter tuning was performed, include a detailed description of both the tuning process and the final hyperparameter values.

We suggest including a subsection for each major phase of the system, for example:

  • signal processing – e.g., signal enhancement, denoising, source separation

  • acoustic features – e.g., MFCCs, PLPs, mel filterbank, PNCCs, RASTA, pitch extraction

  • speech activity detection details – relevant for Track 2 only

  • segment representation – e.g., i-vectors, d-vectors

  • speaker estimation – how the number of speakers was estimated, if such estimation was performed

  • clustering method – e.g., k-means, agglomerative

  • resegmentation details

Section 4: Hardware requirements

System developers should report the hardware requirements for both training and test time:

  • Total number of CPU cores used

  • Description of CPUs used (model, speed, number of cores)

  • Total number of GPUs used

  • Description of GPUs used (model, single-precision TFLOPS, memory)

  • Total available RAM

  • Disk storage used

  • Machine learning frameworks used (e.g., PyTorch, TensorFlow, CNTK)

System execution time to process a single 10-minute recording must also be reported.

