Author: SRI International

  • Pushing the Envelope — Aside

    Despite successes, there are still significant limitations to speech recognition performance. For this reason, authors have proposed methods that incorporate different (and larger) analysis windows, which are described in this article.

  • Spoken Language Understanding

    SLU systems contain an automatic speech recognition (ASR) component and must be robust to noise due to the spontaneous nature of spoken language and the errors introduced by ASR. SLU systems must perform text segmentation and understanding at the same time.

  • Does Active Learning Help Automatic Dialog Act Tagging in Meeting Data?

    We ask whether active learning with lexical cues can help with dialog act (DA) tagging in the meeting domain. To better address this question, we explore active learning for two different types of DA models: hidden Markov models (HMMs) and maximum entropy (Maxent) models.

  • Comparing HMM, Maximum Entropy, and Conditional Random Fields for Disfluency Detection

    We compare a generative hidden Markov model (HMM)-based approach and two conditional models — a maximum entropy (Maxent) model and a conditional random field (CRF) — for detecting disfluencies in speech. The conditional modeling approaches provide a more principled way to model correlated features.

  • Improved Discriminative Training Using Phone Lattices

    We present an efficient discriminative training procedure utilizing phone lattices. Different approaches to expediting lattice generation, statistics collection, and convergence were studied.

  • Two Experiments Comparing Reading with Listening for Human Processing of Conversational Telephone Speech

    We report on results of two experiments designed to compare subjects’ ability to extract information from audio recordings of conversational telephone speech (CTS) with their ability to extract information from text transcripts of these conversations, with and without the ability to hear the audio recordings.

  • Class-dependent Score Combination for Speaker Recognition

    In this work, we present a class-based score combination technique that relies on clustering of both the target models and the test utterances in a vector space defined by a set of speaker-specific transformation parameters estimated during transcription of the talker.

  • MLLR Transforms as Features in Speaker Recognition

    We explore the use of adaptation transforms employed in speech recognition systems as features for speaker recognition. This approach is attractive because, unlike standard frame-based cepstral speaker recognition models, it normalizes for the choice of spoken words in text-independent speaker verification.

  • A Robust Method for Tracking Scene Text in Video Imagery

    We describe an approach that tracks planar regions of scene text that can undergo arbitrary 3-D rigid motion and scale changes. Our approach computes homographies on blocks of contiguous frames simultaneously using a combination of factorization and robust statistical methods.

  • Evidence-Centered Assessment Design: Layers, Structures, and Terminology (PADI Technical Report 9)

    This presentation provides an overview of ECD, highlighting the ideas of layers in the process, structures and representations within layers, and terms and concepts that can be used to guide the design of assessments of practically all types. Examples are drawn from the Principled Assessment Designs for Inquiry (PADI) project.

  • Task Templates Based on Misconception Research (PADI Technical Report 6)

    This paper reports one such effort, motivated by assessments that elicit students’ qualitative explanations of situations that have been designed to provoke misconceptions and partial understandings. We describe four task-specific templates we created: three based on Hestenes, Wells, and Swackhamer’s Force Concept Inventory and one based on Novick and Nussbaum’s Test about Particles in a Gas.

  • Identifying and Segmenting Human-Motion for Mobile Robot Navigation Using Alignment Errors

    This paper presents a new human-motion identification and segmentation algorithm for moving cameras. The algorithm is based on the alignment error between pairs of moving object images. Pairs of object images generating relatively small alignment errors are used to estimate the fundamental frequency of the object motion.