Efficient data selection for machine translation


A. Mandal et al., “Efficient data selection for machine translation,” in Proc. 2008 IEEE Workshop on Spoken Language Technology (SLT 2008), pp. 261–264.


The performance of statistical machine translation (SMT) systems relies on the availability of a large parallel corpus, which is used to estimate translation probabilities. However, generating such a corpus is a long and expensive process. In this paper, we introduce two methods for efficiently selecting training data to be translated by humans. Our methods are motivated by active learning and aim to choose new data that adds maximal information to the currently available data pool. The first method uses a measure of disagreement between multiple SMT systems, whereas the second uses a perplexity criterion. We performed experiments on Chinese-English data across multiple domains and test sets. Our results show that selecting only one-fifth of the additional training data achieves similar or better translation performance than using all of the available data.
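
To make the perplexity criterion concrete, the following is a minimal sketch and not the paper's implementation. It assumes a unigram language model with add-one smoothing trained on the source side of the current pool, and it assumes that the candidates with the highest perplexity are the ones worth sending for human translation. The abstract specifies neither the model nor the ranking direction, and the function names (`train_unigram`, `perplexity`, `select_by_perplexity`) are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count whitespace tokens in the current pool.

    Returns (counts, total_tokens, vocab_size).
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values()), len(counts)


def perplexity(sentence, model):
    """Per-token perplexity of a sentence under the unigram model."""
    counts, total, vocab = model
    toks = sentence.split()
    if not toks:
        return 0.0
    # Add-one smoothing; the extra +1 in the denominator reserves
    # probability mass for a single "unseen token" class.
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab + 1)) for t in toks
    )
    return math.exp(-log_prob / len(toks))


def select_by_perplexity(pool, candidates, fraction=0.2):
    """Pick the given fraction of candidates least predicted by the pool.

    Assumption: high perplexity under the pool model signals new
    information, so those sentences are ranked first.
    """
    model = train_unigram(pool)
    ranked = sorted(
        candidates, key=lambda s: perplexity(s, model), reverse=True
    )
    k = max(1, round(len(candidates) * fraction))
    return ranked[:k]


pool = ["the cat sat", "the dog sat"]
candidates = [
    "the cat sat",
    "the dog sat",
    "the cat ran",
    "quantum flux capacitor",
    "a dog sat",
]
selected = select_by_perplexity(pool, candidates, fraction=0.2)
```

The default `fraction=0.2` mirrors the one-fifth selection rate reported in the abstract. A realistic setup would likely replace the unigram model with a higher-order n-gram model and a proper tokenizer for Chinese text.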

Index Terms— machine translation, data selection
