Abstract
To find the hyperparameters of a machine learning model, search algorithms such as grid search or random search are run over the space of possible hyperparameter values. These algorithms select the solution that minimizes a given cost function; for language models, perplexity is among the most widely used. In this study, we propose a fractional nonlinear programming model that finds the hyperparameter setting with the optimal perplexity value. The special structure of the model allows us to approximate it by a linear programming model that can be solved with the well-known simplex algorithm. To the best of our knowledge, this is the first attempt in the language modeling literature to use mathematical optimization techniques to minimize perplexity. We apply our model to select the hyperparameters of a language model, compare it against grid search, and show that it attains lower perplexity values. The experiment is performed on a real-world dataset from SwiftKey to validate the proposed approach.
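The abstract does not reproduce the optimization model itself. As background, the standard route from a linear-fractional program to a linear program is the Charnes–Cooper transformation; the sketch below illustrates that classic technique under generic data $(c, d, A, b, \alpha, \beta)$ and is not the authors' exact formulation. Starting from

\[
\min_{x}\ \frac{c^{\top}x + \alpha}{d^{\top}x + \beta}
\quad\text{subject to}\quad Ax \le b,\ \ x \ge 0,
\]

and assuming $d^{\top}x + \beta > 0$ over the feasible region, the substitution $t = 1/(d^{\top}x + \beta)$, $y = t\,x$ yields the equivalent linear program

\[
\min_{y,\,t}\ c^{\top}y + \alpha t
\quad\text{subject to}\quad Ay - bt \le 0,\ \ d^{\top}y + \beta t = 1,\ \ y \ge 0,\ \ t \ge 0,
\]

which the simplex algorithm solves directly; an optimal $x$ is recovered as $y/t$.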
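For comparison, the grid-search baseline mentioned above can be sketched in a few lines of Python. This is a minimal illustration, assuming a held-out perplexity evaluator; the parameter grid and the train_and_score callback are hypothetical placeholders, not the authors' code or data.

import itertools
import math

def perplexity(token_log_probs):
    # Perplexity of held-out text from per-token natural-log probabilities.
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical hyperparameter grid for an n-gram language model.
GRID = {
    "order": [2, 3, 4],           # n-gram order
    "discount": [0.1, 0.5, 0.9],  # smoothing discount
}

def grid_search(train_and_score):
    # Exhaustively evaluate every combination in GRID and keep the
    # hyperparameters with the lowest held-out perplexity.
    best_params, best_ppl = None, float("inf")
    keys = list(GRID)
    for values in itertools.product(*(GRID[k] for k in keys)):
        params = dict(zip(keys, values))
        ppl = train_and_score(params)  # trains a model, returns its perplexity
        if ppl < best_ppl:
            best_params, best_ppl = params, ppl
    return best_params, best_ppl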
Notes
SwiftKey is an input method that uses artificial intelligence techniques to predict the next word the user intends to type. A free download is available at https://swiftkey.com/en.
Acknowledgements
Mehdi Toloo would like to thank the European Social Fund (CZ.1.07/2.3.00/20.0296) and the Czech Science Foundation (GAČR 17-23495S).
Cite this article
Rahnama, A.H.A., Toloo, M. & Zaidenberg, N.J. An LP-based hyperparameter optimization model for language modeling. J Supercomput 74, 2151–2160 (2018). https://doi.org/10.1007/s11227-018-2236-6