Abstract
Sequence labeling has wide applications in natural language processing and speech processing. Popular sequence labeling models suffer from known problems: hidden Markov models (HMMs) are generative models and cannot encode transition features; conditional Markov models (CMMs) suffer from the label bias problem; and training conditional random fields (CRFs) can be expensive. In this paper, we propose Linear Co-occurrence Rate Networks (L-CRNs) for sequence labeling, which avoid these problems. The factors of L-CRNs can be locally normalized and trained separately, which leads to a simple and efficient training method. Experimental results on real-world natural language processing datasets show that L-CRNs reduce training time by orders of magnitude while achieving results very competitive with CRFs.
Our C++ implementation of L-CRNs and the datasets used in this paper can be found at https://github.com/zheminzhu/Co-occurrence-Rate-Networks.
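As an illustration of the separate, closed-form training that makes L-CRNs fast (see the appendix), the following minimal C++ sketch estimates the unigram factors \(p\) and pairwise factors \(\mathrm{CR}\) as relative frequencies over a toy tagged corpus. This is not the released implementation (which trains a regression model per factor); the toy data and tag names are hypothetical.

```cpp
// Minimal sketch of closed-form (counting) L-CRN training, per the
// appendix: the MLEs of p and CR are relative frequencies. Illustrative
// only; the training data and tags below are hypothetical.
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Toy training data: sequences of (word, tag) pairs.
    std::vector<std::vector<std::pair<std::string, std::string>>> data = {
        {{"the", "DT"}, {"dog", "NN"}, {"barks", "VBZ"}},
        {{"a", "DT"}, {"cat", "NN"}, {"sleeps", "VBZ"}},
    };

    std::map<std::string, double> uni;                         // n(a)
    std::map<std::pair<std::string, std::string>, double> bi;  // n(a,b)
    double nUni = 0, nBi = 0;

    // Separate training = one pass of counting; no global normalization.
    for (const auto& seq : data) {
        for (size_t i = 0; i < seq.size(); ++i) {
            uni[seq[i].second] += 1; ++nUni;
            if (i + 1 < seq.size()) {
                bi[{seq[i].second, seq[i + 1].second}] += 1; ++nBi;
            }
        }
    }

    // p(a) and CR(a;b) = p(a,b) / (p(a) p(b)) as relative frequencies.
    auto p = [&](const std::string& a) { return uni[a] / nUni; };
    auto cr = [&](const std::string& a, const std::string& b) {
        return (bi[{a, b}] / nBi) / (p(a) * p(b));
    };

    std::printf("p(NN)=%.3f CR(DT;NN)=%.3f CR(DT;VBZ)=%.3f\n",
                p("NN"), cr("DT", "NN"), cr("DT", "VBZ"));
    return 0;
}
```

Because each factor is estimated independently, training is a single counting pass over the data, which is where the speedup over globally normalized CRF training comes from.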
Notes
- 1.
Another popular model is the structured (structural) SVM [1], which essentially applies factorization to kernels. Because it lacks a direct probabilistic interpretation, we leave it for future work.
- 2.
In this extreme case, the entropy of \(p(s_{i+1}|s_{i})\) reaches its minimum of 0.
- 3.
HMMs do not suffer from the label bias problem, because the factors \(p(o_i|s_i)\) in Eq. 1 guarantee that the observation evidence is always used (see the sketch of Eq. 1 after these notes).
- 4.
Sometimes they are intuitively explained as the compatibility of the nodes in cliques, but the notion of compatibility has no mathematical definition.
- 6.
L-CRNs can be easily parallelized: each regression model can be trained in parallel with the others.
- 8.
Known words are words that appear in the training data; unknown words are words not seen in the training data. "All words" includes both.
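For reference, a sketch of the HMM factorization that note 3 relies on, assuming Eq. 1 is the standard HMM joint (with \(s_0\) a designated start state):

\[
p(S,O)=\prod_{i=1}^{n} p(s_i\mid s_{i-1})\,p(o_i\mid s_i).
\]

Because each position contributes an observation factor \(p(o_i\mid s_i)\), the observation evidence enters the score at every step, no matter how peaked the transition distribution \(p(s_i\mid s_{i-1})\) becomes.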
References
Altun, Y., Smola, A.J., Hofmann, T.: Exponential families for conditional random fields. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI ’04, pp. 2–9. AUAI Press (2004)
Berg-Kirkpatrick, T., Bouchard-Côté, A., DeNero, J., Klein, D.: Painless unsupervised learning with features. In: NAACL, HLT ’10, pp. 582–590 (2010)
Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Comput. Linguist. 16(1), 22–29 (1990)
Cohn, T.A.: Scaling conditional random fields for natural language processing. Ph.D. thesis, University of Melbourne (2007)
Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
Ghahramani, Z.: An introduction to hidden Markov models and Bayesian networks. In: Juang, B.H. (ed.) Hidden Markov Models, pp. 9–42. World Scientific Publishing, Adelaide (2002)
Hammersley, J.M., Clifford, P.E.: Markov random fields on finite graphs and lattices. Unpublished manuscript (1971)
Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European Conference on Machine Learning, ECML ’98, pp. 137–142 (1998)
Klein, D., Manning, C.D.: Conditional structure versus conditional estimation in NLP models. In: EMNLP ’02, pp. 9–16 (2002). http://dx.doi.org/10.3115/1118693.1118695
Kudo, T.: CRF++: yet another CRF toolkit, free software, March 2012. http://crfpp.googlecode.com/svn/trunk/doc/index.html
Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: ICML ’01, pp. 282–289 (2001)
Le-Hong, P., Phan, X.H., Tran, T.T.: On the effect of the label bias problem in part-of-speech tagging. In: The 10th IEEE RIVF International Conference on Computing and Communication Technologies, pp. 103–108 (2013)
McCallum, A., Freitag, D., Pereira, F.C.N.: Maximum entropy Markov models for information extraction and segmentation. In: ICML ’00, pp. 591–598 (2000)
Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)
Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Stat. Comput. 14(3), 199–222 (2004)
Sutton, C., McCallum, A.: An introduction to conditional random fields. Found. Trends Mach. Learn. 4(4), 267–373 (2012)
Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of CoNLL-2003, Edmonton, Canada (2003)
Zhu, Z., Hiemstra, D., Apers, P.M.G., Wombacher, A.: Separate training for conditional random fields using co-occurrence rate factorization. Technical report TR-CTIT-12-29, Centre for Telematics and Information Technology, University of Twente, Enschede, October 2012
Zhu, Z., Hiemstra, D., Apers, P.M.G., Wombacher, A.: Empirical co-occurrence rate networks for sequence labeling. In: ICMLA 2013, Miami Beach, FL, USA, December 2013, pp. 93–98 (2013)
Zhu, Z.: Factorizing probabilistic graphical models using cooccurrence rate, August 2010. arXiv:1008.1566v1
Zhu, Z., Hiemstra, D., Apers, P., Wombacher, A.: Comparison of local and global undirected graphical models. In: The 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 479–484 (2014)
Acknowledgments
We thank the SLSP 2014 reviewers for their comments. This work has been supported by the Dutch national program COMMIT/.
6 Appendix
6.1 Closed-Form MLE Training of L-CRN
We maximize the log-likelihood of Eq. 3 over the training dataset \(D\), with \(\mathrm {CR}\) and \(p\) as parameters:
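A sketch of the objective, assuming Eq. 3 is the linear-chain CRN factorization \(p(S|O)=\prod_{i=1}^{n-1}\mathrm{CR}(s_i;s_{i+1}|O)\prod_{j=1}^{n}p(s_j|O)\) (the notation here is reconstructed and may differ from the original):

\[
\ell(\mathrm{CR},p)=\sum_{(S,O)\in D}\left[\sum_{i=1}^{n-1}\log \mathrm{CR}(s_i;s_{i+1}\mid O)+\sum_{j=1}^{n}\log p(s_j\mid O)\right],
\]

subject to local normalization (equality) constraints and the non-negativity (inequality) constraints \(\mathrm{CR}\ge 0\) and \(p\ge 0\).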
First we ignore the last two non-negativity (inequality) constraints. Using Lagrange multipliers, we obtain the unconstrained objective function:
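A sketch of the resulting Lagrangian, under the simplifying assumptions that the conditioning on \(O\) is dropped for readability, that \(\mathrm{CR}\) is reparameterized through the pairwise joint \(q(a,b)\) so that \(\mathrm{CR}(a;b)=q(a,b)/(p(a)\,p(b))\), and that \(n_a\) and \(n_{ab}\) denote the unigram and bigram label counts in \(D\):

\[
\Lambda=\sum_{a} n_a\log p(a)+\sum_{a,b} n_{ab}\log q(a,b)+\lambda\Big(1-\sum_a p(a)\Big)+\mu\Big(1-\sum_{a,b} q(a,b)\Big).
\]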
Calculating the first derivative with respect to each parameter and setting it to zero, we obtain the closed-form MLE for \(\mathrm {CR}\) and \(p\):
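A sketch under the same assumptions as above: setting \(\partial\Lambda/\partial p(a)=n_a/p(a)-\lambda=0\) and \(\partial\Lambda/\partial q(a,b)=n_{ab}/q(a,b)-\mu=0\), then enforcing the equality constraints, gives \(\lambda=N=\sum_a n_a\) and \(\mu=M=\sum_{a,b} n_{ab}\), hence

\[
\hat p(a)=\frac{n_a}{N},\qquad \widehat{\mathrm{CR}}(a;b)=\frac{\hat q(a,b)}{\hat p(a)\,\hat p(b)}=\frac{n_{ab}/M}{(n_a/N)(n_b/N)}.
\]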
That is, the MLEs of \(p\) and \(\mathrm {CR}\) are just their relative frequencies in the training dataset. Fortunately, the non-negativity constraints that were ignored during optimization are automatically satisfied.
6.2 Theorems of Co-occurrence Rate
Proof of Partition Operation
Proof
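A reconstruction of the argument, assuming the partition operation is the identity \(\mathrm{CR}(X;Y;Z)=\mathrm{CR}(X;YZ)\,\mathrm{CR}(Y;Z)\), where \(\mathrm{CR}(X_1;\ldots;X_n)=\frac{p(X_1,\ldots,X_n)}{p(X_1)\cdots p(X_n)}\):

\[
\mathrm{CR}(X;YZ)\,\mathrm{CR}(Y;Z)=\frac{p(X,Y,Z)}{p(X)\,p(Y,Z)}\cdot\frac{p(Y,Z)}{p(Y)\,p(Z)}=\frac{p(X,Y,Z)}{p(X)\,p(Y)\,p(Z)}=\mathrm{CR}(X;Y;Z).
\]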
Proof of Reduce Operation
Proof
Since \(X\perp \!\!\!\perp Y \mathbin {|} Z\), we have \(p(X,Y|Z)=p(X|Z)\,p(Y|Z)\), and thus \(p(X,Y,Z)=\frac{p(X,Z)\,p(Y,Z)}{p(Z)}\). Hence,

\[
\mathrm{CR}(X;YZ)=\frac{p(X,Y,Z)}{p(X)\,p(Y,Z)}=\frac{p(X,Z)\,p(Y,Z)}{p(Z)\,p(X)\,p(Y,Z)}=\frac{p(X,Z)}{p(X)\,p(Z)}=\mathrm{CR}(X;Z).
\]