Abstract
Sequence labeling has wide applications in natural language processing and speech processing. Popular sequence labeling models suffer from known problems: hidden Markov models (HMMs) are generative models and cannot encode transition features; conditional Markov models (CMMs) suffer from the label bias problem; and training conditional random fields (CRFs) can be expensive. In this paper, we propose Linear Co-occurrence Rate Networks (L-CRNs) for sequence labeling, which avoid these problems. The factors of L-CRNs can be locally normalized and trained separately, which leads to a simple and efficient training method. Experimental results on real-world natural language processing datasets show that L-CRNs reduce training time by orders of magnitude while achieving results very competitive with CRFs.
Our C++ implementation of L-CRNs and the datasets used in this paper can be found at https://github.com/zheminzhu/Co-occurrence-Rate-Networks.
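As an illustration of the separate, closed-form training that makes L-CRNs fast (see the appendix), the following minimal C++ sketch estimates the unigram factors \(p\) and pairwise factors \(\mathrm{CR}\) as relative frequencies over a toy tagged corpus. This is not the released implementation (which trains a regression model per factor); the toy data and tag names are hypothetical.

```cpp
// Minimal sketch of closed-form (counting) L-CRN training, per the
// appendix: the MLEs of p and CR are relative frequencies. Illustrative
// only; the training data and tags below are hypothetical.
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Toy training data: sequences of (word, tag) pairs.
    std::vector<std::vector<std::pair<std::string, std::string>>> data = {
        {{"the", "DT"}, {"dog", "NN"}, {"barks", "VBZ"}},
        {{"a", "DT"}, {"cat", "NN"}, {"sleeps", "VBZ"}},
    };

    std::map<std::string, double> uni;                         // n(a)
    std::map<std::pair<std::string, std::string>, double> bi;  // n(a,b)
    double nUni = 0, nBi = 0;

    // Separate training = one pass of counting; no global normalization.
    for (const auto& seq : data) {
        for (size_t i = 0; i < seq.size(); ++i) {
            uni[seq[i].second] += 1; ++nUni;
            if (i + 1 < seq.size()) {
                bi[{seq[i].second, seq[i + 1].second}] += 1; ++nBi;
            }
        }
    }

    // p(a) and CR(a;b) = p(a,b) / (p(a) p(b)) as relative frequencies.
    auto p = [&](const std::string& a) { return uni[a] / nUni; };
    auto cr = [&](const std::string& a, const std::string& b) {
        return (bi[{a, b}] / nBi) / (p(a) * p(b));
    };

    std::printf("p(NN)=%.3f CR(DT;NN)=%.3f CR(DT;VBZ)=%.3f\n",
                p("NN"), cr("DT", "NN"), cr("DT", "VBZ"));
    return 0;
}
```

Because each factor is estimated independently, training is a single counting pass over the data, which is where the speedup over globally normalized CRF training comes from.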
Notes
- 1.
Another popular model is the structured (structural) SVM [1], which essentially applies factorization to kernels. Because it lacks a direct probabilistic interpretation, we leave it for future work.
- 2.
In this extreme case, the entropy of \(p(s_{i+1}|s_{i})\) reaches its minimum of 0.
- 3.
HMMs do not suffer from the label bias problem, because the factors \(p(o_i|s_i)\) in Eq. 1 guarantee that the observation evidence is always used (see the sketch of Eq. 1 after these notes).
- 4.
Sometimes they are intuitively explained as the compatibility of the nodes in cliques, but the notion of compatibility has no mathematical definition.
- 6.
L-CRNs can be easily parallelized: each regression model can be trained in parallel with the others.
- 8.
Known words are words that appear in the training data; unknown words are words not seen in the training data. "All words" includes both.
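For reference, a sketch of the HMM factorization that note 3 relies on, assuming Eq. 1 is the standard HMM joint (with \(s_0\) a designated start state):

\[
p(S,O)=\prod_{i=1}^{n} p(s_i\mid s_{i-1})\,p(o_i\mid s_i).
\]

Because each position contributes an observation factor \(p(o_i\mid s_i)\), the observation evidence enters the score at every step, no matter how peaked the transition distribution \(p(s_i\mid s_{i-1})\) becomes.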
References
Altun, Y., Smola, A.J., Hofmann, T.: Exponential families for conditional random fields. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI ’04, pp. 2–9. AUAI Press (2004)
Berg-Kirkpatrick, T., Bouchard-Côté, A., DeNero, J., Klein, D.: Painless unsupervised learning with features. In: NAACL, HLT ’10, pp. 582–590 (2010)
Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Comput. Linguist. 16(1), 22–29 (1990)
Cohn, T.A.: Scaling conditional random fields for natural language processing. Ph.D. thesis, University of Melbourne (2007)
Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
Ghahramani, Z.: An introduction to hidden Markov models and Bayesian networks. In: Juang, B.H. (ed.) Hidden Markov Models, pp. 9–42. World Scientific Publishing, Adelaide (2002)
Hammersley, J.M., Clifford, P.E.: Markov random fields on finite graphs and lattices. Unpublished manuscript (1971)
Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European Conference on Machine Learning, ECML ’98, pp. 137–142 (1998)
Klein, D., Manning, C.D.: Conditional structure versus conditional estimation in NLP models. In: EMNLP ’02, pp. 9–16 (2002). http://dx.doi.org/10.3115/1118693.1118695
Kudo, T.: CRF++: yet another CRF toolkit, free software, March 2012. http://crfpp.googlecode.com/svn/trunk/doc/index.html
Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: ICML ’01, pp. 282–289 (2001)
Le-Hong, P., Phan, X.H., Tran, T.T.: On the effect of the label bias problem in part-of-speech tagging. In: The 10th IEEE RIVF International Conference on Computing and Communication Technologies, pp. 103–108 (2013)
McCallum, A., Freitag, D., Pereira, F.C.N.: Maximum entropy Markov models for information extraction and segmentation. In: ICML ’00, pp. 591–598 (2000)
Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)
Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Stat. Comput. 14(3), 199–222 (2004)
Sutton, C., McCallum, A.: An introduction to conditional random fields. Found. Trends Mach. Learn. 4(4), 267–373 (2012)
Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of CoNLL-2003, Edmonton, Canada (2003)
Zhu, Z., Hiemstra, D., Apers, P.M.G., Wombacher, A.: Separate training for conditional random fields using co-occurrence rate factorization. Technical report TR-CTIT-12-29, Centre for Telematics and Information Technology, University of Twente, Enschede, October 2012
Zhu, Z., Hiemstra, D., Apers, P.M.G., Wombacher, A.: Empirical co-occurrence rate networks for sequence labeling. In: ICMLA 2013, Miami Beach, FL, USA, December 2013, pp. 93–98 (2013)
Zhu, Z.: Factorizing probabilistic graphical models using cooccurrence rate, August 2010. arXiv:1008.1566v1
Zhu, Z., Hiemstra, D., Apers, P., Wombacher, A.: Comparison of local and global undirected graphical models. In: The 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 479–484 (2014)
Acknowledgments
We thank the SLSP 2014 reviewers for their comments. This work has been supported by the Dutch national program COMMIT/.
6 Appendix
6.1 Closed-Form MLE Training of L-CRN
We maximize the log-likelihood of Eq. 3 over the training dataset \(D\), with \(\mathrm {CR}\) and \(p\) as parameters:
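A sketch of the objective, assuming Eq. 3 is the linear-chain CRN factorization \(p(S|O)=\prod_{i=1}^{n-1}\mathrm{CR}(s_i;s_{i+1}|O)\prod_{j=1}^{n}p(s_j|O)\) (the notation here is reconstructed and may differ from the original):

\[
\ell(\mathrm{CR},p)=\sum_{(S,O)\in D}\left[\sum_{i=1}^{n-1}\log \mathrm{CR}(s_i;s_{i+1}\mid O)+\sum_{j=1}^{n}\log p(s_j\mid O)\right],
\]

subject to local normalization (equality) constraints and the non-negativity (inequality) constraints \(\mathrm{CR}\ge 0\) and \(p\ge 0\).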
First we ignore the last two non-negativity (inequality) constraints. Using Lagrange multipliers, we obtain the unconstrained objective function:
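A sketch of the resulting Lagrangian, under the simplifying assumptions that the conditioning on \(O\) is dropped for readability, that \(\mathrm{CR}\) is reparameterized through the pairwise joint \(q(a,b)\) so that \(\mathrm{CR}(a;b)=q(a,b)/(p(a)\,p(b))\), and that \(n_a\) and \(n_{ab}\) denote the unigram and bigram label counts in \(D\):

\[
\Lambda=\sum_{a} n_a\log p(a)+\sum_{a,b} n_{ab}\log q(a,b)+\lambda\Big(1-\sum_a p(a)\Big)+\mu\Big(1-\sum_{a,b} q(a,b)\Big).
\]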
Calculating the first derivative with respect to each parameter and setting it to zero, we obtain the closed-form MLE for \(\mathrm {CR}\) and \(p\):
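A sketch under the same assumptions as above: setting \(\partial\Lambda/\partial p(a)=n_a/p(a)-\lambda=0\) and \(\partial\Lambda/\partial q(a,b)=n_{ab}/q(a,b)-\mu=0\), then enforcing the equality constraints, gives \(\lambda=N=\sum_a n_a\) and \(\mu=M=\sum_{a,b} n_{ab}\), hence

\[
\hat p(a)=\frac{n_a}{N},\qquad \widehat{\mathrm{CR}}(a;b)=\frac{\hat q(a,b)}{\hat p(a)\,\hat p(b)}=\frac{n_{ab}/M}{(n_a/N)(n_b/N)}.
\]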
That is, the MLEs of \(p\) and \(\mathrm {CR}\) are just their relative frequencies in the training dataset. Fortunately, the non-negativity constraints that were ignored during optimization are automatically satisfied.
6.2 Theorems of Co-occurrence Rate
Proof of Partition Operation
Proof
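A reconstruction of the argument, assuming the partition operation is the identity \(\mathrm{CR}(X;Y;Z)=\mathrm{CR}(X;YZ)\,\mathrm{CR}(Y;Z)\), where \(\mathrm{CR}(X_1;\ldots;X_n)=\frac{p(X_1,\ldots,X_n)}{p(X_1)\cdots p(X_n)}\):

\[
\mathrm{CR}(X;YZ)\,\mathrm{CR}(Y;Z)=\frac{p(X,Y,Z)}{p(X)\,p(Y,Z)}\cdot\frac{p(Y,Z)}{p(Y)\,p(Z)}=\frac{p(X,Y,Z)}{p(X)\,p(Y)\,p(Z)}=\mathrm{CR}(X;Y;Z).
\]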
Proof of Reduce Operation
Proof
Since \(X\perp \!\!\!\perp Y \mathbin {|} Z\), we have \(p(X,Y|Z)=p(X|Z)\,p(Y|Z)\), and thus \(p(X,Y,Z)=\frac{p(X,Z)\,p(Y,Z)}{p(Z)}\). Hence,

\[
\mathrm{CR}(X;YZ)=\frac{p(X,Y,Z)}{p(X)\,p(Y,Z)}=\frac{p(X,Z)\,p(Y,Z)}{p(Z)\,p(X)\,p(Y,Z)}=\frac{p(X,Z)}{p(X)\,p(Z)}=\mathrm{CR}(X;Z).
\]