Linear Co-occurrence Rate Networks (L-CRNs) for Sequence Labeling

  • Conference paper
Statistical Language and Speech Processing (SLSP 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8791)

Abstract

Sequence labeling has wide applications in natural language processing and speech processing. Popular sequence labeling models suffer from known problems: hidden Markov models (HMMs) are generative and cannot encode transition features; conditional Markov models (CMMs) suffer from the label bias problem; and training conditional random fields (CRFs) can be expensive. In this paper, we propose Linear Co-occurrence Rate Networks (L-CRNs) for sequence labeling, which avoid these problems. The factors of L-CRNs can be locally normalized and trained separately, which leads to a simple and efficient training method. Experimental results on real-world natural language processing datasets show that L-CRNs reduce training time by orders of magnitude while achieving results very competitive with those of CRFs.

Our C++ implementation of L-CRNs and the datasets used in this paper can be found at https://github.com/zheminzhu/Co-occurrence-Rate-Networks.

Notes

  1.

    Another popular model is structured (structural) SVM [1] which essentially applies factorization to kernels. Due to its lack of a direct probabilistic interpretation, we leave it for future work.

  2.

    In this extreme case, the entropy of \(p(s_{i+1}|s_{i})\) reaches its minimum of 0.

  3.

    HMMs do not suffer from the label bias problem, because the factors \(p(o_i|s_i)\) in Eq. 1 guarantee that the observation evidence is always used.

  4.

    Sometimes they are intuitively explained as the compatibility of the nodes in cliques, but the notion of compatibility has no mathematical definition.

  5.

    [11, 12] show the superiority of CRFs over other models; hence it is reasonable to compare against CRFs.

  6.

    L-CRNs can be easily parallelized: each regression model can be trained in parallel with the others.

  7.

    http://www.cnts.ua.ac.be/conll2003/ner/

  8.

    Known words are words that appear in the training data; unknown words are words that do not. "All words" includes both.

  9.

    http://www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt

References

  1. Altun, Y., Smola, A.J., Hofmann, T.: Exponential families for conditional random fields. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI ’04, pp. 2–9. AUAI Press (2004)

  2. Berg-Kirkpatrick, T., Bouchard-Côté, A., DeNero, J., Klein, D.: Painless unsupervised learning with features. In: NAACL, HLT ’10, pp. 582–590 (2010)

  3. Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Comput. Linguist. 16(1), 22–29 (1990)

  4. Cohn, T.A.: Scaling conditional random fields for natural language processing. Ph.D. thesis, University of Melbourne (2007)

  5. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)

  6. Ghahramani, Z.: An introduction to hidden Markov models and Bayesian networks. In: Juang, B.H. (ed.) Hidden Markov Models, pp. 9–42. World Scientific Publishing, Adelaide (2002)

  7. Hammersley, J.M., Clifford, P.E.: Markov random fields on finite graphs and lattices. Unpublished manuscript (1971)

  8. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European Conference on Machine Learning, ECML ’98, pp. 137–142 (1998)

  9. Klein, D., Manning, C.D.: Conditional structure versus conditional estimation in NLP models. In: EMNLP ’02, pp. 9–16 (2002). http://dx.doi.org/10.3115/1118693.1118695

  10. Kudo, T.: CRF++: yet another CRF toolkit, free software, March 2012. http://crfpp.googlecode.com/svn/trunk/doc/index.html

  11. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: ICML ’01, pp. 282–289 (2001)

  12. Le-Hong, P., Phan, X.H., Tran, T.T.: On the effect of the label bias problem in part-of-speech tagging. In: The 10th IEEE RIVF International Conference on Computing and Communication Technologies, pp. 103–108 (2013)

  13. McCallum, A., Freitag, D., Pereira, F.C.N.: Maximum entropy Markov models for information extraction and segmentation. In: ICML ’00, pp. 591–598 (2000)

  14. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. In: Proceedings of the IEEE, pp. 257–286 (1989)

  15. Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Stat. Comput. 14(3), 199–222 (2004)

  16. Sutton, C., McCallum, A.: An introduction to conditional random fields. Found. Trends Mach. Learn. 4(4), 267–373 (2012)

  17. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of CoNLL-2003, Edmonton, Canada (2003)

  18. Zhu, Z., Hiemstra, D., Apers, P.M.G., Wombacher, A.: Separate training for conditional random fields using co-occurrence rate factorization. Technical report TR-CTIT-12-29, Centre for Telematics and Information Technology, University of Twente, Enschede, October 2012

  19. Zhu, Z., Hiemstra, D., Apers, P.M.G., Wombacher, A.: Empirical co-occurrence rate networks for sequence labeling. In: ICMLA 2013, Miami Beach, FL, USA, December 2013, pp. 93–98 (2013)

  20. Zhu, Z.: Factorizing probabilistic graphical models using cooccurrence rate, August 2010. arXiv:1008.1566v1

  21. Zhu, Z., Hiemstra, D., Apers, P., Wombacher, A.: Comparison of local and global undirected graphical models. In: The 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 479–484 (2014)

Acknowledgments

We thank the SLSP 2014 reviewers for their comments. This work has been supported by the Dutch national program COMMIT/.

Author information

Correspondence to Zhemin Zhu.

6 Appendix

6.1 Closed-Form MLE Training of L-CRN

We maximize the log-likelihood of Eq. 3 over the training dataset \(D\), with \(\mathrm {CR}\) and \(p\) as parameters:

$$\begin{aligned} \max . \quad&\sum _{(S,O)\in D}\Big [\sum _{i=1}^{n-1}\log \mathrm {CR}(s_i;s_{i+1}|O) + \sum _{j=1}^{n}\log p(s_j|O)\Big ]\\ \mathrm {s.t.} \quad&\sum _{s,s'}\mathrm {CR}(s;s'|O)\,p(s|O)\,p(s'|O)=1\\ \quad&\sum _{s}p(s|O)=1\\ \quad&\mathrm {CR}(s;s'|O)\ge 0, \ \forall s,s'\\ \quad&p(s|O)\ge 0, \ \forall s \end{aligned}$$

First, we ignore the last two non-negativity constraints. Using Lagrange multipliers, we obtain the unconstrained objective function:

$$\begin{aligned}&\sum _{(S,O)\in D}\Big [\sum _{i=1}^{n-1}\log \mathrm {CR}(s_i;s_{i+1}|O) + \sum _{j=1}^{n}\log p(s_j|O)\Big ]\\&+\lambda \Big (\sum _{s,s'}\mathrm {CR}(s;s'|O)\,p(s|O)\,p(s'|O)-1\Big ) +\mu \Big (\sum _{s}p(s|O)-1\Big ). \end{aligned}$$

Setting the first derivative with respect to each parameter to zero, we obtain the closed-form MLE for \(\mathrm {CR}\) and \(p\):

$$\begin{aligned}&\hat{p}(s|O)=\frac{\#(s,O)}{\sum _s \#(s,O)},\\&\hat{\mathrm {CR}}(s;s'|O)= \frac{\#(s,s',O)}{\sum _{s,s'} \#(s,s',O)} \bigg / \left( \frac{\#(s,O)}{\sum _s \#(s,O)} \cdot \frac{\#(s',O)}{\sum _{s'} \#(s',O)}\right) . \end{aligned}$$

That is, the MLEs of \(p\) and \(\mathrm {CR}\) are simply their relative frequencies in the training dataset. Fortunately, the non-negativity constraints that were ignored during optimization are automatically satisfied by these estimates.
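To make the counting concrete, here is a minimal C++ sketch of these relative-frequency estimates. It is not the authors' released implementation: for brevity it drops the conditioning on \(O\) and keys counts by labels only, and all names here (LCRNCounts, addSequence, etc.) are our own illustrative choices.

```cpp
// Minimal sketch of the closed-form MLE above: every factor is a
// relative frequency, so "training" is a single counting pass.
#include <map>
#include <string>
#include <utility>
#include <vector>

struct LCRNCounts {
    std::map<std::string, double> unigram;                        // #(s)
    std::map<std::pair<std::string, std::string>, double> bigram; // #(s,s')
    double uniTotal = 0.0, biTotal = 0.0;

    // Accumulate unigram and bigram label counts from one training sequence.
    void addSequence(const std::vector<std::string>& labels) {
        for (std::size_t i = 0; i < labels.size(); ++i) {
            unigram[labels[i]] += 1.0;
            uniTotal += 1.0;
            if (i + 1 < labels.size()) {
                bigram[{labels[i], labels[i + 1]}] += 1.0;
                biTotal += 1.0;
            }
        }
    }

    // \hat{p}(s): relative frequency of the label s.
    double p(const std::string& s) const {
        auto it = unigram.find(s);
        return it == unigram.end() ? 0.0 : it->second / uniTotal;
    }

    // \hat{CR}(s;s') = \hat{p}(s,s') / (\hat{p}(s) * \hat{p}(s')):
    // the ratio of the observed bigram frequency to what independence
    // of s and s' would predict.
    double cr(const std::string& s, const std::string& sp) const {
        auto it = bigram.find({s, sp});
        if (it == bigram.end()) return 0.0;
        return (it->second / biTotal) / (p(s) * p(sp));
    }
};
```

Because each factor is estimated independently from counts, there is no global normalization and no iterative optimization; this is where the orders-of-magnitude training-time savings over CRFs come from.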

6.2 Theorems of Co-occurrence Rate

Proof of Partition Operation

Proof

$$\begin{aligned}&\mathrm {CR}(X_1;..;X_j)\mathrm {CR}(X_{j+1};..;X _n)\mathrm {CR}(X_1..X_j;X_{j+1}..X _n)\\&=\frac{p(X_1,..,X_j)}{p(X_1)..p(X_j)}\frac{p(X_{j+1},..,X _n)}{p(X_{j+1})..p(X _n)}\frac{p(X_{1},..,X _n)}{p(X_1,..,X_j)p(X_{j+1},.., X _n)}\\&=\frac{p(X_{1},..,X _n)}{p(X_1)..p(X _n)}=\mathrm {CR}(X_1;..;X _n). \end{aligned}$$
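To see the identity in a concrete case (a worked instance of ours, not from the paper), take \(n=3\) and \(j=2\), and note that a single variable has \(\mathrm {CR}(X)=p(X)/p(X)=1\):

$$\begin{aligned} \mathrm {CR}(X_1;X_2)\,\mathrm {CR}(X_3)\,\mathrm {CR}(X_1X_2;X_3) =\frac{p(X_1,X_2)}{p(X_1)p(X_2)}\cdot 1\cdot \frac{p(X_1,X_2,X_3)}{p(X_1,X_2)p(X_3)} =\mathrm {CR}(X_1;X_2;X_3). \end{aligned}$$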

Proof of Reduce Operation

Proof

Since \(X\perp \!\!\!\perp Y \mathbin {|} Z\), we have \(p(X,Y|Z)=p(X|Z)p(Y|Z)\), and hence \(p(X,Y,Z)=\frac{p(X,Z)p(Y,Z)}{p(Z)}\). Therefore,

$$\begin{aligned} \mathrm {CR}(X;YZ)=\frac{p(X,Y,Z)}{p(X)p(Y,Z)}=\frac{p(X,Z)p(Y,Z)}{p(Z)\,p(X)\,p(Y,Z)} =\frac{p(X,Z)}{p(X)p(Z)}=\mathrm {CR}(X;Z). \end{aligned}$$
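Combining the two operations gives the linear-chain factorization on which L-CRNs are built. As an illustration (a worked instance of ours, assuming the first-order Markov property \(X_1\perp \!\!\!\perp X_3 \mathbin {|} X_2\)):

$$\begin{aligned} \mathrm {CR}(X_1;X_2;X_3)&=\mathrm {CR}(X_1)\,\mathrm {CR}(X_2;X_3)\,\mathrm {CR}(X_1;X_2X_3) \quad \text {(Partition at } j=1\text {)}\\&=\mathrm {CR}(X_2;X_3)\,\mathrm {CR}(X_1;X_2) \quad \text {(}\mathrm {CR}(X_1)=1\text {; Reduce).} \end{aligned}$$

Repeating this along a chain of length \(n\) yields the product of pairwise factors \(\prod _{i=1}^{n-1}\mathrm {CR}(s_i;s_{i+1}|O)\) that appears in the objective of Sect. 6.1.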

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhu, Z., Hiemstra, D., Apers, P. (2014). Linear Co-occurrence Rate Networks (L-CRNs) for Sequence Labeling. In: Besacier, L., Dediu, AH., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2014. Lecture Notes in Computer Science, vol 8791. Springer, Cham. https://doi.org/10.1007/978-3-319-11397-5_14

  • DOI: https://doi.org/10.1007/978-3-319-11397-5_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11396-8

  • Online ISBN: 978-3-319-11397-5

  • eBook Packages: Computer Science, Computer Science (R0)
