Event Geoparser with Pseudo-Location Entity Identification and Numerical Argument Extraction: Implementation and Evaluation in the Indonesian News Domain
Figure 1. A comparison of toponym-level and event-level scopes of resolution. Unlike a toponym-level geoparser, the event geoparser not only detects toponyms and resolves their coordinates but also detects which event(s) happened and infers which toponym is the real location (LOC) and which is only a pseudo-location (PLOC) with regard to that event.

Figure 2. Taxonomy of toponym ambiguity. Although regular geoparsers can already filter geo/non-geo ambiguities and assign disambiguated coordinates to geo/geo referential ambiguities, they cannot yet handle event geolocation properly, i.e., recognize and resolve toponyms that are both locative and precise for particular events by discarding all "pseudo-location" entities that are irrelevant to that event. Note that pseudo-location entities may appear as either literal or associative toponyms.

Figure 3. Sample ontology of a vehicle accident. Events can be modeled as groupings of semantic roles and their arguments, forming templates for different types of events. For example, the Accident event has two semantic roles concerning location entities and one numerical argument.

Figure 4. Zipf curves for the Indonesian corpus (650 K).

Figure 5. Integrated event extraction and geoparsing: the system accepts a news document as input and resolves toponyms and other entities (a), event trigger types (t), arguments (r), and event locations (g) from the text, chaining geotagging with toponym resolution and event extraction. It uses a semantic gazetteer for features and regular-expression rules learned from a large corpus to increase precision, recall, and geoparser accuracy.

Figure 6. Aggregated topic model plate notation and schema.

Figure 7. Treemap of topic proportions (a) and the top words from two sample topics (accident and fire) (b). The area shown in the left figure is determined by the number of topic assignments to that particular label, C(φ). The area shown in the right figure is determined by the probability of each word within that topic, Pφk(w1).

Figure 8. Regular expression to detect numerical arguments. The argument typically either starts or ends with the role keyword, followed by various numerical quantities and then the unit of the argument. For example, "the accident left 2 people killed" is extracted as 2 (numeric), people (unit), killed (role).

Figure 9. Illustration of the difference in strategy between the original Spatial Minimality (a) and Spatial Minimality Centroid Distance (SMCD) (b). SMCD-ADM is SMCD with an adjustment to the weight factor of the distance.

Figure 10. Ablation test of weighted F1 scores from nine feature combinations over four geotagging steps (steps 1, 3, 4, and 5 from Figure 4). Active features are marked as blue cells (lower part of the graphic). Missing score points mean that the feature combination is not applicable to that particular step.

Figure 11. The event geoparser correctly assigns the real-location LOC label to Demak and the pseudo-location PLOC label to Kudus with the help of the event argument (step 4) and event trigger feature (step 3) outputs. Entity tags from the step 1 output are omitted for clarity.

Figure 12. Taxonomy generated (translated) by the topic-similarity metric from the seed root node (topic tag) "banjir_jakarta" (jakarta_flood). The leaf nodes are the top words of their respective parent topics. The generated tree is limited to the five most similar topics.

Figure 13. Visualization sample of a single article with results from our proposed event geoparser. The source article is displayed and tagged with colors indicating arguments, event triggers, and the pseudo-location/real-location label for every detected toponym.
Abstract
1. Introduction
2. Related Works
2.1. Scope of Resolution of Geoparsers: Toponym-Level, Document-Level, and Event-Level
2.2. Mainstream Approaches in Geotagging: Gazetteer and Data-Driven NER Approach
2.3. Geotagging True (Locative and Precise) Location Toponyms Relative to an Event
2.4. Integrating Event Extraction Model into Geoparsing
2.5. Geocoding (Toponym Resolution) Process and Strategies
2.6. Increasing Model Generalizability with Topic Modeling
3. Geospatial News Event Extraction Corpus
4. Approach
4.1. Task Formulation
4.2. Three Stages of Event Geoparsing Workflow
- Geotagging, in which named literal geographical entities (toponyms) are recognized among other named entities. This is where Named Entity Recognition is typically invoked to recognize location entities.
- Geocoding (or toponym resolution), in which the correct toponyms are disambiguated from other toponym candidates (potential referents) and then assigned correct geographic coordinates. This is a toponym-level scope of resolution, calculated using a spatial-minimality-based algorithm.

We aim for a deeper integration of event extraction into geoparsing by extending these two original steps with a more transparent flow of features, unlike the typical combination of an event coder plus a geoparser such as TABARI/Leetaru or PETRARCH/CLIFF [68]. In particular, the model runs the event extraction stage after the geoparsing stage (geotagging and geocoding), followed by an event-level scope resolution stage, as shown in the dotted boxes in Figure 5. This provides event record data to be stored along with place data. The second stage, event extraction, comprises two steps:

- Event trigger classification, which recognizes event triggers and assigns an event code label based on the detected class.
- Argument extraction, which recognizes semantic roles within the event and extracts arguments, including numerical ones.

The final stage resolves the location of the event (event-level geoparsing). It comprises the following steps:

- Pseudo-location identification, which classifies each LOC entity detected in step 1 as either PLOC (pseudo-location) or LOC (real location).
- Event coreference resolution, which groups several mentions of the same event instance in the document into a single event structure.
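The staged workflow above can be sketched as a chain of small functions. This is a minimal illustration only: all names, data structures, and the stub logic are assumptions, not the system's actual implementation.

```python
# Sketch of the three-stage event-geoparsing workflow (illustrative stubs only).
from dataclasses import dataclass, field

@dataclass
class Toponym:
    text: str
    coord: tuple = None          # filled in by geocoding; None if not in gazetteer
    label: str = "LOC"           # refined later to LOC or PLOC

@dataclass
class EventRecord:
    trigger: str = ""
    event_type: str = ""
    arguments: dict = field(default_factory=dict)
    locations: list = field(default_factory=list)   # real (LOC) toponyms only

def geotag(doc):
    """Stage 1a: recognize toponyms (NER). Stub: naive capitalized-word match."""
    return [Toponym(w) for w in doc.split() if w.istitle()]

def geocode(toponyms, gazetteer):
    """Stage 1b: assign coordinates from a gazetteer (plain dict lookup here)."""
    for t in toponyms:
        t.coord = gazetteer.get(t.text)
    return toponyms

def extract_event(doc):
    """Stage 2: event trigger classification + argument extraction (keyword stub)."""
    ev = EventRecord()
    if "flood" in doc.lower():
        ev.trigger, ev.event_type = "flood", "FLOOD-EVENT"
    return ev

def resolve_event_location(ev, toponyms):
    """Stage 3: pseudo-location identification. Stub rule: un-geocoded
    toponyms are treated as PLOC; geocoded ones attach to the event."""
    for t in toponyms:
        t.label = "LOC" if t.coord else "PLOC"
    ev.locations = [t for t in toponyms if t.label == "LOC"]
    return ev
```

For example, with a toy gazetteer `{"Demak": (-6.89, 110.64)}`, a sentence mentioning both Demak and Kudus yields one LOC (Demak) attached to a FLOOD-EVENT record and one PLOC (Kudus).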
4.3. Analysis of the Topic and Event Space: Tying Themes to Geospatial Referenced Text
Algorithm 1. Merge function to form the aggregated model.

function merge(φ1, φ2):
1: input:
2: φ1, φ2: topics to be merged
3: C: topic assignment counts for all topics
4: output: new topic φ′
5: begin:
6: create a new φ′ which has all top-words from both φ1 and φ2
7: let C(φ′) = C(φ1) + C(φ2)
8: for each word w in φ1 and φ2:
9:   if w exists in both φ1 and φ2:
10:    let Pφ′(w) = (C(φ1)·Pφ1(w) + C(φ2)·Pφ2(w)) / C(φ′)
11:  else if w exists only in φ1:
12:    let Pφ′(w) = C(φ1)·Pφ1(w) / C(φ′)
13:  else if w exists only in φ2:
14:    let Pφ′(w) = C(φ2)·Pφ2(w) / C(φ′)
15:  end if
16:  append w into φ′
17: end for
18: set C(φ′) and return φ′
Algorithm 2. Aggregate procedure.

procedure aggregate:
1: input:
2: Φ: set of topics
3: C: topic assignment counts for all topics
4: Λ: set of labels of all topics
5: output: merged topic model M
6: begin:
7: initialize M = {}
8: for each topic φ ∈ Φ:
9:   if a topic φm with the same label λ(φ) exists in M:
10:    let φ′ = merge(φ, φm)
11:    remove φm from M
12:    append φ′ into M
13:  else
14:    append φ into M, with adjusted C(φ)
15:  end if
16: end for
17: end
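A count-weighted merge of same-label topics can be sketched in Python. The dict-based topic representation (`{'words': {word: prob}, 'count': n}`) is an assumption for illustration, not the paper's actual data structure.

```python
def merge_topics(phi1, phi2):
    """Merge two topics sharing a label (sketch). Each topic is assumed to be
    {'words': {word: probability}, 'count': number_of_topic_assignments}."""
    c1, c2 = phi1["count"], phi2["count"]
    c_new = c1 + c2
    merged = {}
    for w in set(phi1["words"]) | set(phi2["words"]):
        # Count-weighted average; a word absent from one topic contributes 0.
        merged[w] = (c1 * phi1["words"].get(w, 0.0)
                     + c2 * phi2["words"].get(w, 0.0)) / c_new
    return {"words": merged, "count": c_new}

def aggregate(topics, label_of):
    """Fold every topic into the model, merging topics that share a label."""
    model = {}
    for phi in topics:
        lam = label_of(phi)
        model[lam] = merge_topics(model[lam], phi) if lam in model else phi
    return model
```

The weighting by assignment counts keeps the merged distribution a proper average: a topic backed by many assignments dominates the merged word probabilities.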
4.4. Semantic Gazetteer for Event Keywords Feature and Numeric Argument Recognition
- Semantically related terms given a topic label, produced by our Aggregated Topic Model (the n top words).
- Semantic similarity produced by the Word2Vec [62] most_similar() function.
- Bigram counts produced by the NLTK package's n-gram analysis.
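The numeric-argument pattern described for Figure 8 (a number, a unit, then a role keyword) can be sketched with a small regular expression. This is a minimal English-language illustration; the paper's actual rules are learned from a large Indonesian corpus.

```python
import re

# Illustrative pattern: "<number> <unit> <role>" sequences such as
# "2 people killed". The role-keyword list here is a toy assumption.
NUM_ARG = re.compile(
    r"(?P<num>\d+(?:[.,]\d+)?)\s+(?P<unit>\w+)\s+"
    r"(?P<role>killed|injured|evacuated|submerged)"
)

def extract_numeric_args(text):
    """Return (numeric, unit, role) triples found in the text."""
    return [(m.group("num"), m.group("unit"), m.group("role"))
            for m in NUM_ARG.finditer(text)]
```

For example, "the accident left 2 people killed" yields the triple ("2", "people", "killed"), matching the role/unit/numeric decomposition described above.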
4.5. Smallest Administrative Level (SAL) Geospatial Feature for Pseudo-Location Identification
1. The area calculation is replaced by the calculation of the distance of points to their centroid (Centroid Distance). This speeds up the process and avoids the degenerate cases where a document contains only two or fewer toponyms. In other words, the minimality of area is replaced by the minimality of the distance of the candidate points to their centroid (see Figure 9).
2. The minimality of distance is adjusted by multiplying it by the administrative level of the area. Hence, the smaller the administrative area of a candidate referent, the less preferred it is. Note that this is the reverse of the principle behind the smallest-administrative-level feature, which seeks the smallest administrative area: in this toponym resolution task, what is sought is the commonality of the toponym mention, whereas the precision of the place mention is what matters in the Pseudo-location Identification task.
Algorithm 3. Finding the Smallest Administrative Level feature using Spatial Minimality (SM) from [12].

function getSmallestAdministrativeLevel(D: document, G: gazetteer):
1: output: smallest administrative level of the document
2: begin:
3: initialize toponyms T = {}
4: T = extract location entities from D
5: DT = DisambiguateDocumentSM(T, G)
6: L = {}
7: for each t in DT:
8:   adm_level = look up the administrative level of t from G
9:   append adm_level to L
10: return the maximum adm_level from L
11: end

function DisambiguateDocumentSM(T: list of toponyms, G: gazetteer):
1: begin
2: for each t in T:
3:   let Cand(t) = look up the set of all possible candidate-referent tuples of t in the gazetteer
4: let S = cross product of all Cand(t)
5: for each N-tuple C in S:
6:   H = polygon from all centroids in C
7:   A = area of H
8: return the tuple C* that has minimum A over all tuples C
9: end
Algorithm 4. Modified Spatial Minimality with Centroid Distance and an adjustment factor based on administrative level (SMCD-ADM).

function DisambiguateDocumentSMCD-ADM(T: list of toponyms, G: gazetteer):
1: begin
2: for each t in T:
3:   let Cand(t) = look up the set of all possible candidate referents of t in gazetteer G
4: let S = cross product of all Cand(t)
5: for each N-tuple C in S:
6:   Cd = centroid of all points in C using G
7:   maxP = the point p in C with maximum distance to centroid Cd
8:   maxdistC = distance of maxP to centroid Cd
9:   adm_levelC = administrative level of maxP
10:  adjusted_maxdistC = (adm_levelC + 1) × maxdistC
11: return the tuple C with the smallest adjusted_maxdistC
12: end
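Algorithm 4's selection loop can be sketched under simplifying assumptions: candidates are given as (lat, lon, admin_level) tuples, and distances are planar rather than geodesic (a real implementation would use great-circle distance).

```python
from itertools import product
from math import hypot

def disambiguate_smcd_adm(candidates):
    """Sketch of SMCD-ADM. `candidates` is a list, one entry per toponym,
    of lists of (lat, lon, admin_level) candidate referents. Over the cross
    product of referents, pick the tuple whose maximum centroid distance,
    weighted by (admin_level + 1) of the farthest point, is smallest."""
    best, best_score = None, float("inf")
    for combo in product(*candidates):
        # centroid of the candidate tuple
        cx = sum(p[0] for p in combo) / len(combo)
        cy = sum(p[1] for p in combo) / len(combo)
        # the farthest candidate from the centroid dominates the score
        max_p = max(combo, key=lambda p: hypot(p[0] - cx, p[1] - cy))
        dist = hypot(max_p[0] - cx, max_p[1] - cy)
        score = (max_p[2] + 1) * dist   # admin-level adjustment factor
        if score < best_score:
            best, best_score = combo, score
    return best
```

With two hypothetical referents for one toponym and a single nearby referent for another, the geographically compact combination wins, which is the spatial-minimality intuition the algorithm builds on.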
5. Experiments and Results
5.1. Geotagging
5.2. The Pseudo-Location Classification
5.3. Aggregated Topic Model
5.4. Disambiguation and Toponym Resolution
5.5. Automatic Generation of a Rich Thematic Map from a Single Article
6. Discussion
7. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Himmelstein, M. Local search: The Internet is the Yellow Pages. Computer 2005, 38, 26–34. [Google Scholar] [CrossRef] [Green Version]
- Wunderwald, M. NewsX: Event Extraction from News Articles. Master’s Thesis, Dresden University of Technology, Dresden, Germany, 2011. [Google Scholar]
- Gelernter, J.; Balaji, S. An algorithm for local geoparsing of microtext. GeoInformatica 2013, 17, 635–667. [Google Scholar] [CrossRef]
- Wang, W.; Stewart, K. Spatiotemporal and semantic information extraction from Web news reports about natural hazards. Comput. Environ. Urban Syst. 2015, 50, 30–40. [Google Scholar] [CrossRef]
- Freifeld, C.C.; Mandl, K.D.; Reis, B.Y.; Brownstein, J.S. HealthMap: Global Infectious Disease Monitoring through Automated Classification and Visualization of Internet Media Reports. J. Am. Med. Inform. Assoc. 2008, 15, 150–157. [Google Scholar] [CrossRef] [PubMed]
- Purves, R.; Clough, P.; Jones, C.B.; Arampatzis, A.; Bucher, B.; Finch, D.; Fu, G.; Joho, H.; Syed, A.K.; Vaid, S.; et al. The design and implementation of SPIRIT: A spatially aware search engine for information retrieval on the Internet. Int. J. Geogr. Inf. Sci. 2007, 21, 717–745. [Google Scholar] [CrossRef]
- Gritta, M.; Pilehvar, M.T.; Collier, N. A pragmatic guide to geoparsing evaluation. Lang. Resour. Eval. 2020, 54, 683–712. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Woodruff, A.G. (GIPSY) Georeferenced Information Processing System. J. Am. Soc. Inf. Sci. 1994, 45, 1–44. [Google Scholar]
- Gritta, M. Where Are You Talking About? Advances and Challenges of Geographic Analysis of Text with Application to Disease Monitoring. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 2019. [Google Scholar]
- Bo, A.; Peng, S.; Xinming, T.; Alimu, N. Spatio-temporal visualization system of news events based on GIS. In Proceedings of the IEEE 3rd International Conference on Communication Software and Networks, Xi’an, China, 27–29 May 2011; pp. 448–451. [Google Scholar] [CrossRef]
- Grover, C.; Tobin, R.; Byrne, K.; Woollard, M.; Reid, J.; Dunn, S.; Ball, J. Use of the Edinburgh geoparser for georeferencing digitized historical collections. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2010, 368, 3875–3889. [Google Scholar] [CrossRef] [Green Version]
- Leidner, J.L. Toponym Resolution in Text: Annotation, Evaluation and Applications of Spatial Grounding of Place Names. Ph.D. Dissertation, The University of Edinburgh, Edinburgh, UK, 2007. [Google Scholar]
- Amitay, E.; Har’El, N.; Sivan, R.; Soffer, A. Web-a-Where: Geotagging Web Content. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, Sheffield, UK, 25—29 July 2004; pp. 273–280. [Google Scholar]
- Karimzadeh, M.; Pezanowski, S.; MacEachren, A.M.; Wallgrün, J.O. GeoTxt: A scalable geoparsing system for unstructured text geolocation. Trans. GIS 2019, 23, 118–136. [Google Scholar] [CrossRef]
- Gritta, M.; Pilehvar, M.T.; Collier, N. Which Melbourne? Augmenting Geocoding with Maps. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2018; Volume 1, pp. 1285–1296. [Google Scholar] [CrossRef]
- D’Ignazio, C.; Bhargava, R.; Zuckerman, E.; Beck, L. CLIFF-CLAVIN: Determining Geographic Focus for News. In NewsKDD Data Science for News Publishing; NewsKDD: Data Science for News Publishing, at KDD: New York, NY, USA, 2014. [Google Scholar]
- Lieberman, M.D.; Samet, H.; Sankaranarayanan, J.; Sperling, J. STEWARD: Architecture of a Spatio-Textual Search Engine. In Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic Information Systems, Seattle, WA, USA, 7–9 November 2007. [Google Scholar]
- LDC. ACE (Automatic Content Extraction) English Annotation Guidelines for Events V5.4.3 Linguistic Data Consortium. 2005. Available online: https://www.ldc.upenn.edu/collaborations/past-projects/ace (accessed on 8 November 2020).
- Dewandaru, A.; Supriana, S.I.; Akbar, S. Event-Oriented Map Extraction from Web News Portal: Binary Map Case Study on Diphteria Outbreak and Flood in Jakarta. In Proceedings of the 2018 5th International Conference on Advanced Informatics: Concept Theory and Applications (ICAICTA), Krabi, Thailand, 14–17 August 2018; pp. 72–77. [Google Scholar] [CrossRef]
- Ramage, D.; Hall, D.; Nallapati, R.; Manning, C.D. Labeled LDA. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing Volume 1—EMNLP ’09, Stroudsburg, PA, USA, August 2009; pp. 248–256. Available online: https://dl.acm.org/doi/10.5555/1699510.1699543 (accessed on 8 November 2020).
- CLAVIN (Cartographic Location and Vicinity INdexer). Available online: https://github.com/Novetta/CLAVIN (accessed on 8 November 2020).
- Teitler, B.E.; Lieberman, M.D.; Panozzo, D.; Sankaranarayanan, J.; Samet, H.; Sperling, J. NewsStand. In Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems GIS ’08, Irvine, CA, USA, 5–7 November 2008; Volume 2008, p. 1. [Google Scholar] [CrossRef]
- Andogah, G.; Bouma, G.; Nerbonne, J. Every document has a geographical scope. Data Knowl. Eng. 2012, 81–82, 1–20. [Google Scholar] [CrossRef]
- Li, H.; Srihari, R.K.; Niu, C.; Li, W. Location normalization for information extraction. In Proceedings of the 19th International Conference on Computational Linguistics; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2002; pp. 1–7. Available online: https://www.aclweb.org/anthology/C02-1127/ (accessed on 8 November 2020).
- Srihari, R.K.; Li, W.; Cornell, T.; Niu, C. InfoXtract: A customizable intermediate level information extraction engine. Nat. Lang. Eng. 2006, 14, 33–69. [Google Scholar] [CrossRef] [Green Version]
- Schrodt, P.A.; Leetaru, K. GDELT: Global Data on Events, Location and Tone, 1979–2012. In Proceedings of the International Studies Association Annual Meeting, San Francisco, CA, USA, 29 March 2013; pp. 1–49. [Google Scholar]
- Leetaru, K.H. Fulltext Geocoding Versus Spatial Metadata for Large Text Archives: Towards a Geographically Enriched Wikipedia. D-Lib Mag. 2012, 18, 1–23. [Google Scholar] [CrossRef]
- Lee, S.J.; Liu, H.; Ward, M.D. Lost in Space: Geolocation in Event Data. Political Sci. Res. Methods 2019, 7, 871–888. [Google Scholar] [CrossRef] [Green Version]
- Handbook of Computational Approaches to Counterterrorism; Springer Science and Business Media LLC: Berlin, Germany, 2013. [CrossRef]
- Halterman, A. Linking Events and Locations in Political Text. MIT Political Science Department Research Paper No. 2018-21, 1 September 2018. Available online: https://ssrn.com/abstract=3267476 (accessed on 8 November 2020).
- Imani, M.B.; Chandra, S.; Ma, S.; Khan, L.; Thuraisingham, B. Focus location extraction from political news reports with bias correction. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Institute of Electrical and Electronics Engineers (IEEE), Boston, MA, USA, 11–14 December 2017; pp. 1956–1964. [Google Scholar] [CrossRef]
- Pennington, J.; Socher, R.; Manning, C. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar] [CrossRef]
- Halterman, A. Geolocating Political Events in Text. In Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science, Minneapolis, MN, USA, 6 June 2019; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2019; pp. 29–39. [Google Scholar]
- Yang, B.; Mitchell, T.M. Joint Extraction of Events and Entities within a Document Context. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2016; pp. 289–299. [Google Scholar]
- Leidner, J.L.; Lieberman, M.D. Detecting geographical references in the form of place names and associated spatial natural language. SIGSPATIAL Spéc. 2011, 3, 5–11. [Google Scholar] [CrossRef]
- Geonames.org. “Geonames”. 2020. Available online: https://geonames.org (accessed on 8 November 2020).
- Morton-Owens, E.G. A Tool for Extracting and Indexing Spatio-Temporal Information from Biographical Articles in Wikipedia. 2012. Available online: http://www.cs.nyu.edu/web/Research/MsTheses/owens_emily.pdf (accessed on 8 November 2020).
- Schilder, F.; Versley, Y.; Habel, C. Extracting spatial information: Grounding, classifying and linking spatial expressions. In Proceedings of the workshop on geographic information retrieval at SIGIR 2004, Sheffield, UK, 25–29 July 2004; pp. 1–3. Available online: http://publikationen.stub.uni-frankfurt.de/frontdoor/deliver/index/docId/9959/file/VERSLEY_Extracting_spatial_information.pdf (accessed on 8 November 2020).
- Lan, R.; Adelfio, M.D.; Samet, H. Spatio-temporal disease tracking using news articles. In Proceedings of the Third ACM SIGSPATIAL International Workshop on the Use of GIS in Public Health, HealthGIS, Dallas, TX, USA, 4 November 2014; Volume 14, pp. 31–38. [Google Scholar] [CrossRef]
- Monteiro, B.R.; Davis, C.A.; Fonseca, F. A survey on the geographic scope of textual documents. Comput. Geosci. 2016, 96, 23–34. [Google Scholar] [CrossRef]
- Bensalem, I.; Kholladi, M.-K. Toponym Disambiguation by Arborescent Relationships. J. Comput. Sci. 2010, 6, 653–659. [Google Scholar] [CrossRef] [Green Version]
- Markert, K.; Nissim, M. Towards a corpus annotated for metonymies: The case of location names. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), Las Palmas, Spain, 29–31 May 2002; pp. 1385–1392. [Google Scholar]
- Hogenboom, F. An Overview of Event Extraction from Text. In Proceedings of the Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011), Workshop in conjunction with the 10th International Semantic Web Conference 2011 (ISWC 2011), Bonn, Germany, 23 October 2011. [Google Scholar]
- Pustejovsky, J.; Ingria, R.; Saurí, R.; Castaño, J.M.; Moszkowicz, J.; Katz, M. The Specification Language TimeML; Oxford University Press: Oxford, UK, 2004; pp. 1–15. [Google Scholar]
- Wang, W.; Zhao, D.; Wang, N. Chinese News Event 5W1H Elements Extraction Using Semantic Role Labeling. In Proceedings of the 2010 Third International Symposium on Information Processing, Qingdao, China, 15–17 October 2010; pp. 484–489. [Google Scholar] [CrossRef]
- Khodra, M.L. Event extraction on Indonesian news article using multiclass categorization. In Proceedings of the 2015 2nd International Conference on Advanced Informatics: Concepts, Theory and Applications (ICAICTA), Chonburi, Thailand, 19–22 August 2015; pp. 1–5. [Google Scholar]
- Rauch, E.; Bukatin, M.; Baker, K. A confidence-based framework for disambiguating geographic terms. In Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References, Stroudsburg, PA, USA, May 2003; pp. 50–54. Available online: https://dl.acm.org/doi/10.3115/1119394.1119402 (accessed on 8 November 2020).
- Leidner, J.L.; Sinclair, G.; Webber, B. Grounding spatial named entities for information extraction and question answering. In Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References, Stroudsburg, PA, USA, May 2003; Available online: https://dl.acm.org/doi/10.3115/1119394.1119399 (accessed on 8 November 2020).
- Habib, M.B.; Van Keulen, M. A Hybrid Approach for Robust Multilingual Toponym Extraction and Disambiguation. In Intelligent Information Systems Symposium; Springer: Berlin/Heidelberg, Germany, 2013; pp. 1–15. [Google Scholar] [CrossRef] [Green Version]
- Nissim, M.; Matheson, C.; Reid, J. Recognizing Geographical Entities in Scottish Historical Documents. In Proceedings of the Workshop on Geographic Information Retrieval at SIGIR 2004, Sheffield, UK, 25–29 July 2004. [Google Scholar]
- Adams, B.; McKenzie, G.; Gahegan, M. Frankenplace: Interactive thematic mapping for ad hoc exploratory search. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18 May 2015; pp. 12–22. [Google Scholar]
- Buscaldi, D. Toponym Disambiguation in Information Retrieval. Ph.D. Dissertation, Polytechnic University of Valencia, Valencia, Spain, 2015. [Google Scholar] [CrossRef] [Green Version]
- Smith, D.A.; Crane, G. Disambiguating Geographic Names in a Historical Digital Library. Comput. Vis. 2001, 2163, 127–136. [Google Scholar] [CrossRef]
- Wei, W.W. Automated Spatiotemporal and Semantic Information Extraction for Hazards. Ph.D. Dissertation, The University of Iowa, Iowa, IA, USA, 2018. [Google Scholar] [CrossRef] [Green Version]
- Wang, J.; Zhang, J.; An, Y.; Lin, H.; Yang, Z.; Zhang, Y.; Sun, Y. Biomedical event trigger detection by dependency-based word embedding. BMC Med. Genom. 2016, 9, 45. [Google Scholar] [CrossRef] [Green Version]
- Blei, D.M.; Carin, L.; Dunson, D.B. Probabilistic Topic Models. IEEE Signal. Process. Mag. 2010, 27, 55–65. [Google Scholar] [CrossRef] [Green Version]
- Řehůřek, R. Scalability of Semantic Analysis in Natural Language Processing. 2011, p. 147. Available online: http://radimrehurek.com/phd_rehurek.pdf (accessed on 8 November 2020).
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
- Papanikolaou, Y.; Tsoumakas, G. Subset Labeled LDA for Large-Scale Multi-Label Classification. 2017. Available online: https://arxiv.org/abs/1709.05480 (accessed on 8 November 2020).
- Kang, D.; Park, Y.; Chari, S.N. Hetero-Labeled LDA: A Partially Supervised Topic Model with Heterogeneous Labels; Springer Science and Business Media LLC: Berlin, Germany, 2014; Volume I, pp. 640–655. [Google Scholar] [CrossRef]
- Greene, D.; O’Callaghan, D.; Cunningham, P. How Many Topics? Stability Analysis for Topic Models; Springer Science and Business Media LLC: Berlin, Germany, 2014; Volume I, pp. 498–513. [Google Scholar] [CrossRef] [Green Version]
- Mikolov, T.; Corrado, G.; Chen, K.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, AZ, USA, 2–4 May 2013; pp. 12–22. [Google Scholar]
- Leidner, J.L. An evaluation dataset for the toponym resolution task. Comput. Environ. Urban Syst. 2006, 30, 400–417. [Google Scholar] [CrossRef]
- Gritta, M.; Pilehvar, M.T.; Limsopatham, N.; Collier, N. What’s missing in geographical parsing? Lang. Resour. Eval. 2018, 52, 603–623. [Google Scholar] [CrossRef] [Green Version]
- Ha, L.Q.; Hanna, P.; Ming, J.; Smith, F.J. Extending Zipf’s law to n-grams for large corpora. Artif. Intell. Rev. 2009, 32, 101–113. [Google Scholar] [CrossRef]
- Dewandaru, A. Event Geoparsing Indonesian News Dataset. IEEE Dataport. 2020. Available online: https://ieee-dataport.org/open-access/event-geoparsing-indonesian-news-dataset (accessed on 8 November 2020). [CrossRef]
- Bender, E.M.; Lascarides, A. Linguistic Fundamentals for Natural Language Processing II: 100 Essentials from Semantics and Pragmatics. Synth. Lect. Hum. Lang. Technol. 2019, 12, 1–268. [Google Scholar] [CrossRef]
- Schrodt, P.A. PETRARCH: The Successor to TABARI. 2019, pp. 1–3. Available online: http://eventdata.parusanalytics.com/tabari.dir/TABARI.0.8.4b3.manual.pdf (accessed on 8 November 2020).
- GADM Database of Global Administrative Areas, Version 2.0; University of California: Berkeley, CA, USA, 2012.
- Purwarianti, A.; Andhika, A.; Wicaksono, A.F.; Afif, I.; Ferdian, F. InaNLP: Indonesia natural language processing toolkit, case study: Complaint tweet classification. In Proceedings of the 2016 International Conference on Advanced Informatics: Concepts, Theory and Application (ICAICTA), George Town, Malaysia, 16–19 August 2016; pp. 1–5. [Google Scholar] [CrossRef]
- Strohmeyer, D.; Eggers, T.; Haupt, M. Waverider Aerodynamics and Preliminary Design for Two-Stage-to-Orbit Missions, Part 1. J. Spacecr. Rocket. 1998, 35, 450–458. [Google Scholar] [CrossRef]
- Murtaugh, M.A.; Gibson, B.S.; Redd, D.; Zeng-Treitler, Q. Regular expression-based learning to extract bodyweight values from clinical notes. J. Biomed. Inform. 2015, 54, 186–190. [Google Scholar] [CrossRef] [Green Version]
- Yang, J.; Zhang, Y. NCRF++: An Open-source Neural Sequence Labeling Toolkit. In Proceedings of the ACL 2018, System Demonstrations, Melbourne, Australia, July 2018; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2018; pp. 74–79. Available online: https://www.aclweb.org/anthology/P18-4013/ (accessed on 8 November 2020).
- Lin, J.C.-W.; Shao, Y.; Zhang, J.; Yun, U. Enhanced sequence labeling based on latent variable conditional random fields. Neurocomputing 2020, 403, 431–440. [Google Scholar] [CrossRef]
- Mimno, D.; Wallach, H.M.; Talley, E.; Leenders, M.; McCallum, A. Optimizing Semantic Coherence in Topic Models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, 27–31 July 2011; pp. 262–272. [Google Scholar]
- Mimno, D. Package ‘mallet,’ Comprehensive R Archive Network. 2015, pp. 1–11. Available online: https://cran.r-project.org/web/packages/mallet/mallet.pdf (accessed on 8 November 2020).
- Řehůřek, R.; Sojka, P. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, University of Malta, Valletta, Malta, 22 May 2010; p. 45. [Google Scholar]
- Li, Q.; Ji, H.; Huang, L. Joint Event Extraction via Structured Prediction with Global Features. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, 4–9 August 2013. [Google Scholar]
- McClosky, D.; Surdeanu, M.; Manning, C.D. Event extraction as dependency parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; Volume 1, pp. 1626–1635. [Google Scholar]
| Type of Resolution Scope | Output Model Formulation and Illustration (t = toponym, D = set of words in the document) | Example Geoparsers/GIR |
|---|---|---|
| Toponym-level (geocoded toponym) | Output is a geographic coordinate for each toponym in the document, based on the recognized toponym entities within the document. | SPIRIT [6]; Edinburgh Geoparser [11]; Spatial Minimality [12]; CLAVIN [21]; CamCoder [15] |
| Document-level (geographical scope of document) | Output is a single geographic coordinate or scope for the document, based on a scoring function over the recognized toponyms within the document. | Web-a-Where [13]; NewsStand [22]; GeoTxt [14]; Mahali [23]; CLIFF [16] |
| Event-level (geolocated event) | Output is a geographic coordinate for each location of the recognized event(s) within the document. | Mordecai [30]; Profile [31]; GDELT 1 (TABARI + Leetaru [26]); PETRARCH + Mordecai [30]; PETRARCH + CLIFF [28]; InfoXtract [25] + LocNZ [24] |
Tag | Description and Examples |
---|---|
EVE | Description: Event Triggers: word(s) that indicate an event has occurred. Examples: 1. Flood happened in… (Banjir terjadi di…) 2. ….that the fireworks from the band triggered fast-moving fire flame. (… bahwa kembang api dari band, memicu kobaran api yang bergerak cepat.) |
ARG | Description: Non-named arguments related to the event; may include numerical or non-numerical arguments.
I. Event arguments for Flood:
1. Flood height: Height-Arg. "The height of the water reached 2 m…" (Ketinggian air yang mencapai 2 m…)
2. Number of victims (deaths): DeathVictim-Arg. "At least 41 people were killed by the flood." (Sedikitnya 41 orang tewas akibat banjir ini.)
3. Number of evacuees: Evacuee-Arg. "The Indonesian Hospital in Nepal handled 9 victims and sheltered 346 evacuees." (Rumah Sakit Indonesia di Nepal Tangani 9 Korban dan Tampung 346 Pengungsi)
4. Number of affected houses: AffectedHouse-Arg. "The flood caused around 4991 houses to be submerged…" (Banjir menyebabkan sekitar 4.991 rumah terendam…)
II. Event arguments for Quake:
1. Magnitude (Richter or MMI unit): Strength-Arg. "A 5.2-Richter earthquake shakes Maluku waters." (Gempa 5.2 SR Goyang Perairan Maluku)
2. Quake center: Central-Arg. "The earthquake's coordinates are 3.4 South latitude, 128.41 East longitude…" (Titik koordinat gempa ada di 3.4 Lintang Selatan dan 128,41 Bujur Timur)
3. Quake depth: Depth-Arg. "The depth of the earthquake was 10 km." (Kedalaman gempa 10 km)
III. Event arguments for Fire:
1. Number of houses burnt: HouseBurnt-Arg. "A house at Mampang Prapatan burned…" (Sebuah rumah di Mampang Prapatan Terbakar)
2. Number of fire hotspots: Point-Arg. "There are nine fire spots…" (ada sembilan titik api…)
3. Units of fire trucks dispatched: DispatchedTrucks-Arg. "…12 fire trucks were dispatched." (…12 damkar dikerahkan)
IV. Event arguments for Accident:
1. License plate: Plate-Arg. "…with license plate B 9667 ZX." (…bernopol B 9667 ZX)
2. Vehicle type involved: Vehicle-Arg. "…hit the back of the truck hard." (menghantam keras bagian belakang truk)
3. Length of traffic jam: Length-Arg. "The accident caused a traffic jam of up to 2 km." (kecelakaan mengakibatkan kepadataan kendaraan hingga 2 km)
4. Origin or destination: FromTo-Arg.
V. Others (may appear in more than one of the events above):
1. Monetary loss: MonetaryLoss-Arg. "The loss is estimated at hundreds of millions of rupiahs." (Kerugian diperkirakan mencapai ratusan juta rupiah.)
2. Time or date of event: Time-Arg. "…at 15.25 WIB." (…pukul 15.25 WIB); "…as reported by the AFP news agency on Friday (25/3/2011)" (…dilansir kantor berita AFP, Jumat (25/3/2011))
3. Cause of event: Cause-Arg. "…caused by the overflow of the Kuncir River." (…disebabkan luapan air Sungai Kuncir)
4. Affected families: AffectedFamily-Arg. "…which caused 935 families to be affected." (…menyebabkan 935 KK terdampak)
5. Street names: Street-Arg. "There is a fire near KS Tubun Raya Street behind the Bimo Hotel." (ada kebakaran di dekat Jl KS Tubun Raya belakang hotel Bimo) |
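Many of the numerical arguments above follow fairly regular surface patterns in Indonesian news text, which is why regex-based features are useful before sequence labeling. A minimal sketch in Python — the patterns and label names here are illustrative assumptions, not the paper's exact rules:

```python
import re

# Illustrative patterns (assumptions) for a few argument types listed above.
PATTERNS = {
    # "Ketinggian air yang mencapai 2 m" -> flood height
    "Height-Arg":   re.compile(r"ketinggian air\D*(\d+(?:[.,]\d+)?)\s*(?:cm|m|meter)", re.I),
    # "Gempa 5.2 SR" -> quake magnitude in Richter scale
    "Strength-Arg": re.compile(r"gempa\s*(\d+(?:[.,]\d+)?)\s*SR", re.I),
    # "Kedalaman gempa 10 km" -> quake depth
    "Depth-Arg":    re.compile(r"kedalaman\D*(\d+(?:[.,]\d+)?)\s*km", re.I),
    # "bernopol B 9667 ZX" -> Indonesian license plate shape
    "Plate-Arg":    re.compile(r"\b[A-Z]{1,2}\s?\d{1,4}\s?[A-Z]{1,3}\b"),
}

def extract_arguments(text):
    """Return every (label, matched_span) pair found in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((label, m.group(0)))
    return found
```

For example, `extract_arguments("Gempa 5.2 SR Goyang Perairan Maluku")` yields a `Strength-Arg` match; in the actual system such matches would be emitted as token-level features rather than final labels.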
ORG | Organizations (such as military or civilian bodies) and ranks/positions within them (pangkat/jabatan). Useful for classifying pseudo-LOCs. Example: "BPBD together with TNI…" (BPBD bersama TNI…); "Governor of East Java…" (Gubernur Jawa Timur…) |
LOC | Locations or toponyms that are GPE (geopolitical entity) administrative units, ranging from the lowest administrative level to the highest (village, subdistrict, municipality/city, province, country). Pseudo-locations are labeled PLOC. Example: "The flood again submerged 11 villages in Gandusari Subdistrict, Trenggalek Regency." (Banjir kembali merendam 11 desa di kecamatan Gandusari Kabupaten Trenggalek.) |
O | Other Entities |
| Type | Label | Count | Type | Label | Count |
|---|---|---|---|---|---|
| Entity | LOC | 454 | Argument | Duration-Arg | 31 |
| | PLOC | 627 | | AffectedVehicle-Arg | 23 |
| | EVE | 700 | | AffectedFacility-Arg | 8 |
| | ARG | 2016 | | AffectedField-Arg | 10 |
| | ORG | 571 | | AffectedPeople-Arg | 15 |
| Event | ACCIDENT-EVENT | 131 | | AffectedVillage-Arg | 35 |
| | FLOOD-EVENT | 207 | | AffectedRT-Arg | 4 |
| | QUAKE-EVENT | 128 | | AffectedFacility-Arg | 8 |
| | FIRE-EVENT | 180 | | Time-Arg | 422 |
| | LANDSLIDE-EVENT | 18 | | Published-Arg | 79 |
| | MEETING-EVENT | 4 | | Reporter-Arg | 39 |
| | JAM-EVENT | 16 | | Evacuee-Arg | 20 |
| Argument | Vehicle-Arg | 121 | | Spot-Arg | 9 |
| | Hospital-Arg | 50 | | DeathVictim-Arg | 201 |
| | Place-Arg | 1070 | | WoundVictim-Arg | 29 |
| | Street-Arg | 159 | | MonetaryLoss-Arg | 3 |
| | Cause-Arg | 52 | | OfficerOfficial-Arg | 408 |
| | FromTo-Arg | 57 | | Depth-Arg | 28 |
| | Plate-Arg | 70 | | Central-Arg | 89 |
| Category | Feature Code | Type and Source | Features |
|---|---|---|---|
| Event Keywords | event | Semantic relatedness from ATM | Binary feature Is_Event_Keyword(w): whether a word is included in the shortlisted trigger words. Composed of four (geospatial) event word-bigram lists: flood, quake, fire, and accident. |
| Smallest Administrative Level (SAL) | sal | Geographical feature | Is_SAL(t): whether a mentioned toponym has the smallest administrative level in the document |
| Gazetteer Detection | gaz | Geographical resource: GADM database + US_Cities | Is_Toponym(w): whether a word is listed in the hierarchical gazetteer |
| Regex for detecting arguments | arg_regex | Semantic similarity from Word2Vec & semantic relatedness from ATM | Regex rules to detect patterns of these types: 1. Is_Time 2. Is_Plate_Number 3. Is_Coordinate 4. Is_Numeric 5. Is_Road 6. Is_Geographical 7. Is_Date 8. Is_Day 9. Is_Vehicle |
| Regex for detecting organizational entities | org_regex | Semantic similarity from Word2Vec | Regex rules to detect patterns of these types: 1. Place types 2. Public office positions |
| POS Tag | postag | Syntactic resource: INANLP | 1. First-level POS tag (e.g., NN) 2. Full-level POS tag (e.g., NNP) 3. Word form: is_Upper 4. Word form: is_Digit 5. Word form: is_TitleCase |
| Entity | entity | Output labels from entity extraction step | LOC, EVE, ARG, ORG |
| Event | event | Output labels from event trigger classification step | FLOOD-EVENT, FIRE-EVENT, QUAKE-EVENT, ACCIDENT-EVENT |
| Argument | arg | Output labels from argument extraction step | (see Table 3) |
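As a rough illustration of how rows of this feature table could translate into a per-token feature dictionary for a CRF or LSTM-CRF tagger — the function and feature names below are assumptions for the sketch, not the authors' code:

```python
# Minimal sketch: build a feature dict for token i, mirroring the
# gazetteer, event-keyword, and word-form features in the table above.
def token_features(tokens, i, gazetteer, event_keywords):
    w = tokens[i]
    feats = {
        "word.lower": w.lower(),
        "is_upper": w.isupper(),              # Word form: is_Upper
        "is_digit": w.isdigit(),              # Word form: is_Digit
        "is_title": w.istitle(),              # Word form: is_TitleCase
        "gaz": w.lower() in gazetteer,        # Is_Toponym(w)
        "event": w.lower() in event_keywords, # Is_Event_Keyword(w)
    }
    if i > 0:  # simple context feature, common in CRF taggers
        feats["prev.lower"] = tokens[i - 1].lower()
    return feats
```

In a real pipeline the regex-rule outputs (arg_regex, org_regex) and the POS tags from INANLP would be added to the same dictionary.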
| Semantic Similarity Vector (Top 5) | Semantic Relatedness Vector (Top 5) | Bigrams and Count Vector (Top 5) |
|---|---|---|
| semantic similarity of “quake”/gempa | semantic relatedness of “quake”/gempa | bigrams of “quake”/gempa |

(The top-5 vector entries themselves are not recoverable from the extracted text.)
Baseline: LSTM-CRF with Gaz + Postag features. Proposed: LSTM-CRF with Org_Regex + Arg_Regex + Ev_Keywords + Gaz + Postag features.

| Entity | P (baseline) | R (baseline) | F1 (baseline) | P (proposed) | R (proposed) | F1 (proposed) |
|---|---|---|---|---|---|---|
| LOC | 0.929 | 0.897 | 0.912 | 0.951 | 0.897 | 0.923 |
| ARG | 0.762 | 0.767 | 0.764 | 0.857 | 0.709 | 0.776 |
| ORG | 0.697 | 0.847 | 0.765 | 0.787 | 0.847 | 0.816 |
| EVE | 0.850 | 0.888 | 0.869 | 0.855 | 0.925 | 0.889 |
| micro avg | 0.797 | 0.830 | 0.813 | 0.863 | 0.811 | 0.836 |
| macro avg | 0.809 | 0.850 | 0.828 | 0.863 | 0.845 | 0.851 |
| weighted avg | 0.801 | 0.830 | 0.814 | 0.865 | 0.811 | 0.834 |
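The micro, macro, and weighted averages reported in these tables follow the standard definitions: the macro average is the unweighted mean of per-class scores, while the weighted average weights each class by its support, so frequent classes such as Place-Arg dominate it. A small sanity-check sketch with illustrative values (not taken from the tables):

```python
# Standard averaging schemes over per-class scores.
def macro_avg(scores):
    """Unweighted mean: every class counts equally."""
    return sum(scores) / len(scores)

def weighted_avg(scores, supports):
    """Support-weighted mean: frequent classes dominate."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)

f1 = [0.9, 0.7]      # illustrative per-class F1 scores
support = [90, 10]   # illustrative class supports
macro = macro_avg(f1)                 # 0.8: classes weighted equally
weighted = weighted_avg(f1, support)  # pulled toward the large class
```

This explains why macro and weighted averages can diverge when rare labels (e.g., MonetaryLoss-Arg, with support 3) score very differently from common ones.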
Baseline: LSTM-CRF with Gaz + Postag features. Proposed: LSTM-CRF with Entity features.

| Event | P (baseline) | R (baseline) | F1 (baseline) | P (proposed) | R (proposed) | F1 (proposed) |
|---|---|---|---|---|---|---|
| ACCIDENT-EVENT | 1.000 | 1.000 | 1.000 | 0.962 | 1.000 | 0.980 |
| FIRE-EVENT | 0.806 | 0.967 | 0.879 | 0.968 | 1.000 | 0.984 |
| FLOOD-EVENT | 0.886 | 0.861 | 0.873 | 1.000 | 0.972 | 0.986 |
| QUAKE-EVENT | 0.885 | 0.793 | 0.836 | 1.000 | 1.000 | 1.000 |
| micro avg | 0.885 | 0.900 | 0.893 | 0.983 | 0.992 | 0.988 |
| macro avg | 0.894 | 0.905 | 0.897 | 0.982 | 0.993 | 0.987 |
| weighted avg | 0.889 | 0.900 | 0.892 | 0.984 | 0.992 | 0.988 |
Baseline: CRF with Gaz + Postag features. Proposed: CRF with Entity + Event features.

| Argument | P (baseline) | R (baseline) | F1 (baseline) | P (proposed) | R (proposed) | F1 (proposed) |
|---|---|---|---|---|---|---|
| DeathVictim-Arg | 0.615 | 0.381 | 0.471 | 0.760 | 0.905 | 0.826 |
| Vehicle-Arg | 0.625 | 0.435 | 0.513 | 1.000 | 0.913 | 0.955 |
| Height-Arg | 0.875 | 0.700 | 0.778 | 0.833 | 1.000 | 0.909 |
| OfficerOfficial-Arg | 0.711 | 0.678 | 0.694 | 0.849 | 0.839 | 0.844 |
| Time-Arg | 0.927 | 0.962 | 0.944 | 1.000 | 1.000 | 1.000 |
| Place-Arg | 0.873 | 0.832 | 0.852 | 0.900 | 0.884 | 0.892 |
| Street-Arg | 0.708 | 0.810 | 0.756 | 0.526 | 0.952 | 0.678 |
| Strength-Arg | 0.952 | 1.000 | 0.976 | 1.000 | 1.000 | 1.000 |
| micro avg | 0.822 | 0.766 | 0.793 | 0.869 | 0.912 | 0.890 |
| macro avg | 0.786 | 0.725 | 0.748 | 0.859 | 0.937 | 0.888 |
| weighted avg | 0.812 | 0.766 | 0.785 | 0.884 | 0.912 | 0.894 |
Baseline: LSTM-CRF with Gaz + Postag features. Proposed: LSTM-CRF with Sal + Event + Arg features.

| Tag | P (baseline) | R (baseline) | F1 (baseline) | P (proposed) | R (proposed) | F1 (proposed) |
|---|---|---|---|---|---|---|
| PLOC | 0.835 | 0.784 | 0.809 | 0.971 | 0.879 | 0.923 |
| LOC | 0.528 | 0.655 | 0.585 | 0.809 | 0.948 | 0.873 |
| micro avg | 0.713 | 0.741 | 0.727 | 0.908 | 0.902 | 0.905 |
| macro avg | 0.681 | 0.720 | 0.697 | 0.890 | 0.914 | 0.898 |
| weighted avg | 0.733 | 0.741 | 0.734 | 0.917 | 0.902 | 0.906 |
| Top Words | Model | K | Coherence (Flood Topic) | Coherence (Quake Topic) |
|---|---|---|---|---|
| 20 | Labeled LDA (LLDA) (15 K only) | 2588 | −201.01 | −187.46 |
| 20 | Aggregated Topic Model (ATM) | 44,280 | −393.96 | −394.31 |
| 20 | LDA Gibbs, K = 100 | 100 | −421.87 | −424.72 |
| 20 | LDA Gibbs, K = 600 | 600 | −285.51 | −397.41 |
| 20 | LDA VB, K = 600 | 600 | −453.82 | −417.77 |
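The coherence values above are log-probability-based, so scores closer to zero indicate more coherent topics. Assuming a UMass-style document co-occurrence coherence (the common choice for such intrinsic scores; the paper's exact formula is not reproduced here), a minimal sketch:

```python
import math

def umass_coherence(top_words, docs, eps=1.0):
    """UMass coherence: sum over ordered word pairs (w_i, w_j), i > j, of
    log((D(w_i, w_j) + eps) / D(w_j)), where D counts documents
    containing the given word(s). Assumes every top word occurs in docs."""
    doc_sets = [set(d) for d in docs]
    def df(*words):
        return sum(all(w in s for w in words) for s in doc_sets)
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += math.log((df(top_words[i], top_words[j]) + eps) / df(top_words[j]))
    return score
```

Word pairs that never co-occur contribute large negative terms, which is why topic models with poorly separated topics drift toward strongly negative totals like those in the table.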
| Algorithm | Spatial Minimality | SMCD-ADM |
|---|---|---|
| Toponyms tested | 791 | 792 |
| Correct disambiguation | 561 | 588 |
| Accuracy | 0.70 | 0.74 |
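Spatial minimality resolves each ambiguous toponym by choosing the candidate interpretation that minimizes the geographic spread of all toponyms mentioned in the document. A brute-force sketch of that baseline — using Euclidean distance in degrees as a simplification, and not reproducing the SMCD-ADM variant:

```python
import math
from itertools import product

def dist(a, b):
    # Euclidean distance in (lat, lon) degrees: adequate for ranking
    # nearby candidate sets, not a true geodesic distance.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def spatial_minimality(candidates_per_toponym):
    """Pick one candidate coordinate per toponym so that the summed
    pairwise distance of the chosen set is minimal (exhaustive search)."""
    best, best_cost = None, float("inf")
    for combo in product(*candidates_per_toponym):
        cost = sum(dist(a, b)
                   for i, a in enumerate(combo)
                   for b in combo[i + 1:])
        if cost < best_cost:
            best, best_cost = combo, cost
    return list(best)
```

For documents with many ambiguous toponyms the exhaustive product grows combinatorially, which is one motivation for heuristic refinements such as preferring the smallest-administrative-level anchor used elsewhere in this system.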
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Dewandaru, A.; Widyantoro, D.H.; Akbar, S. Event Geoparser with Pseudo-Location Entity Identification and Numerical Argument Extraction Implementation and Evaluation in Indonesian News Domain. ISPRS Int. J. Geo-Inf. 2020, 9, 712. https://doi.org/10.3390/ijgi9120712