DOI: 10.1145/3442381.3449923

Generating Accurate Caption Units for Figure Captioning

Published: 03 June 2021

Abstract

Scientific-style figures are commonly used on the web to present numerical information. Captions that convey accurate figure information and sound natural would significantly improve figure accessibility. In this paper, we present promising results on machine figure captioning. A recent corpus analysis of real-world captions reveals that machine figure captioning systems should start by generating accurate caption units. We formulate caption unit generation as a controlled captioning problem: given a caption unit type as a control signal, a model generates an accurate caption unit of that type. As a proof of concept on single bar charts, we propose a model, FigJAM, that achieves this goal by utilizing metadata information and a joint static and dynamic dictionary. Quantitative evaluations on two datasets from the figure question answering task show that our model generates more accurate caption units than competitive baseline models. A user study with ten human experts confirms the value of machine-generated caption units in their standalone accuracy and naturalness. Finally, a post-editing simulation study demonstrates the potential for models to paraphrase and stitch together single-type caption units into multi-type captions by learning from data.
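The abstract's two key mechanisms, conditioning generation on a caption unit type and decoding over a joint static and dynamic dictionary, can be illustrated with a short sketch. The code below is not the authors' FigJAM implementation; it is a minimal, hypothetical PyTorch decoder step in the style of pointer-generator networks, where the dynamic dictionary holds encoded chart-metadata tokens (e.g., bar labels and values), a learned gate mixes generation from the static vocabulary with copying from the metadata, and an embedding of the requested unit type serves as the control signal. All names, sizes, and ids (JointDictionaryStep, unit-type id 3, and so on) are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDictionaryStep(nn.Module):
    # One decoder step over a joint static + dynamic dictionary
    # (an illustrative sketch, not the authors' FigJAM implementation).
    def __init__(self, hidden: int, static_vocab: int, n_unit_types: int):
        super().__init__()
        self.unit_type_emb = nn.Embedding(n_unit_types, hidden)  # control signal
        self.gen_proj = nn.Linear(hidden, static_vocab)          # static-dictionary logits
        self.copy_gate = nn.Linear(hidden, 1)                    # P(generate) vs. P(copy)
        self.static_vocab = static_vocab

    def forward(self, dec_state, unit_type, meta_states, meta_ids):
        # dec_state:   (B, H)    decoder hidden state
        # unit_type:   (B,)      requested caption-unit-type ids (the control signal)
        # meta_states: (B, M, H) encoded chart-metadata tokens (dynamic dictionary)
        # meta_ids:    (B, M)    ids of the metadata tokens in the extended vocabulary
        h = dec_state + self.unit_type_emb(unit_type)            # condition on unit type
        p_gen = torch.sigmoid(self.copy_gate(h))                 # (B, 1) mixing gate
        gen = F.softmax(self.gen_proj(h), dim=-1)                # generate from static dict
        copy = F.softmax(torch.bmm(meta_states, h.unsqueeze(-1)).squeeze(-1), dim=-1)
        out = torch.zeros(h.size(0), self.static_vocab + meta_states.size(1))
        out[:, :self.static_vocab] = p_gen * gen
        out.scatter_add_(1, meta_ids, (1 - p_gen) * copy)        # copy metadata tokens
        return out  # (B, static_vocab + M) probabilities over the joint dictionary

# Toy usage: one decoding step for a hypothetical "max" unit type (assumed id 3).
step = JointDictionaryStep(hidden=16, static_vocab=100, n_unit_types=5)
meta_ids = torch.arange(100, 104).repeat(2, 1)                  # 4 metadata tokens per chart
probs = step(torch.randn(2, 16), torch.tensor([3, 3]),
             torch.randn(2, 4, 16), meta_ids)
assert torch.allclose(probs.sum(-1), torch.ones(2))             # each row is a distribution

In this sketch, copying exact labels and values from the chart metadata is what would make generated units numerically accurate, while the unit-type embedding steers which kind of unit (e.g., an extremum or a comparison) is produced.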





Published In

WWW '21: Proceedings of the Web Conference 2021
April 2021
4054 pages
ISBN:9781450383127
DOI:10.1145/3442381
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data visualization
  2. figure question answering
  3. image captioning
  4. text generation
  5. web accessibility

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '21: The Web Conference 2021
April 19 - 23, 2021
Ljubljana, Slovenia

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

  • (2024) “Malicious” Pictorials: How Alt Text Matters to Screen Reader Users' Experience of Image-Dense Media. Proceedings of the 2024 ACM Designing Interactive Systems Conference, 1262-1274. DOI: 10.1145/3643834.3660747. Online publication date: 1-Jul-2024.
  • (2024) SciCapenter: Supporting Caption Composition for Scientific Figures with Machine-Generated Captions and Ratings. Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, 1-9. DOI: 10.1145/3613905.3650738. Online publication date: 11-May-2024.
  • (2024) Natural Language Dataset Generation Framework for Visualizations Powered by Large Language Models. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-22. DOI: 10.1145/3613904.3642943. Online publication date: 11-May-2024.
  • (2024) VisCollage: Annotative Collages for Organizing Data Event Charts. 2024 IEEE 17th Pacific Visualization Conference (PacificVis), 262-271. DOI: 10.1109/PacificVis60374.2024.00036. Online publication date: 23-Apr-2024.
  • (2023) Comparing Natural Language and Vibro-Audio Modalities for Inclusive STEM Learning with Blind and Low Vision Users. Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility, 1-17. DOI: 10.1145/3597638.3608429. Online publication date: 22-Oct-2023.
  • (2023) WebSHAP: Towards Explaining Any Machine Learning Models Anywhere. Companion Proceedings of the ACM Web Conference 2023, 262-266. DOI: 10.1145/3543873.3587362. Online publication date: 30-Apr-2023.
  • (2023) Augmenting Visualizations with Predictive and Investigative Insights to Facilitate Decision Making. Companion Proceedings of the ACM Web Conference 2023, 77-81. DOI: 10.1145/3543873.3587317. Online publication date: 30-Apr-2023.
  • (2023) A Topic-aware Summarization Framework with Different Modal Side Information. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1416-1425. DOI: 10.1145/3539618.3591630. Online publication date: 19-Jul-2023.
  • (2023) What Is the Difference Between a Mountain and a Molehill? Quantifying Semantic Labeling of Visual Features in Line Charts. 2023 IEEE Visualization and Visual Analytics (VIS), 161-165. DOI: 10.1109/VIS54172.2023.00041. Online publication date: 21-Oct-2023.
  • (2023) EC: A Tool for Guiding Chart and Caption Emphasis. IEEE Transactions on Visualization and Computer Graphics 30(1), 120-130. DOI: 10.1109/TVCG.2023.3327150. Online publication date: 3-Nov-2023.
