DOI: 10.1145/3442381.3449923

Generating Accurate Caption Units for Figure Captioning

Published: 03 June 2021

Abstract

Scientific-style figures are commonly used on the web to present numerical information. Captions that convey accurate figure information and sound natural would significantly improve figure accessibility. In this paper, we present promising results on machine figure captioning. A recent corpus analysis of real-world captions reveals that machine figure captioning systems should start by generating accurate caption units. We formulate caption unit generation as a controlled captioning problem: given a caption unit type as a control signal, a model generates an accurate caption unit of that type. As a proof of concept on single bar charts, we propose a model, FigJAM, that achieves this goal by utilizing metadata information and a joint static and dynamic dictionary. Quantitative evaluations on two datasets from the figure question answering task show that our model generates more accurate caption units than competitive baseline models. A user study with ten human experts confirms the value of machine-generated caption units in their standalone accuracy and naturalness. Finally, a post-editing simulation study demonstrates the potential for models to paraphrase and stitch together single-type caption units into multi-type captions by learning from data.
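The abstract's two key mechanisms, conditioning generation on a caption unit type and decoding over a joint static and dynamic dictionary, can be illustrated with a short sketch. The code below is not the authors' FigJAM implementation; it is a minimal, hypothetical PyTorch decoder step in the style of pointer-generator networks, where the dynamic dictionary holds encoded chart-metadata tokens (e.g., bar labels and values), a learned gate mixes generation from the static vocabulary with copying from the metadata, and an embedding of the requested unit type serves as the control signal. All names, sizes, and ids (JointDictionaryStep, unit-type id 3, and so on) are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDictionaryStep(nn.Module):
    # One decoder step over a joint static + dynamic dictionary
    # (an illustrative sketch, not the authors' FigJAM implementation).
    def __init__(self, hidden: int, static_vocab: int, n_unit_types: int):
        super().__init__()
        self.unit_type_emb = nn.Embedding(n_unit_types, hidden)  # control signal
        self.gen_proj = nn.Linear(hidden, static_vocab)          # static-dictionary logits
        self.copy_gate = nn.Linear(hidden, 1)                    # P(generate) vs. P(copy)
        self.static_vocab = static_vocab

    def forward(self, dec_state, unit_type, meta_states, meta_ids):
        # dec_state:   (B, H)    decoder hidden state
        # unit_type:   (B,)      requested caption-unit-type ids (the control signal)
        # meta_states: (B, M, H) encoded chart-metadata tokens (dynamic dictionary)
        # meta_ids:    (B, M)    ids of the metadata tokens in the extended vocabulary
        h = dec_state + self.unit_type_emb(unit_type)            # condition on unit type
        p_gen = torch.sigmoid(self.copy_gate(h))                 # (B, 1) mixing gate
        gen = F.softmax(self.gen_proj(h), dim=-1)                # generate from static dict
        copy = F.softmax(torch.bmm(meta_states, h.unsqueeze(-1)).squeeze(-1), dim=-1)
        out = torch.zeros(h.size(0), self.static_vocab + meta_states.size(1))
        out[:, :self.static_vocab] = p_gen * gen
        out.scatter_add_(1, meta_ids, (1 - p_gen) * copy)        # copy metadata tokens
        return out  # (B, static_vocab + M) probabilities over the joint dictionary

# Toy usage: one decoding step for a hypothetical "max" unit type (assumed id 3).
step = JointDictionaryStep(hidden=16, static_vocab=100, n_unit_types=5)
meta_ids = torch.arange(100, 104).repeat(2, 1)                  # 4 metadata tokens per chart
probs = step(torch.randn(2, 16), torch.tensor([3, 3]),
             torch.randn(2, 4, 16), meta_ids)
assert torch.allclose(probs.sum(-1), torch.ones(2))             # each row is a distribution

In this sketch, copying exact labels and values from the chart metadata is what would make generated units numerically accurate, while the unit-type embedding steers which kind of unit (e.g., an extremum or a comparison) is produced.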





Published In

WWW '21: Proceedings of the Web Conference 2021
April 2021
4054 pages
ISBN:9781450383127
DOI:10.1145/3442381
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data visualization
  2. figure question answering
  3. image captioning
  4. text generation
  5. web accessibility

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '21: The Web Conference 2021
April 19 - 23, 2021
Ljubljana, Slovenia

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

  • (2024) “Malicious” Pictorials: How Alt Text Matters to Screen Reader Users' Experience of Image-Dense Media. Proceedings of the 2024 ACM Designing Interactive Systems Conference, 1262-1274. DOI: 10.1145/3643834.3660747. Online publication date: 1-Jul-2024.
  • (2024) SciCapenter: Supporting Caption Composition for Scientific Figures with Machine-Generated Captions and Ratings. Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, 1-9. DOI: 10.1145/3613905.3650738. Online publication date: 11-May-2024.
  • (2024) Natural Language Dataset Generation Framework for Visualizations Powered by Large Language Models. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-22. DOI: 10.1145/3613904.3642943. Online publication date: 11-May-2024.
  • (2024) VisCollage: Annotative Collages for Organizing Data Event Charts. 2024 IEEE 17th Pacific Visualization Conference (PacificVis), 262-271. DOI: 10.1109/PacificVis60374.2024.00036. Online publication date: 23-Apr-2024.
  • (2023) Comparing Natural Language and Vibro-Audio Modalities for Inclusive STEM Learning with Blind and Low Vision Users. Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility, 1-17. DOI: 10.1145/3597638.3608429. Online publication date: 22-Oct-2023.
  • (2023) WebSHAP: Towards Explaining Any Machine Learning Models Anywhere. Companion Proceedings of the ACM Web Conference 2023, 262-266. DOI: 10.1145/3543873.3587362. Online publication date: 30-Apr-2023.
  • (2023) Augmenting Visualizations with Predictive and Investigative Insights to Facilitate Decision Making. Companion Proceedings of the ACM Web Conference 2023, 77-81. DOI: 10.1145/3543873.3587317. Online publication date: 30-Apr-2023.
  • (2023) A Topic-aware Summarization Framework with Different Modal Side Information. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1416-1425. DOI: 10.1145/3539618.3591630. Online publication date: 19-Jul-2023.
  • (2023) What Is the Difference Between a Mountain and a Molehill? Quantifying Semantic Labeling of Visual Features in Line Charts. 2023 IEEE Visualization and Visual Analytics (VIS), 161-165. DOI: 10.1109/VIS54172.2023.00041. Online publication date: 21-Oct-2023.
  • (2023) EC: A Tool for Guiding Chart and Caption Emphasis. IEEE Transactions on Visualization and Computer Graphics 30(1), 120-130. DOI: 10.1109/TVCG.2023.3327150. Online publication date: 3-Nov-2023.
