research-article

Extracting code from programming tutorial videos

Authors:

Eran Yahav

Onward! 2016: Proceedings of the 2016 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software

Pages 98 - 111

https://doi.org/10.1145/2986012.2986021

Published: 20 October 2016

Abstract

The number of programming tutorial videos on the web increases daily. Video hosting sites such as YouTube host millions of video lectures, with many programming tutorials for various languages and platforms. These videos contain a wealth of valuable information, including code that may be of interest. However, two main challenges have so far prevented the effective indexing of programming tutorial videos: (i) code in tutorials is typically written on-the-fly, with only parts of the code visible in each frame, and (ii) optical character recognition (OCR) is not precise enough to produce quality results from videos. We present a novel approach for extracting code from videos that is based on: (i) consolidating code across frames, and (ii) statistical language models for applying corrections at different levels, allowing us to make corrections by choosing the most likely token, combination of tokens that form a likely line structure, and combination of lines that lead to a likely code fragment in a particular language. We implemented our approach in a tool called ACE, and used it to extract code from 40 Android video tutorials on YouTube. Our evaluation shows that ACE extracts code with high accuracy, enabling deep indexing of video tutorials.
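
The token-level part of this idea can be illustrated with a small sketch. It is not ACE's implementation: the toy corpus, the add-one-smoothed bigram model, the scoring rule (log of frame votes plus log bigram probability), and the function names `train_bigram_counts`, `bigram_logprob`, and `consolidate` are all assumptions made for illustration. The sketch takes several noisy OCR readings of the same code line from different frames and, for each token position, picks the candidate that is best supported by both the frames and the language model. The line-level and fragment-level corrections described in the abstract are not covered.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the code a real language model
# would be trained on. Tokens are separated by spaces for simplicity.
CORPUS = [
    "int x = 0 ;",
    "int y = x + 1 ;",
    "String s = null ;",
    "for ( int i = 0 ; i < n ; i ++ )",
]


def train_bigram_counts(lines):
    """Count unigrams and bigrams, with "<s>" marking the start of a line."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        tokens = ["<s>"] + line.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_logprob(prev, tok, unigrams, bigrams):
    """Log P(tok | prev) with add-one smoothing over the known vocabulary."""
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab_size))


def consolidate(frame_readings, unigrams, bigrams):
    """Merge OCR readings of one line taken from several frames.

    Assumption: readings with the most common token count are aligned
    position by position; readings of other lengths are ignored.
    """
    tokenized = [reading.split() for reading in frame_readings]
    length = Counter(len(t) for t in tokenized).most_common(1)[0][0]
    tokenized = [t for t in tokenized if len(t) == length]

    result, prev = [], "<s>"
    for pos in range(length):
        votes = Counter(t[pos] for t in tokenized)
        # Score = evidence from frames + likelihood under the language model.
        best = max(
            votes,
            key=lambda c: math.log(votes[c])
            + bigram_logprob(prev, c, unigrams, bigrams),
        )
        result.append(best)
        prev = best
    return " ".join(result)


if __name__ == "__main__":
    unigrams, bigrams = train_bigram_counts(CORPUS)
    readings = ["lnt x = O ;", "lnt x = 0 ;", "int x = 0 ,"]
    print(consolidate(readings, unigrams, bigrams))  # prints "int x = 0 ;"
```

In this example the misreading `lnt` appears in two of three frames, so a plain majority vote would keep it; the language model score is what tips the choice to `int`. A real system would need a proper tokenizer, alignment of readings with differing token counts, and a model trained on a large corpus.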

Index Terms

Extracting code from programming tutorial videos
1. Social and professional topics
  1. Professional topics
    1. Computing education
      1. Computing education programs
        Computer science education
        Information science education

Information & Contributors

Information

Published In

Onward! 2016: Proceedings of the 2016 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software

October 2016

268 pages

ISBN: 9781450340762

DOI: 10.1145/2986012

General Chair: Eelco Visser (Delft University of Technology, Netherlands)

Program Chairs: Emerson Murphy-Hill (North Carolina State University, USA), Crista Lopes (University of California at Irvine, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

In-Cooperation

SIGAda: ACM Special Interest Group on Ada Programming Language

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SPLASH '16

Sponsor:

SIGPLAN

SPLASH '16: Conference on Systems, Programming, Languages, and Applications: Software for Humanity

November 2 - 4, 2016

Amsterdam, Netherlands

Acceptance Rates

Overall Acceptance Rate 40 of 105 submissions, 38%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 40
Total Downloads: 422
Downloads (Last 12 months): 36
Downloads (Last 6 weeks): 5

Reflects downloads up to 22 Nov 2024

Citations

Cited By

Alahmadi M, Alshangiti M (2024). Optimizing OCR Performance for Programming Videos: The Role of Image Super-Resolution and Large Language Models. Mathematics 12:7 (1036). Online publication date: 30-Mar-2024.
https://doi.org/10.3390/math12071036
Ouyang Y, Shen L, Wang Y, Li Q (2024). NotePlayer: Engaging Computational Notebooks for Dynamic Presentation of Analytical Processes. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (1-20). Online publication date: 13-Oct-2024.
https://dl.acm.org/doi/10.1145/3654777.3676410
Alahmadi M (2023). VID2XML: Automatic Extraction of a Complete XML Data From Mobile Programming Screencasts. IEEE Transactions on Software Engineering 49:4 (1726-1740). Online publication date: 1-Apr-2023.
https://doi.org/10.1109/TSE.2022.3188898
Malkadi A, Tayeb A, Haiduc S (2023). Improving Code Extraction from Coding Screencasts Using a Code-Aware Encoder-Decoder Model. 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) (1492-1504). Online publication date: 11-Sep-2023.
https://doi.org/10.1109/ASE56229.2023.00184
Alghamdi O, Clinch S, Skeva R, Jay C (2023). How are websites used during development and what are the implications for the coding process? Journal of Systems and Software 205 (111803). Online publication date: Nov-2023.
https://doi.org/10.1016/j.jss.2023.111803
Yang C, Thung F, Lo D (2022). Efficient Search of Live-Coding Screencasts from Online Videos. 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (73-77). Online publication date: Mar-2022.
https://doi.org/10.1109/SANER53432.2022.00021
Moslehi P, Rilling J, Adams B (2022). A user survey on the adoption of crowd-based software engineering instructional screencasts by the new generation of software developers. Journal of Systems and Software 185:C. Online publication date: 6-May-2022.
https://dl.acm.org/doi/10.1016/j.jss.2021.111144
Del Bonifro F, Gabbrielli M, Lategano A, Zacchiroli S (2021). Image-based many-language programming language identification. PeerJ Computer Science 7 (e631). Online publication date: 23-Jul-2021.
https://doi.org/10.7717/peerj-cs.631
Chattopadhyay S, Ford D, Zimmermann T (2021). Developers Who Vlog: Dismantling Stereotypes through Community and Identity. Proceedings of the ACM on Human-Computer Interaction 5:CSCW2 (1-33). Online publication date: 18-Oct-2021.
https://dl.acm.org/doi/10.1145/3479530
Chattopadhyay S, Zimmermann T, Ford D, Spinellis D, Gousios G, Chechik M, Di Penta M (2021). Reel life vs. real life: how software developers share their daily life through vlogs. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (404-415). Online publication date: 20-Aug-2021.
https://dl.acm.org/doi/10.1145/3468264.3468599
