DOI: 10.1145/3284869.3284879

Extracting Learning Outcomes Using Machine Learning and White Space Analysis

Published: 28 November 2018

Abstract

The increasing use of Portable Document Format (PDF) files has prompted research into analysing their layout for text extraction. In this paper, we propose an algorithm that extracts text from PDF documents while taking document layout into account. Using this algorithm, we extract learning outcomes from academic course outlines. The research aims to automate the assignment of credits to transfer students, which is currently done manually. The system shows promising results, achieving an accuracy of 81.8%. The algorithm has a wide scope of application and is a step towards automating text extraction from PDF documents.
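
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of one common form of white space analysis, not the authors' method. It assumes text lines have already been extracted with their vertical positions, and it starts a new block wherever the gap between two consecutive lines is clearly larger than the typical gap. The function name `split_blocks`, the `(top, bottom, text)` input format, and the `factor` threshold are all hypothetical choices made for this example.

```python
from statistics import median


def split_blocks(lines, factor=1.5):
    """Group text lines into blocks separated by unusually large vertical gaps.

    lines:  list of (top, bottom, text) tuples sorted top to bottom, with y
            growing downward (a hypothetical input format for this sketch).
    factor: a gap larger than factor * median gap starts a new block
            (an assumed heuristic, not a value taken from the paper).
    """
    if not lines:
        return []
    # White space between each line and the line that follows it.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(lines, lines[1:])]
    typical = median(gaps) if gaps else 0.0
    blocks = [[lines[0][2]]]
    for gap, line in zip(gaps, lines[1:]):
        if gap > factor * typical:
            blocks.append([line[2]])   # large gap: begin a new block
        else:
            blocks[-1].append(line[2])  # normal spacing: same block
    return blocks


# Made-up coordinates resembling a course outline page.
page = [
    (0, 10, "Course Outline"),
    (30, 40, "Learning Outcomes"),
    (42, 52, "- Explain X"),
    (54, 64, "- Apply Y"),
    (90, 100, "Grading"),
]
blocks = split_blocks(page)
```

With these made-up coordinates, the heading, the learning outcomes section, and the grading section end up in separate blocks. A real pipeline would also need horizontal gaps to handle multi-column layouts, and a classifier to decide which block actually contains learning outcomes.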

Cited By

  • A supervised learning approach for heading detection. Expert Systems 37:4 (2020). DOI: 10.1111/exsy.12520. Online publication date: 8 January 2020.
  • Syntactic, Semantic and Sentiment Analysis: The Joint Effect on Automated Essay Evaluation. IEEE Access 7 (2019), 108486-108503. DOI: 10.1109/ACCESS.2019.2933354. Online publication date: 2019.

    Published In

    Goodtechs '18: Proceedings of the 4th EAI International Conference on Smart Objects and Technologies for Social Good
    November 2018
    316 pages
    ISBN:9781450365819
    DOI:10.1145/3284869
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    In-Cooperation

    • EAI: The European Alliance for Innovation

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Document Analysis
    2. Supervised Learning
    3. Text Extraction
    4. White Space Analysis

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    Goodtechs '18
