Abstract
Over the last few years, CAPTCHAs have become ubiquitous on the internet as a security mechanism to distinguish humans from spam bots. Text-based CAPTCHAs ask users to recognize distorted text in challenge images. Because they are based on a hard AI problem, they have emerged as a hot research topic in computer vision and machine learning. Contemporary text-based CAPTCHAs rely on the segmentation problem: decomposing the image into sub-images of individual characters. This is a challenging task for current OCR programs and remains largely unsolved. In this paper, we present a novel segmentation and recognition method for text-based CAPTCHAs that combines simple image processing techniques, including thresholding, thinning and pixel count methods, with an artificial neural network. We attack the popular CCT (Crowded Characters Together) based CAPTCHAs and compare our results with other schemes. Overall, our system achieves a precision of 51.3%, 27.1% and 53.2% on the Taobao, MSN and eBay datasets, which contain 1000, 500 and 1000 CAPTCHAs respectively. The benefits of this research are twofold: by recognizing text-based CAPTCHAs, we not only expose weaknesses in current designs but also find a way to segment and recognize connected characters in images. The proposed algorithm can be used in the digitization of ancient books, handwriting recognition and other similar tasks.
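To make the thresholding and pixel count steps concrete, the following is a minimal sketch and not the authors' implementation. It assumes dark text on a light background, uses Otsu's method as the thresholding rule (an assumption; the abstract does not name one), and omits the thinning and neural network stages. The function names and the `max_ink` parameter are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows, pixel values in 0..255


def otsu_threshold(img: Image) -> int:
    """Pick the gray level that maximizes between-class variance."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels at or below t yet
        w1 = total - w0
        if w1 == 0:
            break  # every pixel is at or below t
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(img: Image, t: int) -> Image:
    """Mark pixels at or below t as ink (1), everything else as background (0)."""
    return [[1 if v <= t else 0 for v in row] for row in img]


def column_ink_counts(binary: Image) -> List[int]:
    """Count ink pixels in each column (a vertical projection)."""
    return [sum(col) for col in zip(*binary)]


def split_on_gaps(counts: List[int], max_ink: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, whose ink count
    exceeds max_ink. Raising max_ink (hypothetical knob) lets thin joints
    between touching characters act as cut points."""
    segments, start = [], None
    for x, c in enumerate(counts):
        if c > max_ink and start is None:
            start = x
        elif c <= max_ink and start is not None:
            segments.append((start, x))
            start = None
    if start is not None:
        segments.append((start, len(counts)))
    return segments


# Toy image: two 2-pixel-wide strokes separated by a blank column.
toy = [[255, 0, 0, 255, 0, 0, 255] for _ in range(4)]
counts = column_ink_counts(binarize(toy, otsu_threshold(toy)))
segments = split_on_gaps(counts)
```

With `max_ink=0` only fully blank columns separate characters, which fails on crowded characters; a small positive `max_ink` treats sparse columns as candidate cuts, which is the intuition behind pixel-count segmentation.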
Cite this article
Hussain, R., Gao, H. & Shaikh, R.A. Segmentation of connected characters in text-based CAPTCHAs for intelligent character recognition. Multimed Tools Appl 76, 25547–25561 (2017). https://doi.org/10.1007/s11042-016-4151-2
DOI: https://doi.org/10.1007/s11042-016-4151-2