Abstract
Typography is a ubiquitous art form that affects our understanding, perception and trust in what we read. Thousands of different font-faces have been created with enormous variations in the characters. In this paper, we learn the style of a font by analyzing a small subset of only four letters. From these four letters, we learn two tasks. The first is a discrimination task: given the four letters and a new candidate letter, does the new letter belong to the same font? Second, given the four basis letters, can we generate all of the other letters with the same characteristics as those in the basis set? We use deep neural networks to address both tasks, measure the results quantitatively and qualitatively in a variety of novel ways, and present a thorough investigation of the weaknesses and strengths of the approach. All of the experiments are conducted with publicly available font sets.
Notes
Wang et al. have studied retrieving fonts found in photographs [55]. Extraction from photographs is not addressed in this paper. However, the underlying task of font retrieval will be presented in the experimental section.
The subjective evaluation was conducted by an independent user experience researcher (UER) volunteer not affiliated with this project. The UER was given a paper copy of the input letters, the generated letters and the actual letter. The UER was asked to evaluate the ‘R’ along the 3 dimensions listed above. Additionally, for control, the UER was also given examples (not shown here) which included real ‘R’s in order to minimize bias. The UER was not paid for this experiment.
Other versions of multi-task learning can incorporate different error metrics or a larger diversity of tasks. In this case, the multiple tasks are closely related, though they still provide the benefit of task transfer.
For completeness, we also analyzed the ‘R’s generated by the one-letter-at-a-time networks. They had similar performance (when measured with D) to the ‘R’ row shown in Table 3, with 6% higher SSE.
This line of inquiry was sparked by discussions with Zhangyang Wang.
The substantial processes of segmenting, cleaning, centering and pre-processing the fonts from photographs are beyond the scope of this paper. We solely address the retrieval portion of this task. We do this by assuming the target font can be cleaned and segmented to yield input grayscale images such as used in this study. For a review of character segmentation, please see [10].
References
10000Fonts.com: Download. http://www.10000fonts.com/catalog/ (2016)
Aucouturier, J.J., Pachet, F.: Representing musical genre: a state of the art. J. New Music Res. 32(1), 83–93 (2003)
Bengio, Y.: Deep learning of representations for unsupervised and transfer learning. Unsupervised Transf. Learn. Chall. Mach Learn. 7, 19 (2012)
Bernhardsson, E.: Analyzing 50k fonts using deep neural networks. http://erikbern.com/2016/01/21/analyzing-50k-fonts-using-deep-neural-networks/ (2016)
Beymer, D., Russell, D., Orton, P.: An eye tracking study of how font size and type influence online reading. In: Proceedings of the 22nd British HCI Group Annual Conference on People and Computers: Culture, Creativity, Interaction-Volume 2, pp. 15–18. British Computer Society, (2008)
Bowey, M.: A 20 minute intro to typography basics. http://design.tutsplus.com/articles/a-20-minute-intro-to-typography-basics--psd-3326 (2009)
Bowey, M.: A fontastic voyage: generative fonts with adversarial networks. http://multithreaded.stitchfix.com/blog/2016/02/02/a-fontastic-voyage (2016)
Campbell, N.D., Kautz, J.: Learning a manifold of fonts. ACM Trans. Graph. (TOG) 33(4), 91 (2014)
Caruana, R.: Multitask learning. Mach. Learn. 28(1), 41–75 (1997)
Casey, R.G., Lecolinet, E.: A survey of methods and strategies in character segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 18(7), 690–706 (1996)
Ciresan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image classification. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3642–3649. IEEE, (2012)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
Denton, E.L., Chintala, S., Szlam, A., Fergus, R.: Deep generative image models using a laplacian pyramid of adversarial networks. In: Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R (eds.) Advances in Neural Information Processing Systems 28, pp. 1486–1494. Curran Associates, Inc. http://papers.nips.cc/paper/5773-deep-generative-image-models-using-a-laplacian-pyramid-of-adversarial-networks.pdf (2015)
Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: Computer Vision–ECCV, pp. 184–199. Springer, (2014)
Dosovitskiy, A., Brox, T.: Generating Images with Perceptual Similarity Metrics based on Deep Networks. ArXiv e-prints arXiv:1602.02644 (2016)
Eck, D., Schmidhuber, J.: A first look at music composition using LSTM recurrent neural networks. Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale 103 (2002)
Feng, J.C., Tse, C., Qiu, Y.: Wavelet-transform-based strategy for generating new chinese fonts. In: Proceedings of the 2003 International Symposium on Circuits and Systems, 2003. ISCAS’03, vol. 4, pp. IV–IV. IEEE, (2003)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al.: Devise: a deep visual-semantic embedding model. In: Advances in Neural Information Processing Systems, pp. 2121–2129. (2013)
Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. CoRR arXiv:1508.06576 (2015)
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680. (2014)
Graves, A.: Generating sequences with recurrent neural networks. CoRR arXiv:1308.0850 (2013)
Graves, A., Wayne, G., Danihelka, I.: Neural turing machines. arXiv preprint arXiv:1410.5401 (2014)
Gregor, K., Danihelka, I., Graves, A., Wierstra, D.: DRAW: a recurrent neural network for image generation. CoRR arXiv:1502.04623 (2015)
Gygli, M., Grabner, H., Riemenschneider, H., Nater, F., Gool, L.: The interestingness of images. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1633–1640 (2013)
Ha, S.: The influence of design factors on trust in a bank’s website. Digital Repository@ Iowa State University, http://lib.dr.iastate.edu/etd/10858/ (2009)
Hassan, T., Hu, C., Hersch, R.D.: Next generation typeface representations: revisiting parametric fonts. In: Proceedings of the 10th ACM symposium on Document engineering, pp. 181–184. ACM, (2010)
Hu, C., Hersch, R.D.: Parameterizable fonts based on shape components. IEEE Comput. Graph. Appl. 21(3), 70–85 (2001)
Huang, J.T., Li, J., Yu, D., Deng, L., Gong, Y.: Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7304–7308. IEEE, (2013)
Hutchings, E.: Typeface timeline shows us the history of fonts. http://www.psfk.com/2012/04/history-of-fonts.html (2014)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
Im, J.D., Kim, C.D., Jiang, H., Memisevic, R.: Generating images with recurrent adversarial networks. arXiv e-prints arXiv:1602.05110 (2016)
Joachims, T.: Making large scale SVM learning practical. Universität Dortmund, Technical report (1999)
Karayev, S., Hertzmann, A., Winnemoeller, H., Agarwala, A., Darrell, T.: Recognizing image style. CoRR arXiv:1311.3715 (2013)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. (2014)
Khosla, A., Das Sarma, A., Hamid, R.: What makes an image popular? In: Proceedings of the 23rd International Conference on World Wide Web, pp. 867–876. ACM, (2014)
Kwok, K.W., Wong, S.M., Lo, K.W., Yam, Y.: Genetic algorithm-based brush stroke generation for replication of Chinese calligraphic character. In: IEEE Congress on Evolutionary Computation, 2006. CEC 2006, pp. 1057–1064. IEEE, (2006)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Li, Y.M., Yeh, Y.S.: Increasing trust in mobile commerce through design aesthetics. Comput. Hum. Behav. 26(4), 673–684 (2010)
Lun, Z., Kalogerakis, E., Sheffer, A.: Elements of style: learning perceptual shape style similarity. ACM Trans. Graph. (TOG) 34(4), 84 (2015)
Miyazaki, T., Tsuchiya, T., Sugaya, Y., Omachi, S., Iwamura, M., Uchida, S., Kise, K.: Automatic generation of typographic font from a small font subset. CoRR arXiv:1701.05703 (2017)
Mordvintsev, A., Olah, C., Tyka, M.: Inceptionism: going deeper into neural networks. http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html (2015)
Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724. (2014)
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: Proceedings of The 33rd International Conference on Machine Learning, vol. 3. (2016)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M.S., Berg, A.C., Li, F.: Imagenet large scale visual recognition challenge. CoRR arXiv:1409.0575 (2014)
Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Stat. 26(5), 1651–1686 (1998)
Suveeranont, R., Igarashi, T.: Example-based automatic font generation. In: International Symposium on Smart Graphics, pp. 127–138. Springer (2010)
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Tenenbaum, J.B., Freeman, W.T.: Separating style and content with bilinear models. Neural Comput. 12(6), 1247–1283 (2000)
Tschichold, J.: Treasury of alphabets and lettering: a source book of the best letter forms of past and present for sign painters, graphic artists, commercial artists, typographers, printers, sculptors, architects, and schools of art and design. A Norton professional book, Norton. https://books.google.com/books?id=qQB7lqSrpnoC (1995)
Upchurch, P., Snavely, N., Bala, K.: From A to Z: supervised transfer of style and content using deep neural network generators. ArXiv e-prints arXiv:1603.02003 (2016)
Van Santen, J.P., Sproat, R., Olive, J., Hirschberg, J.: Progress in Speech Synthesis. Springer, Berlin (2013)
Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., Wu, Y.: Learning fine-grained image similarity with deep ranking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1386–1393. (2014)
Wang, M.: Multi-path convolutional neural networks for complex image classification. CoRR arXiv:1506.04701 (2015)
Wang, Y., Wang, H., Pan, C., Fang, L.: Style preserving Chinese character synthesis based on hierarchical representation of character. In: IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. ICASSP 2008, pp. 1097–1100. IEEE, (2008)
Wang, Z., Yang, J., Jin, H., Shechtman, E., Agarwala, A., Brandt, J., Huang, T.S.: Deepfont: identify your font from an image. In: Proceedings of the 23rd ACM international conference on Multimedia, pp. 451–459. ACM, (2015)
Willats, J., Durand, F.: Defining pictorial style: lessons from linguistics and computer graphics. Axiomathes 15(3), 319–351 (2005)
Xu, L., Ren, J.S., Liu, C., Jia, J.: Deep convolutional neural network for image deconvolution. In: Advances in Neural Information Processing Systems, pp. 1790–1798. (2014)
Zaremba, W., Sutskever, I.: Learning to execute. arXiv preprint arXiv:1410.4615 (2014)
Zeiler, M.D., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q.V., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., et al.: On rectified linear units for speech processing. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 3517–3521. IEEE, (2013)
Zong, A., Zhu, Y.: Strokebank: automating personalized Chinese handwriting generation. In: AAAI, pp. 3024–3030 (2014)
Acknowledgements
Please see Figure 11.
Ethics declarations
Conflict of interest
The author declares that he has no conflict of interest.
Appendices
Appendix 1: Font retrieval
In this study, we have concentrated on synthesizing fonts and on assessing their quality through the use of a discrimination ensemble. An alternate use of the discrimination ensemble is to retrieve fonts that are visually similar to a target font.
The font retrieval task arises when an unlabeled (target) font is encountered and a user/designer wants to identify the font. Manually searching through ten thousand or more candidate font samples is a daunting task. The problem is exacerbated by the fact that an exact match to the font seen may not be publicly available, in which case similar fonts should be returned.
To address this task, recall that the font discrimination ensemble is trained to determine whether a candidate letter belongs to the same font as four input letters, BASQ. Although in this study we used the ensemble as a binary classifier, the font discrimination networks and ensemble have real-valued outputs (i.e., they yield a score); this information can be used for font retrieval. Recently, [55] used a neural network approach to find fonts similar to those that may appear in photographs. Here, we perform the analogous task: given a sample of a font (for the networks trained here, the sample includes the letters BASQ), is it possible to search through the entire database of fonts and find the exact, or similar, fonts? (See Note 6.)
Scoring for retrieval works as follows. As before, the discriminator ensemble takes five letters as input: the first four are BASQ from a known font, and the fifth is a candidate letter whose font membership is to be judged. For this retrieval task, the first four letters are taken from the target font. For each font in the larger font pool, each of that font’s letters is, in turn, used as the fifth input to the discriminator ensemble, and the resulting similarity score is recorded. The candidate font’s total similarity to the target font is simply the summed (real-valued) outputs of the discriminator ensemble across all of the letters in the candidate font. This simple interpretation of the discrimination scores as similarity measurements yields positive results for retrieval; see Figure 12.
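The scoring procedure can be summarized in a few lines. The sketch below is illustrative only, not the implementation used in the paper; `discriminator_ensemble`, the glyph containers and the ranking helper are hypothetical names, with the ensemble assumed to be a callable that maps four basis glyphs plus one candidate glyph to a real-valued same-font score.

```python
def retrieval_scores(target_basq, font_pool, discriminator_ensemble):
    """Rank candidate fonts by summed discriminator scores.

    target_basq:            list of 4 grayscale glyph images (B, A, S, Q) of the target font.
    font_pool:              dict mapping font name -> dict of letter -> glyph image.
    discriminator_ensemble: callable(basis_glyphs, candidate_glyph) -> real-valued score.
    """
    scores = {}
    for font_name, glyphs in font_pool.items():
        # Sum the ensemble's real-valued output over every letter of the candidate font.
        scores[font_name] = sum(discriminator_ensemble(target_basq, glyph)
                                for glyph in glyphs.values())
    # Higher summed similarity means a better match; return fonts sorted best-first.
    return sorted(scores.items(), key=lambda kv: -kv[1])
```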
Unfortunately, due to the large differences in the size and type of the dataset used by [55], the results shown here should not be directly compared. Our candidate retrieval pool is more than \(4\times \) as large and contains far greater diversity of font styles (many are ‘non-professional’ and were developed for specific artistic purposes, including the aforementioned ransom and picture fonts).
It is interesting to note that simpler methods, such as comparing pixel differences (for example, the \(L_2 \) pixel-wise difference on each character’s image), also yield reasonable answers on some of the fonts because such measures capture the overall thickness of the strokes and the size of the characters. However, more general stylistic similarities may not be captured; for examples, see Figure 13.
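For reference, the pixel-wise baseline amounts to summing squared per-character differences. The minimal sketch below assumes each font is stored as an aligned 26 × H × W grayscale array; the function name and array layout are assumptions for illustration.

```python
import numpy as np

def l2_font_distance(font_a, font_b):
    """Pixel-wise L2 distance between two fonts.

    font_a, font_b: arrays of shape (26, H, W) holding grayscale glyphs for A-Z,
                    already centered and scaled identically.
    """
    diff = font_a.astype(np.float32) - font_b.astype(np.float32)
    # Sum squared differences per character, then across the whole alphabet.
    return float(np.sum(diff ** 2))
```

Such a distance is sensitive to stroke thickness and character size, which explains why it performs reasonably on some fonts while missing broader stylistic cues.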
In summary, this appendix is included to show the potential for using the trained networks in a novel way: for font retrieval. Although these networks were not trained to measure similarity, preliminary results show promise with no modification to the networks. In the future, training networks to explicitly measure similarity, or to create task-specific embeddings for measuring distances, should be explored [18, 52].
Appendix 2: Neural repair/sharpening
The most common unwanted artifacts in the synthesized characters are blur and missing connections between portions of the characters when there is an exceptionally thin stroke width. The first problem can be partially alleviated through the use of ‘off-the-shelf’ image sharpening and thresholding tools such as those found in many consumer-level photograph-editing packages. The second problem requires more domain-specific knowledge.
One of the primary aims of this paper was to elucidate the strengths and limitations of a straightforward deep neural network approach to font synthesis. This should provide a strong baseline against which more sophisticated algorithms can be measured. In this appendix, we briefly introduce one extension to the procedures presented in the main body of the paper that helps alleviate the problems noted above.
To address both problems, blur and missing connections, we created a secondary network, termed the repair network. This network takes as input both the inputs to the generation networks and the synthesized outputs of the generation networks. The output of the repair network is 26 letters from which, hopefully, some of the unwanted artifacts have been removed.
The network architecture is exactly the same as that of the synthesis network, augmented with the additional inputs. As before, the hidden layer is shared between the output letters. The training procedure again employs the same fonts that were used in the main study, ensuring that no external, extra information is introduced into the procedure. To summarize, the training pairs are as follows (a sketch of the pair construction is given after the list):
input (30 letters): original (BASQ) + synthesized (A-Z)
target outputs (26 letters): original (A-Z)
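A minimal sketch of how such a training pair could be assembled is shown below. The container names, function name and array stacking are illustrative assumptions, not the paper’s code; only the 30-image input and 26-image target layout comes from the description above.

```python
import string
import numpy as np

ALPHABET = string.ascii_uppercase  # 'A'..'Z'

def make_repair_training_pair(original_glyphs, synthesized_glyphs, basis="BASQ"):
    """Build one (input, target) pair for the repair network.

    original_glyphs:    dict letter -> grayscale image, the ground-truth font.
    synthesized_glyphs: dict letter -> grayscale image, output of the synthesis network.
    """
    # Input: the four original basis letters followed by all 26 synthesized letters (30 images).
    net_input = [original_glyphs[c] for c in basis] + \
                [synthesized_glyphs[c] for c in ALPHABET]
    # Target: the 26 original letters that the repair network should reproduce without artifacts.
    net_target = [original_glyphs[c] for c in ALPHABET]
    return np.stack(net_input), np.stack(net_target)
```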
Other than the additional inputs, the most salient difference in the training algorithm is that, in addition to the glyph-reconstruction \(L_2\) error used to train the original synthesis networks, a secondary loss function is added. The outputs (pixels) are penalized in proportion to their distance from either 0.0 or 1.0. The penalty for an output x, where \(0\le x \le 1\), is \(\text{penalty}(x) = 0.5 - |0.5-x|\). This function has the property that the penalty is maximal when the output x has a value of 0.5 and zero when x is either 0.0 or 1.0. This encourages the outputs to move away from the mid-gray pixel activations that appear as blur. Note that because the reconstruction error is also present, adding this extra penalty does not necessarily drive the outputs to purely binary values.
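The combined per-pixel loss can be written compactly. The sketch below introduces a weighting term, `lambda_sharp`, which is an assumption; the paper does not specify how the two terms are balanced.

```python
import numpy as np

def repair_loss(output, target, lambda_sharp=1.0):
    """L2 reconstruction error plus the sharpening penalty.

    output, target: arrays of pixel values in [0, 1].
    lambda_sharp:   assumed relative weight of the sharpening term (not given in the paper).
    """
    reconstruction = np.sum((output - target) ** 2)
    # The penalty is maximal (0.5) when a pixel is 0.5 and zero when it is 0.0 or 1.0,
    # pushing outputs away from the mid-gray values that appear as blur.
    sharpen = np.sum(0.5 - np.abs(0.5 - output))
    return reconstruction + lambda_sharp * sharpen
```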
Ten sets of outputs from this repair network are shown in Fig. 14, which includes results that demonstrate both successful repairs and repairs that had no effect. Several of the fonts were also shown earlier in the main body of this paper to highlight the differences that the repair network can make. Additionally, we found a rare case in which a glyph was degraded.
The results are promising; many unwanted artifacts have been successfully removed. However, as shown in row 10, this particular sharpener is not yet a panacea for all of the repairs needed by the output of the letter generation networks. For many glyphs, there was little effect on the outputs. Importantly, however, in the vast majority of trials the results were not visually degraded; they were only improved or left unchanged. This additional mechanism is open to further research along at least two avenues. The first is integrating this error metric directly into the synthesis networks so that a separate network is not required. The second is developing ‘repair’ networks that are task/domain specific and that can be applied to other image generation tasks, perhaps even when the image generation itself is not done with a neural network.
Cite this article
Baluja, S. Learning typographic style: from discrimination to synthesis. Machine Vision and Applications 28, 551–568 (2017). https://doi.org/10.1007/s00138-017-0842-6