
What's new in TeX, part 2


October 28, 2015

This article was contributed by Lee Phillips

This is the second half of our look at how development around the venerable TeX document-preparation system has evolved in recent years. The first article dealt with TeX engines and packages that offer support for modern font handling, scripting, and graph plotting. This time, we will explore work that seeks to bridge the gap between the printed-page world addressed by the original TeX system and the web-centric needs of authors and publishers today.

Background

When TeX first arrived on the scene, it was a revelation. Those who took the time to master it could create beautiful mathematical documents by themselves, without engaging the services of a professional typesetter or a secretary. At that time, and for some time thereafter, "document" meant "something printed on paper."

Advances in hardware and software soon changed that definition—but also made TeX even more valuable. LaTeX automatically numbered our equations and sections for us, eliminating the drudgery of renumbering dozens of references after adding or removing an equation. New tools allowed me to preview my thesis chapters right on the screen, without having to print them out; now, of course, this is routine. PDF output engines became standard and, eventually, computers became fast enough to scroll through PDFs fluidly, making the dream of a paperless collection of scholarly literature a reality.

Around the same time, there was another revolution underway. I remember watching it unfold in real time, as WAIS, Gopher, FTP, and the World Wide Web competed to become the standard for online delivery and sharing of documents. It was over almost before it started, as everyone was enchanted with the Web's embedded hypertext links.

The web has made tremendous progress, and the TeX world has evolved as well. However, the web still does not solve the problems solved by TeX. And TeX still does not offer what the web provides.

This creates a problem for scientists and other authors of technical material, who may be loath to give up the ease of mathematical markup, and the high quality of the result, offered by TeX (and who may be required to use it by their preferred journals), but would like to make their results available on the web. The web, after all, provides immediacy, nearly universal access, and opportunities for feedback and reuse that the sometimes glacial cycles of formal journal publication cannot match. On the bridge being built between the worlds of the web and of TeX, construction has begun from each side of the divide; whether the two efforts will meet in the middle to give us a new, unified infrastructure remains to be seen.

Typography on the web

If you follow the web-design literature you will see plenty of articles about "typography on the web." Almost all of them concern subjects such as how best to serve font files, or font-licensing issues, and only rarely typography itself. The ability to specify fonts for web pages, rather than relying on the user's system fonts, represents a significant advance and is crucially important for displaying mathematics on the web, as we'll discuss below. But despite some progress in the text-layout engines used in web browsers, the state of typography on the web is still not up to the (perhaps slightly obsessive) standards that habitual TeX users have learned to expect.

Here are two figures illustrating the problem. The left image shows TeX output, and the right image shows how the Chrome web browser displays the same text. I've used the same font in both examples and tried to get the line length and leading to be the same.

[TeX output] [Chrome output]

The most obvious deficiency in the web example is the inconsistent (and often large) gaps in the spacing of the text, caused by the lack of hyphenation. But it exhibits more subtle problems, too, such as the absence of ligatures (note "coffin" at the end of line eleven). Web results can be improved by using a JavaScript hyphenator, but until something equivalent to TeX's whole-paragraph optimization is implemented in browsers, their rendering of text will always be—at least subtly—inferior.
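The "whole-paragraph optimization" mentioned above refers to TeX's approach of choosing the set of line breaks that minimizes the total badness of the paragraph, rather than filling one line at a time as browsers do. A toy sketch of the dynamic-programming idea (ignoring hyphenation, stretchable glue, and penalties, all of which the real Knuth-Plass algorithm handles) might look like this in Python:

```python
def badness(words, i, j, width):
    # Demerit for setting words[i:j] on one line: the squared
    # leftover space, or infinity if the words do not fit.
    line_len = sum(len(w) for w in words[i:j]) + (j - i - 1)
    if line_len > width:
        return float("inf")
    return (width - line_len) ** 2

def optimal_breaks(words, width):
    # best[i] is the minimal total badness for typesetting words[i:];
    # choice[i] records where the line starting at word i should end.
    n = len(words)
    INF = float("inf")
    best = [INF] * n + [0.0]
    choice = [n] * (n + 1)
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            b = badness(words, i, j, width)
            if b == INF:
                break  # adding more words only makes the line longer
            # As in TeX, the last line is not penalized for being short.
            cost = best[j] + (0 if j == n else b)
            if cost < best[i]:
                best[i], choice[i] = cost, j
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:choice[i]]))
        i = choice[i]
    return lines
```

A greedy breaker can leave one line much looser than its neighbors; minimizing the summed squared slack spreads the spacing evenly across the paragraph, which is the effect visible in the TeX sample above.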

These examples are of simple prose. More demanding material, such as "critical edition" texts with multiple languages and scripts or footnotes within footnotes, or poetry, requires a degree of typographic control that is nearly impossible to achieve with HTML.

One obvious solution might be to simply serve our PDF files and not worry about translation to HTML. Since it's easy, with LaTeX, to insert hypertext references in PDF output, one can seamlessly navigate between PDF documents and between PDF and HTML. However, the inability of PDF to reflow, its larger file sizes, its reuse-unfriendly binary nature, and a lingering prejudice against the format on the web, compel us to look for a more flexible solution.

Math on the web

Early attempts to render mathematical notation on the web, including translations from LaTeX, relied either on the insertion of images for equations or on "font hacks" that built up equations by wrapping characters from symbol fonts in individually positioned elements. Both approaches introduced assorted problems, but were the best we could do.

One attempt at a solution to these problems is MathML, a W3C standard for mathematical notation. Despite existing for well over a decade, however, MathML still has poor browser support. There are tools to translate various input formats into MathML, but it's a verbose format unsuited to direct authoring.

As an example of MathML's verbosity, let's look at the display of a simple equation (which some mathematicians consider the most beautiful equation in mathematics):

[Euler's equation]

In a (La)TeX document, the markup for this equation is:

    e^{i\pi} = -1

There are a handful of online services that translate TeX mathematical markup to MathML. The one at mathtowebonline.com translates the above into:

 
    <mtable class="m-equation-square" displaystyle="true"
      style="display: block; margin-top: 1.0em; margin-bottom: 2.0em">
      <mtr> 
        <mtd> 
          <mspace width="6.0em" /> 
        </mtd>
        <mtd columnalign="left"> 
          <msup> 
            <mi>e</mi>
            <mrow> 
              <mi>i</mi> 
              <mi>&#x003C0;</mi>
            </mrow> 
          </msup> 
          <mo>=</mo>
          <mo>-</mo> 
          <mn>1</mn> 
        </mtd>
      </mtr> 
    </mtable> 

MathJax seems to be succeeding at what MathML set out to do. This project has changed the landscape for mathematics on the web dramatically. By including a single JavaScript-loading directive in the header of your web page, you can mix LaTeX math freely with your HTML. The reader's browser will download any required math fonts, and the results look almost identical to LaTeX. While this doesn't solve the other typographical issues mentioned above, MathJax is an excellent solution to the problem of math on the web, and far better than anything that has come before, at least if we insist on serving HTML. Its only major drawback is the considerable delay required to download the fonts (if the user doesn't already have them) and for JavaScript to render the equations. From the point of view of speed, there seems to be no advantage over serving the equivalent document as a PDF.
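A minimal page using MathJax might look like the following; the CDN URL and configuration name are illustrative of MathJax's usage at the time of writing, so check the MathJax documentation for the currently recommended snippet:

```html
<!DOCTYPE html>
<html>
<head>
  <!-- Load MathJax; it scans the page and typesets anything between
       \( \) (inline) or \[ \] (display) delimiters. -->
  <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML">
  </script>
</head>
<body>
  <p>Euler's identity, \( e^{i\pi} = -1 \), typeset in place.</p>
</body>
</html>
```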

KaTeX is an alternative to MathJax that claims to be significantly faster. It can process mathematical markup in the browser or, ahead of time, on the server. Time will tell whether this kind of approach will replace MathJax, which can only become more useful as JavaScript engines continue to improve.

Write once ...

At first glance, one obvious way for users of TeX to embrace the web would be to continue to write their documents in TeX and transform them, using an automated tool, to HTML. In this way, a single source file could produce both a PDF and at least an approximate HTML rendering of the same document. Unfortunately, this cannot work in general. TeX is a Turing-complete programming language as well as a system of markup: any computation can, with some difficulty, be expressed in it. This means that no program can determine whether an arbitrary TeX source document, when run through the TeX program, will even terminate and produce output. The only way to find out what the result of running TeX on a source document will be is to actually run TeX and look at the result.

A common problem is that TeX-to-HTML translators are usually unable to handle TeX containing user macros, since the results of macros are impossible to predict without running TeX on the source document. Here is a very simple example of a macro, which will work in plain TeX or LaTeX:

    \def\nchars#1#2{#2\count0=#1 \advance\count0 by -1
    \ifnum\count0>0 \nchars{\count0}{#2} \fi}

This macro accepts two parameters: a number and an arbitrary character. When it is called it prints the character the number of times specified, looping by recursion. If we say \nchars{30}{\textbullet} in our TeX source, we get:

[TeX macro output]

printed in the output. It would be easy to make a mistake in the definition of a macro such as this that would lead to an infinite loop. Naturally, this would not result in a useful TeX document either, but it illustrates why the problem of automatic TeX translation to declarative markup is insoluble.

However, the fact that something is merely impossible doesn’t stop everyone. There are a handful of TeX-to-HTML translators available. Some do a pretty good job on relatively simple documents, and are worth considering if you have a TeX paper that you need to convert to HTML quickly.

One converter that does take the approach of requiring the user to run TeX first is the tex4ht system, which is actually distributed with TeX Live. To use tex4ht, you include a special style (and, optionally, embed special directives) in your LaTeX source, and process your document with an engine that creates a DVI file. Then you process the DVI file with a tool that can create not only HTML, but .odt (OpenOffice) and other formats. Tex4ht has, indeed, made realistic single-source authoring workflows [PDF] based on LaTeX markup a reasonable approach—if the author is willing to prepare source files with multiple output targets in mind. There are some limitations, though, including the inability to use OpenType fonts.

For example, here is the classic TeX representation of Stokes' Theorem (which we explored earlier in part one of this series), with tex4ht markup added:

 
    \documentclass{article} 
    \def\HCode#1{} 
    \begin{document}
    
    \HCode{<div name = 'top'>Top</div>} 

    Here is the elementary version of Stokes' Theorem:

    \[ \int_\Sigma \nabla\times \mathbf{F} \cdot 
    d\Sigma = \oint_{\partial\Sigma}\mathbf{F}\cdot d\mathbf{r} \]
    
    \HCode{<a href = '\#top'>Go to top</a>} 
    \end{document} 

The \HCode commands in the example embed their arguments verbatim in the HTML output. We've used this mechanism to include an example of some navigation that might be useful on a web page. In order to prevent the HCode commands from interfering if the file is fed through pdflatex, the second line disables them.

The tex4ht installation includes a number of convenience scripts to include the required style files automatically, to generate images from equations, and to perform other tasks needed to produce the final output file. If HTML is the target, we process the above by passing its filename to the htlatex command. The result is an HTML file in the current directory along with a PNG image for the displayed equation:

[tex4ht output]
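Assuming the example above is saved as stokes.tex, the conversion described is a single command; the option string in the second form is one of several tex4ht configurations (the file name here is a placeholder):

```shell
htlatex stokes.tex                 # default HTML output
htlatex stokes.tex "xhtml,mathml"  # XHTML with MathML for the equations
```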

Another approach is offered by the pdf2htmlEX project. This tool translates any PDF (not just one created by TeX) into an HTML page. In my tests, it did an accurate job at what is sometimes described as an impossible task, and the examples on its GitHub page are impressive.

But the HTML that this program generates is not human-readable and poses a challenge to search engine indexing: nearly every character is wrapped in an individually positioned HTML element. The result is HTML in name, but not in spirit; the pages are slow to load and they do not reflow. For example, this sentence (taken from the "scientific paper" example on the pdf2htmlEX site):

    Dynamic languages such as JavaScript are more difficult to com-
    pile than statically typed ones.

corresponds to the HTML shown in the figure below.

[pdf2htmlEX example]

There seems to be little advantage in this approach over simply putting a PDF online.

Going the other way—writing in HTML and translating, as needed, to TeX—is a non-starter except for the simplest of documents. This is because HTML provides a small subset of what can be expressed using TeX (see part one of this series for examples, or browse the TeX Showcase). However, partial translations might be useful in certain circumstances. An interesting project in this vein is xml2tex, which allows the author to define arbitrary mappings from XML tags (and therefore from XHTML) to TeX markup.

Markup to markup

The alternative to writing in TeX or HTML and trying to translate between them is to author in another markup system, more general than either, that can be translated into a variety of target formats. The most prominent example of this approach is DocBook. In its XML incarnation, DocBook is a "schema", or set of tags, that describes various types of documents, including books, by defining the abstract intent of their various elements.

There are tools to transform DocBook files into various other text formats, such as HTML, and into "final output" formats, including PDF. It has a bad reputation among people who have gravitated to TeX in search of high-quality output, though, because, historically, the toolchain for producing PDF from DocBook created results with very poor typography. More recently, the maturing of the dblatex project, which converts DocBook to LaTeX, is making it possible to write in DocBook with HTML, high-quality PDF, and other targets in mind.
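Invoking dblatex is straightforward; by default it produces a PDF next to the input file (the file name below is a placeholder):

```shell
dblatex book.xml               # produces book.pdf via LaTeX
dblatex -o manual.pdf book.xml # explicit output name
```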

This is still not an acceptable solution for a great many authors, however, because of the daunting nature of DocBook's tag system. It is highly abstract, voluminous (there are over 400 elements), and requires deep nesting of XML elements to express commonplace intents. Despite its huge catalogue of elements, DocBook is draconian about enforcing its abstract conception of how a document should be described, making it impossible to specify purely visual output, such as bold text.

The need for a more pleasant and flexible markup system, one that was nevertheless more abstract than TeX or HTML and could be transformed to either, led Torsten Bronger to create the tbook project. Tbook is also an XML schema, but it is designed to incorporate an almost-LaTeX syntax for mathematics—and to be less verbose, avoiding DocBook's deep nesting. Here is Bronger's example of what you need to do to include a figure in DocBook:

    <figure label="1.1" float="1">
      <title>The galaxy.</title>
      <mediaobject>
        <imageobject>
          <imagedata fileref="galaxie.png" format="PNG"/>
        </imageobject>
      </mediaobject>
    </figure>

Obviously, this is more verbose and deeply nested than HTML, TeX, or any markup system that an author would be likely to find convenient. Tbook markup is more compact:

    <figure>
          <graphics file="galaxie" kind="vector"/>
          <caption>The galaxy.</caption>
    </figure>

It can be transformed into LaTeX, HTML, or even DocBook, using the author's set of XSLT stylesheets. It contains facilities for processing illustrations and performing a few other tasks for the convenience of its main intended audience: authors of scientific papers. Regrettably, tbook never gained traction, probably because modifying its behavior non-trivially meant working with XSLT—an exercise in masochism rivaling the creation of LaTeX style files.
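An XSLT transformation like tbook's is driven by a processor such as xsltproc; the stylesheet file name below is hypothetical, standing in for whichever of the project's stylesheets targets LaTeX:

```shell
# Apply an XSLT stylesheet (hypothetical name) to a tbook document
xsltproc tbook-to-latex.xsl paper.xml > paper.tex
```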

The very idea of writing with XML tags at all is anathema to many people. Linus Torvalds, for example, believes that XML is "probably the worst format ever designed", "a complete disaster", and "horrible crap". Relief can be found by turning to the family of "lightweight" text-based formats. These formats all strive to make the document source easy for the human eye to parse, which they accomplish by adopting some conventions that evolved in email (and elsewhere) for indicating emphasis, headings, and the like.

One of the earliest of these is AsciiDoc, which was originally intended as a friendlier shorthand for DocBook. It has transformers to create HTML, LaTeX, and more, and it is designed so that both its syntax and transformations can be conveniently extended by the user. Direct transformation to LaTeX is still experimental, but can be accomplished by transforming to DocBook and using dblatex or other DocBook tools for the final step. Here is a somewhat recursive example that gives some of the flavor of AsciiDoc markup:

    What's New in TeX: Part 2     
    =========================

    Write once...
    -------------

    Here is the http://tbookdtd.sourceforge.net/db-diss.html[tbook
    author's] example of what you need to do to *include a figure* in
    DocBook: 

    .DocBook Example
    ---- 
    <figure label="1.1" float="1">
      <title>The galaxy.</title>      
      <mediaobject>   
        <imageobject>    
          <imagedata fileref="galaxie.png" format="PNG"/>
        </imageobject>
      </mediaobject>
    </figure>
    ----

    Obviously, this is _more verbose_ and deeply nested     

This figure shows what the HTML looks like when this sample is run through the asciidoc command, using the default stylesheet.

[AsciiDoc output]
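Producing that HTML is a matter of running the asciidoc command on the source file; the a2x wrapper drives the DocBook toolchain (including dblatex) for other targets. The file name is a placeholder:

```shell
asciidoc -b html5 article.txt  # writes article.html
a2x -f pdf article.txt         # DocBook -> PDF via the dblatex toolchain
```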

Pandoc is a Haskell program that can transform between many pairs of file formats. It is not yet a solution for authoring for the web using LaTeX, despite offering a LaTeX-to-HTML translator, because its LaTeX parser is still under development. However, it can translate from a highly extended form of Markdown (a lightweight format in the same spirit as AsciiDoc) to LaTeX, HTML, and many other formats. Pandoc makes it realistic to adopt its dialect of Markdown as an authoring language and target the web, PDF, and even OpenOffice and Word. You can use LaTeX math directly, and even transform your text to DocBook, if your publisher requires it.
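A single Pandoc-flavored Markdown source can target several formats from the command line; for example (file names are placeholders):

```shell
pandoc -s --mathjax paper.md -o paper.html  # standalone HTML, math via MathJax
pandoc paper.md -o paper.pdf                # PDF, via a LaTeX engine
pandoc -s -t docbook paper.md -o paper.xml  # DocBook, if a publisher requires it
```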

Sphinx began as a solution for writing Python documentation, but is being adopted by a growing number of authors of technical books and papers. Its input is reStructuredText, which is another lightweight format similar to AsciiDoc. It can convert to HTML, LaTeX, and a few other formats, such as man pages. Sphinx may be a more convenient choice than Pandoc if you intend to alter the transformation engine and are more familiar with Python than with Haskell.
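reStructuredText marks up structure with punctuation and explicit directives; a fragment using the standard math directive, which Sphinx renders via MathJax in HTML output and natively in LaTeX output, looks like:

```rst
Write once...
-------------

Here is the elementary version of Stokes' Theorem:

.. math::

   \int_\Sigma \nabla\times \mathbf{F} \cdot d\Sigma =
   \oint_{\partial\Sigma} \mathbf{F} \cdot d\mathbf{r}
```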

Pandoc, Sphinx, and AsciiDoc are all mature, realistic solutions that should allow you to repurpose your possibly math-heavy documents for multiple targets, including the web, PDF, and more. They, as well as everything else I've discussed in this article, are free and open source. They can be used in conjunction with MathJax, and so can serve, for authors of technical material, as the bridge between the web and the TeX world.

I hope I've provided some sense, in this article and the last, of how the two great branches of technical publication have each gained greater power and added interoperability in recent years. Despite the progress in both the TeX and web worlds, however, we still face a Tower of Babel of markup languages and a paralysis of choice in technology when sitting down to write. The difficulty is not merely in finding a workflow that is congenial, which is simple enough, but in making choices that will future-proof your content and allow it to travel, with little manual intervention, from journal to web site, from book chapter to slideshow.

I regret that there is no obvious, final recommendation to give here in this final paragraph—except what is probably self-evident to readers of this publication, which is to stick with free software and text-based formats that allow the profitable use of version control and all our familiar utilities. And, finally, to relax and remember that the quality of your content is infinitely more important than the technology you use to express it.


Index entries for this article
GuestArticles: Phillips, Lee



What's new in TeX, part 2

Posted Oct 29, 2015 2:11 UTC (Thu) by roc (subscriber, #30627) [Link] (3 responses)

It's too bad you didn't try Firefox for your "typography on the Web" example. Firefox supports auto-hyphenation (if you enable it with CSS "hyphens:auto" and set lang="..." in the HTML). It also enables ligatures at all font sizes. Chrome != the Web.

What's new in TeX, part 2

Posted Oct 29, 2015 2:21 UTC (Thu) by leephillips (subscriber, #100450) [Link] (2 responses)

You are absolutely correct. Also, Firefox supports MathML, which Chrome does not. Firefox and other browsers that support CSS hyphenation still, however, do not control for such things as consecutive hyphens, widows, orphans, and other things that make paragraphs look bad and hard to read. Thanks for pointing out my omission.

What's new in TeX, part 2

Posted Oct 29, 2015 7:11 UTC (Thu) by matthias (subscriber, #94967) [Link]

I seldom see orphans and widows on the web. The reason is that there seems to be no good solution for breaking text down into columns. Looking at standard 16:9 monitors, there is easily space for 2-3 columns, which would make reading much easier than the overlong lines one often sees. Once this support is there, of course, we have to avoid orphans and widows.

However, I am not sure what would be the best solution for pages that do not fit entirely on screen, even with columns. Breaking them down into smaller pages would be one option. Horizontal scrolling (to see more columns) another. Vertical scrolling probably would not work that well, as obviously a single column should not be higher than the screen.

Interestingly, for small devices like phones, the existing single "endless" column approach works much better than for bigger devices.

Limited MathML support

Posted Oct 29, 2015 7:17 UTC (Thu) by pr1268 (subscriber, #24648) [Link]

The paucity of Web browser support for MathML is depressing. It's an ISO/IEC standard, and yet the reasons Chrome, IE, and Opera/Safari dropped support (or never supported it fully) are lame. Sigh...

Thanks for the article, Lee!

Most beautiful equation

Posted Oct 29, 2015 4:21 UTC (Thu) by brugolsky (subscriber, #28) [Link] (11 responses)

Quibble: the "most beautiful equation" is better written as e^{i\pi} + 1 = 0, as that presentation highlights how the various operations and their unit elements delicately combine and relate elementary arithmetic, algebra, geometry, & calculus.

Most beautiful equation

Posted Oct 29, 2015 7:04 UTC (Thu) by pr1268 (subscriber, #24648) [Link] (2 responses)

Should I assume you like this form better because it includes 0 (zero)?

Zero was such an abstract concept that the Romans didn't have a clue. (How do I write 0 in Roman numerals?)

Only thanks to 9th-century India (by way of Persia and Arabia) did 0 find its way to Europe. (Source: Wikipedia [which also notes the profound importance of 0 in mathematics].)

Most beautiful equation

Posted Oct 29, 2015 9:50 UTC (Thu) by ncm (guest, #165) [Link]

Off-topic, maybe, but it's amusing to note that Europeans were using abacuses -- which implicitly use place notation, and zero -- for centuries before they began transcribing the results that way.

Most beautiful equation

Posted Oct 29, 2015 11:02 UTC (Thu) by ballombe (subscriber, #9523) [Link]

You are confusing zero the digit with zero the number. Zero the number was known to the ancient Egyptians, alongside negative numbers. I do not know why people relate the invention of positional notation to the invention of zero, since before positional notation there were no true digits, only numbers.

Source: https://en.wikipedia.org/wiki/0_%28number%29#Egypt

Most beautiful equation

Posted Oct 29, 2015 11:08 UTC (Thu) by kingdon (guest, #4526) [Link] (6 responses)

Well, if we want to quibble about Euler's identity, I suppose I need to mention e^{i\tau} = 1, and point to http://tauday.com/tau-manifesto which has a rather lengthy discussion of this.

Tau Manifesto

Posted Oct 29, 2015 12:57 UTC (Thu) by fmyhr (subscriber, #14803) [Link]

Thanks for that link, I thoroughly enjoyed it.

Most beautiful equation

Posted Oct 29, 2015 13:11 UTC (Thu) by brugolsky (subscriber, #28) [Link]

Ha, good luck to tau advocates! I was "traumatized" decades ago, at the age of 10, when I realized that my hero Euler had stained mathematics by implicitly giving credence to the ancient but wretched concept of "diameter"! The "5 constant presentation" of Euler's equation is a consolation prize for the errors of history. ;-)

Most beautiful equation

Posted Nov 3, 2015 10:34 UTC (Tue) by ballombe (subscriber, #9523) [Link]

This manifesto omits all the instances where tau is used as a variable:
Should we rewrite the nome
q(tau) = exp(2*i*pi*tau)
as
q(pi) = exp(i*pi*tau)

Most beautiful equation

Posted Nov 3, 2015 12:28 UTC (Tue) by paulj (subscriber, #341) [Link] (2 responses)

Interesting text, but jeebus, it's not very readable. Could they not have found a way to render the TeX markup? (and I know TeX math markup!).

Most beautiful equation

Posted Nov 3, 2015 14:44 UTC (Tue) by johill (subscriber, #25196) [Link] (1 responses)

I think it's JavaScript-enabled to render that way? At least it does for me on chromium.

Most beautiful equation

Posted Nov 3, 2015 16:34 UTC (Tue) by kingdon (guest, #4526) [Link]

Correct, it uses the (Apache-licensed) MathJax library so if javascript is blocked or malfunctions the math won't get rendered.

Most beautiful equation

Posted Oct 29, 2015 12:29 UTC (Thu) by leephillips (subscriber, #100450) [Link]

You are right, I should have written it that way. The more ordinary explanation for the formula's beauty is that it combines the five most important numbers.

What's new in TeX, part 2

Posted Oct 29, 2015 6:41 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

"What's new in TeX" sounds a just like "Exciting developments in plate tectonics!"

What's new in TeX, part 2

Posted Oct 29, 2015 6:52 UTC (Thu) by pr1268 (subscriber, #24648) [Link]

Hey, Knuth's creation was/is so profound it caused the Earth to tremble!

;-)

What's new in TeX, part 2

Posted Oct 29, 2015 9:54 UTC (Thu) by Seegras (guest, #20463) [Link] (2 responses)

> Tex4ht has, indeed, made realistic single-source authoring workflows [PDF] based on LaTeX markup
> a reasonable approach—if the author is willing to prepare source files with multiple output
> targets in mind.

You nearly had me. But then comes this:

> There are some limitations, though, including the inability to use OpenType fonts.

I do understand that OpenType probably can't be used on the Web, but does this mean I can't use OpenType fonts in the document at all? Of course, all my documents use OpenType fonts. Ligatures, you know..

What's new in TeX, part 2

Posted Oct 29, 2015 12:27 UTC (Thu) by leephillips (subscriber, #100450) [Link]

Eitan Gurari, the creator of tex4ht, died unexpectedly in June 2009, and nobody has come forward to advance this rather amazing piece of software past the DVI age. So, no fontspec nor OpenType in the document at all.

What's new in TeX, part 2

Posted Oct 29, 2015 23:28 UTC (Thu) by michal_h21 (guest, #105104) [Link]

It is actually possible to use tex4ht with Open Type fonts, both used in the TeX document and in the generated HTML file. It isn't supported by default, due to a bug in tex4ht DVI processor, but some tips are given in [1]. In the HTML, you can either include local fonts [2], or use some service such as Google Fonts [3]

[1] http://michal-h21.github.io/samples/helpers4ht/fontspec.html
[2] http://tex.stackexchange.com/a/166061/2891
[3] http://tex.stackexchange.com/a/247479/2891

What's new in TeX, part 2

Posted Oct 29, 2015 10:55 UTC (Thu) by leipert (guest, #105098) [Link] (1 responses)

The article contains a mistake regarding the KaTeX library, the following is stated:

"It processes mathematical markup on the server, rather than in the browser."

KaTeX however supports in-browser rendering via JavaScript.
Server side rendering is possible, the JavaScript library does not need to be included.
In both cases inclusion of the CSS and fonts is necessary.

What's new in TeX, part 2

Posted Oct 29, 2015 12:30 UTC (Thu) by leephillips (subscriber, #100450) [Link]

True. I meant to say "can process...".

(La)TeX is awesome

Posted Oct 29, 2015 12:09 UTC (Thu) by dskoll (subscriber, #1630) [Link]

Thanks for this article. We use LaTeX and tex4ht to produce documentation for our commercial products. They do an amazing job. We have both beautiful PDF manuals and nicely-accessible HTML versions and because they're generated from exactly the same source, they're guaranteed to be semantically identical.

We also do a bit of pre- and post-processing so that from any page in our product's web interface, there's a help link that takes you to the exact section of the manual describing that page. And finally, the documentation process fits extremely well into our git/make workflow.

Sometimes the old ways really are the best ways.

What's new in TeX, part 2

Posted Oct 29, 2015 13:43 UTC (Thu) by jezuch (subscriber, #52988) [Link]

> Another approach is offered by the pdf2htmlEX project. This tool translates any PDF (not just one created by TeX) into an HTML page.

Recent versions of Chromium and (I think?) Firefox can open PDFs directly. Works quite well, I think, at least for the documents that I tried.

What's new in TeX, part 2

Posted Oct 29, 2015 13:47 UTC (Thu) by jnareb (subscriber, #46500) [Link] (2 responses)

One of the simpler ways of generating HTML from *LaTeX* sources (not plain TeX) is LaTeX2HTML. It is configurable, and produces quite readable, if uninspired, results.

BTW. why AsciiDoc was hyperlinked twice, but Markdown wasn't hyperlinked at all?

What's new in TeX, part 2

Posted Oct 29, 2015 13:57 UTC (Thu) by dskoll (subscriber, #1630) [Link]

We used to use LaTeX2HTML, but I find tex4ht produces far superior output.

The HTML produced by tex4ht is not quite as readable as that produced by LaTeX2HTML, but it's still decent enough that it can be post-processed reasonably easily.

What's new in TeX, part 2

Posted Oct 29, 2015 13:57 UTC (Thu) by leephillips (subscriber, #100450) [Link]

Original Markdown is useless for the kinds of documents we're talking about here, with footnotes, citations, equations, etc. That's why I mention it in the context of Pandoc's enhanced Markdown, which is useful for these purposes.

Commenting on my experience

Posted Oct 29, 2015 17:08 UTC (Thu) by gwolf (subscriber, #14632) [Link]

Thanks for a nice read, Lee. I will surely take a look at some of the tools you mention.

I have published two books using LaTeX (both available online, although both in Spanish), and both are in some way related to this article — The first, non-technical in nature, is «Construcción Colaborativa del Conocimiento» (http://seminario.edusol.info/), an overview of the permissive-licensing creative landscape from eleven authors from different disciplines, while the second is technical (although not mathematically heavy), «Fundamentos de sistemas operativos» (http://sistop.org/).

As for the first, the authors submitted their content via a Web platform (fundamentally based on Drupal). The editing work was basically converting that to LaTeX, for which I used gnuhtml2latex (which I maintain in Debian; discontinued upstream for good reasons). The experience was... nice, it made me learn a lot... but I would not go down that path again if possible. Fixing all the markup missed by gnuhtml2latex was painful, and once I started the editorial process, explaining to all of the authors that the text was effectively frozen was a topic in itself :)

The second experience has been very positive, although there are some things I would like to polish. I am using a stack controlled by Emacs org-mode, which allows me to use a very light markup and export very good quality LaTeX code. The sources for the book are very easy to follow with no knowledge of its syntax (e.g., https://raw.githubusercontent.com/gwolf/sistop/master/not... yields chapter 1 "in the raw"), and can be easily converted into LaTeX or into HTML (as an example, https://github.com/gwolf/sistop/blob/master/notas/01_punt... shows that same chapter converted by GitHub).
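As a rough illustration of the kind of light markup org-mode uses (this fragment is invented, not taken from the book's sources):

```org
* A chapter heading
  Body text with /emphasis/, =verbatim=, and plain prose.

** A subsection
   - a list item
   - another item with a [[http://sistop.org/][hyperlink]]
```

In Emacs, `C-c C-e` opens the export dispatcher, from which the same source can be turned into LaTeX, HTML, and other formats.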

To finish my work with this second book, I only need a bit of time for polishing the result. I have most of the book exportable to HTML chapter by chapter (which is enough, say, to craft an EPUB), but I still need the changes required to include my bibliographic references, which are done with BibLaTeX; writing a post-processor with this functionality cannot be too hard, but I have to find some free time to do it :)

What's new in TeX, part 2

Posted Oct 29, 2015 17:44 UTC (Thu) by oever (subscriber, #987) [Link]

A small correction: .odt is the file extension for the OpenDocument Format, which is supported by more software than just OpenOffice.

What's new in TeX, part 2

Posted Oct 30, 2015 1:57 UTC (Fri) by karkhaz (subscriber, #99844) [Link] (2 responses)

A nice and fairly complete LaTeX-to-HTML translator is HeVeA. It's mostly used for writing software documentation, but can be used for other tasks. The manual is here:

http://hevea.inria.fr/doc/index.html

(created using hevea).

What's new in TeX, part 2

Posted Oct 30, 2015 12:02 UTC (Fri) by leephillips (subscriber, #100450) [Link]

Thanks for mentioning this - there was no space to list all related projects. Note that HeVeA works on a subset of LaTeX, but it's very capable.

What's new in TeX, part 2

Posted Oct 30, 2015 21:13 UTC (Fri) by droundy (subscriber, #4559) [Link]

I was also disappointed by the omission of LaTeXML. True, it converts math to MathML, which is a limitation, but it is under ongoing development, and I think it shows some promise.

Hyphenation is overrated

Posted Nov 3, 2015 19:17 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

I've been thinking about this article, and more and more it seems that trying to move TeX to the web is the wrong idea.

End-of-line hyphenation is severely overrated, it actually slows readers down and looks fugly. And it's not even needed in modern browsers!

Scientific articles are most often formatted as two columns of text, as it is necessary for real dead-tree journals. You have to use small fonts to actually fit enough content into limited journal space and it's really hard to read long lines of small text.

But hey, we're now in a world where we can actually _rescale_ the text on demand. Why should we follow the two-column style?

Hyphenation is overrated

Posted Nov 10, 2015 15:55 UTC (Tue) by nix (subscriber, #2304) [Link]

Because e-readers, phones, etc., have small screens. You don't always want to carry a huge screen around with you!

Also, the recommended (for good reason) line length is only a few times the average length of a word -- so for longer words, a way of splitting them will *always* be needed for good appearance.


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds