
epub2txt — Extract text from EPUB documents

Version 0.1.3, March 2016

What is this?

epub2txt is a simple command-line utility for extracting text from EPUB documents and, optionally, re-flowing it to fit a text display of a particular number of columns. It is written entirely in ANSI-standard C, and should run on any Unix-like system with a C compiler. It is intended for reading EPUB e-books on embedded systems that can't host a graphical EPUB viewer, or for converting such e-books to read on those systems. However, it should be robust enough for other purposes, such as batch indexing of EPUB document collections.

epub2txt favours speed and low memory usage over accuracy of rendering. Most of the formatting of the source document will be lost but, with a text-only display, this is likely to be of little consequence.

This utility is specifically written to have no dependencies on external libraries, except the standard C library, and even on this it makes few demands. It does expect to be able to run an "unzip" command, however. The purpose of minimizing dependencies is to allow the utility to build on embedded systems without first having to build a set of supporting libraries.

epub2txt will output UTF8-encoded text by default, but can be told to output ASCII, in which case it will try to convert non-ASCII characters into something displayable if possible.

Prerequisites

epub2txt is intended to run on Linux and other Unix-like systems. It makes use of the common Unix unzip utility but has no other dependencies. It builds and runs on Windows under Cygwin, but not as a native Windows console application. The system must be set up such that there is a temporary directory at /tmp that users can write to, unless the environment variable TMP is set, in which case the utility will use that directory instead.
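
The temporary-directory rule can be sketched in a few lines of C. This is only an illustration of the behaviour described above, not code taken from epub2txt: the function names tmpdir_from and make_tmp_path are hypothetical, treating an empty TMP like an unset one is an assumption, and the sketch uses C99 features (snprintf, mixed declarations) for brevity.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: choose the base directory from the value of the
   TMP environment variable. An unset TMP falls back to /tmp; treating an
   empty TMP the same way is an assumption of this sketch. */
const char *tmpdir_from(const char *tmp_env)
{
    if (tmp_env == NULL || strlen(tmp_env) == 0)
        return "/tmp";
    return tmp_env;
}

/* Hypothetical helper: build "<tmpdir>/<name>" in buf.
   Returns 0 on success, or -1 if the path does not fit in size bytes. */
int make_tmp_path(char *buf, size_t size, const char *name)
{
    assert(buf != NULL && name != NULL && size > 0);
    int n = snprintf(buf, size, "%s/%s", tmpdir_from(getenv("TMP")), name);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With TMP unset, make_tmp_path(buf, sizeof buf, "work") would fill buf with "/tmp/work", provided buf is large enough.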

Building and installing

epub2txt builds and installs from a simple Makefile. On most systems, all you should need to do is:
$ make
# make install

Bugs and limitations

There is no support for any form of DRM or encryption, and such support is unlikely to be added in the future.

epub2txt handles only documents encoded in UTF8 (or ASCII), although I believe that UTF8 is more-or-less universal in EPUB. It writes output only in UTF8 (or in ASCII, when told to), regardless of the platform's locale. This limitation exists because epub2txt does all its own conversions between multibyte and fixed-size character encodings, to avoid creating a dependency on an external library. Doing this for UTF8 is enough work on its own; doing it for arbitrary encodings would be overwhelming.
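
To give an idea of what converting multibyte UTF8 to fixed-size characters involves, here is a minimal sketch of a decoder for a single UTF8 sequence. It is not the code epub2txt uses; the name utf8_decode and its interface are assumptions made for illustration, and it relies on C99's stdint.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder: read one UTF-8 sequence from s (len bytes are
   available) and store the code point in *cp. Returns the number of bytes
   consumed, or 0 if the sequence is invalid or truncated. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    unsigned char b = s[0];
    size_t need;          /* number of continuation bytes expected */
    uint32_t value, min;  /* min rejects overlong encodings */

    if (b < 0x80) { *cp = b; return 1; }
    else if ((b & 0xE0) == 0xC0) { need = 1; value = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { need = 2; value = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { need = 3; value = b & 0x07; min = 0x10000; }
    else return 0;  /* stray continuation byte or invalid lead byte */

    if (len < need + 1)
        return 0;   /* sequence is cut short */

    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;  /* not a continuation byte */
        value = (value << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values beyond Unicode. */
    if (value < min || value > 0x10FFFF ||
        (value >= 0xD800 && value <= 0xDFFF))
        return 0;

    *cp = value;
    return need + 1;
}
```

A return value of 0 is how a caller would notice a stray non-UTF8 8-bit byte, such as a Latin-1 character left in a file that claims to be UTF8.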

The program can't correct errors in encoding, and there are a large number of EPUB documents in public repositories that contain encoding errors. A common problem is spurious use of non-UTF8 8-bit characters, often in documents that have been converted from Microsoft Office applications.

epub2txt does not right-justify text, as there are already many good utilities to do this. A simple approach is to pipe the output into nroff, without specifying a width (-w). Not specifying a width turns off line-breaking in epub2txt, allowing nroff to justify the paragraphs. It will probably also be necessary to use the --ascii option, as nroff does not handle UTF8 text very well. For example:

epub2txt -a mydoc.epub | nroff

epub2txt extracts text aggressively, and will include things that cannot possibly be rendered properly in plain text. This includes constructs like indices and tables of contents, which will be of little use. The captions of pictures will also likely be included, even though the pictures themselves cannot. It seemed better to err on the side of extracting too much text than too little; unfortunately there is little in the EPUB format to distinguish content that is meaningful in a text-only representation from that which is not.

It is unlikely that any kind of fixed-layout structure of the source document will be rendered accurately in plain text, so epub2txt does not try. Tabs and other layout elements are collapsed into spaces, and text re-flowed according to the set line length, if any.
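
The collapse-and-reflow behaviour can be illustrated with a short sketch. This is not epub2txt's own code: the function reflow and its parameters are hypothetical, a width of 0 is taken to mean "no line-breaking", and the sketch counts bytes rather than characters, so multibyte UTF8 text would wrap early unless code points were counted instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: collapse every run of whitespace (spaces, tabs,
   newlines) in 'in' to a single separator and break lines so that none
   exceeds 'width' bytes. A width of 0 disables line-breaking. The output
   is never longer than the input, so 'out' needs strlen(in) + 1 bytes.
   A word longer than 'width' is placed on a line of its own, unbroken. */
void reflow(const char *in, size_t width, char *out, size_t out_size)
{
    assert(in != NULL && out != NULL && out_size > strlen(in));

    size_t o = 0;    /* write position in out */
    size_t col = 0;  /* length of the current output line */
    const char *p = in;

    while (*p != '\0') {
        /* Skip the whitespace run before the next word. */
        while (*p != '\0' && isspace((unsigned char)*p))
            p++;
        const char *start = p;
        while (*p != '\0' && !isspace((unsigned char)*p))
            p++;
        size_t wl = (size_t)(p - start);
        if (wl == 0)
            break;  /* only trailing whitespace was left */

        if (col > 0) {
            if (width > 0 && col + 1 + wl > width) {
                out[o++] = '\n';  /* word does not fit: start a new line */
                col = 0;
            } else {
                out[o++] = ' ';   /* word fits: single separating space */
                col++;
            }
        }
        memcpy(out + o, start, wl);
        o += wl;
        col += wl;
    }
    out[o] = '\0';
}
```

For example, re-flowing "The\tquick  brown\n fox jumps" to a width of 10 gives three lines: "The quick", "brown fox", and "jumps".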

Conversion of Unicode to ASCII is, in the general case, impossible. The --ascii switch tells epub2txt to perform some common conversions, such as substituting straight quotes for angled quotes. It will also attempt to replace accented Latin characters with non-accented equivalents, at least for commonly-used characters. However, there are a huge number of characters in the Unicode set that cannot be rendered, even approximately, in ASCII.
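
A table lookup is one simple way to do this kind of conversion. The sketch below shows the idea only: the table contents, the names ascii_fallback and emit_ascii, and the use of '?' for characters with no approximation are all assumptions for illustration, not a description of what epub2txt actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One entry per Unicode code point that has a usable ASCII stand-in. */
struct ascii_map { uint32_t cp; const char *ascii; };

/* A deliberately tiny, illustrative table; no entry exceeds 3 characters. */
static const struct ascii_map fallback_table[] = {
    { 0x2018, "'"   },  /* left single quotation mark */
    { 0x2019, "'"   },  /* right single quotation mark */
    { 0x201C, "\""  },  /* left double quotation mark */
    { 0x201D, "\""  },  /* right double quotation mark */
    { 0x2013, "-"   },  /* en dash */
    { 0x2026, "..." },  /* horizontal ellipsis */
    { 0x00C9, "E"   },  /* E with acute accent */
    { 0x00E0, "a"   },  /* a with grave accent */
    { 0x00E9, "e"   },  /* e with acute accent */
    { 0x00F1, "n"   },  /* n with tilde */
    { 0x00FC, "u"   },  /* u with diaeresis */
};

/* Return the ASCII stand-in for a non-ASCII code point, or NULL if the
   table has none. */
const char *ascii_fallback(uint32_t cp)
{
    size_t n = sizeof fallback_table / sizeof fallback_table[0];
    for (size_t i = 0; i < n; i++)
        if (fallback_table[i].cp == cp)
            return fallback_table[i].ascii;
    return NULL;
}

/* Write the ASCII rendering of cp into buf as a NUL-terminated string.
   ASCII passes through unchanged; unknown characters become "?" (an
   assumption of this sketch). buf must hold at least 4 bytes. */
void emit_ascii(uint32_t cp, char buf[4])
{
    assert(buf != NULL);
    if (cp < 0x80) {
        buf[0] = (char)cp;
        buf[1] = '\0';
        return;
    }
    const char *s = ascii_fallback(cp);
    strcpy(buf, s != NULL ? s : "?");
}
```

A character such as U+4E2D (a CJK ideograph) has no entry in the table, which is the "cannot be rendered, even approximately" case.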

Revision history

0.1.3, March 2016: Fixed a bug that caused epub2txt to fail when XML files contained a UTF-8 BOM.
0.1.2, September 2015: Fixed a bug that caused strings like "%222022020", which might legitimately appear in URLs, to be treated as text length specifiers.
0.1.1, April 2015: Fixed some bugs with integer sizes that caused problems on 64-bit systems.
0.0.1: First functional release.

Downloads

Please read the installation instructions before downloading. Note also that only the source bundle is sure to be up-to-date; the binaries depend on the availability of specific build platforms, and always lag the source by a minor version or two.

Source code bundle for all platforms
Compiled binary for Cygwin on Windows
The latest source can also be checked out from github.

Further information

epub2txt man page

Author and legal

epub2txt is maintained by Kevin Boone, and distributed under the terms of the GNU General Public Licence, v2.0. Essentially, this means that you may use this software as you wish, at your own risk, provided that the original author continues to be acknowledged.

Please report bugs, etc., using the details on the contact page.

Copyright © 1994-2015 Kevin Boone. Updated May 18 2017