epub2txt — Extract text from EPUB documents
Version 0.1.3, March 2016
What is this?
epub2txt is a simple command-line utility for
extracting text from
EPUB documents and, optionally, re-flowing it to fit a text display
of a particular number of columns. It is written entirely in ANSI-standard
C, and should run on any Unix-like system with a C compiler. It is
intended for reading EPUB e-books on embedded systems that can't host a
graphical EPUB viewer, or converting such e-books to read on those systems.
However, it should be robust enough for other purposes, such as batch
indexing of EPUB document collections.
epub2txt favours speed and low memory usage over
accuracy of rendering. Most of the formatting of the source document
will be lost but, with a text-only display, this is likely to be of little consequence.
This utility is specifically written to have no dependencies on external
libraries, except the standard C library, and even on this it makes
few demands. It does expect to be able to run an "unzip" command,
however. The purpose of minimizing dependencies is to allow the
utility to build on embedded systems without needing to build a bunch of other libraries first.
epub2txt will output UTF8-encoded text by default, but can
be told to output ASCII, in which case it will try to convert non-ASCII
characters into something displayable if possible.
epub2txt is intended to run on Linux and other Unix-like
systems. It makes use of the common Unix unzip utility,
but has no other dependencies.
It builds and runs on Windows under Cygwin,
but not as a native Windows console application.
The system must be set up such that there is a temporary directory
/tmp that users can write to, unless the environment variable
TMP is set, in which case the utility
will use that instead.
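The rule above can be sketched in a few lines of C. This is only an illustration, not code from the epub2txt source: the function name choose_tmpdir is hypothetical, and treating an empty TMP as unset is my assumption.

```c
#include <stdlib.h>

/* Hypothetical helper (not from the epub2txt source): pick the directory
   for temporary files. TMP is used if it is set; otherwise fall back to
   /tmp. Treating an empty TMP as "not set" is an assumption. */
const char *choose_tmpdir(void)
{
    const char *tmp = getenv("TMP");

    if (tmp != NULL && tmp[0] != '\0')
        return tmp;
    return "/tmp";
}
```

Running the utility as TMP=/var/tmp epub2txt mydoc.epub would then place temporary files under /var/tmp rather than /tmp.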
Building and installing
epub2txt builds and installs from a simple Makefile.
On most systems, all you should need to do is
# make install
Bugs and limitations
There is no support for any form of DRM or encryption, and such support
is unlikely to be added in the future.
epub2txt only handles documents that use
UTF8 (or ASCII) encoding (but I believe that UTF8 is more-or-less
universal in EPUB),
and writes output only in UTF8 encoding,
regardless of the platform's locale. This limitation exists because
epub2txt does all its own multibyte to fixed-size
character encoding conversions
to avoid creating a dependency on an external library. Doing this for UTF8
is enough work on its own; doing it for arbitrary encodings would be considerably more.
The program can't correct errors in encoding, and there are a large number
of EPUB documents in public repositories that contain encoding errors.
A common problem is spurious use of non-UTF8 8-bit characters, often
in documents that have been converted from Microsoft Office applications.
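To give an idea of what the multibyte to fixed-size conversion involves, here is a minimal sketch of a UTF8 decoder in plain C. It is not taken from the epub2txt source, and the name utf8_decode and its return convention are hypothetical. It also shows how a spurious 8-bit character can be detected, although detecting it is not the same as correcting it.

```c
#include <stddef.h>

/* Hypothetical sketch (not from the epub2txt source): decode one UTF-8
   sequence starting at s, with n bytes available. On success, stores the
   code point in *cp and returns the number of bytes consumed (1 to 4).
   Returns 0 if the bytes are not valid UTF-8, for example a stray
   Latin-1 or Windows-1252 character. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;
    unsigned long c, min;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                      /* plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 110xxxxx: 2-byte sequence */
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 1110xxxx: 3-byte sequence */
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 11110xxx: 4-byte sequence */
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;       /* stray continuation byte or invalid lead byte */
    }

    if (n < len)
        return 0;                           /* sequence is cut short */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        c = (c << 6) | (unsigned long)(s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values beyond Unicode. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}
```

A caller would step through the input by the returned length. What to do when the function returns 0 (skip the byte, substitute a placeholder, or stop) is a policy decision that this sketch leaves open.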
epub2txt does not right-justify text, as there are already many
good utilities to do this. A simple approach is to pipe the output
into nroff, without specifying a width to epub2txt.
Not specifying a width turns off line-breaking in
epub2txt, leaving nroff to justify the paragraphs.
It will probably
also be necessary to use the -a switch, because
nroff does not
handle UTF8 text very well. For example:
epub2txt -a mydoc.epub | nroff
epub2txt extracts text aggressively, and will include things that
cannot possibly be rendered properly in plain text. This includes constructs
like indices and tables of contents, which will be of little use. The captions
of pictures will also likely be included, even though the pictures themselves
cannot. It seemed
better to err on the side of extracting too much text than too little;
unfortunately there is little in the EPUB format to distinguish content that
is meaningful in a text-only representation from that which is not.
It is unlikely that any kind of fixed-layout structure of the
source document will be rendered accurately in plain text, so
epub2txt does not try. Tabs and other layout elements are converted
into spaces, and text re-flowed according to the set line length, if any.
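The kind of re-flowing described here can be sketched as a greedy word wrap. This is an illustration under my own assumptions, not the algorithm epub2txt actually uses: the function name reflow is hypothetical, a width of 0 is taken to mean "no line-breaking", and columns are counted in bytes, which is only accurate for ASCII text.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch (not from the epub2txt source): copy the words of
   'in' to 'out', collapsing every run of whitespace (spaces, tabs,
   newlines) to a single space, and starting a new line whenever the next
   word would take the line past 'width' columns. A width of 0 disables
   line-breaking. A word longer than 'width' is left unbroken on a line
   of its own. Returns 0 on success, -1 if 'out' is too small. */
int reflow(const char *in, size_t width, char *out, size_t out_size)
{
    size_t col = 0;     /* columns used on the current output line */
    size_t o = 0;       /* next write position in out */

    if (out_size == 0)
        return -1;

    while (*in != '\0') {
        const char *word;
        size_t len;

        /* Skip whitespace, then measure the next word. */
        while (*in != '\0' && isspace((unsigned char)*in))
            in++;
        if (*in == '\0')
            break;
        word = in;
        while (*in != '\0' && !isspace((unsigned char)*in))
            in++;
        len = (size_t)(in - word);

        /* Room for one separator, the word, and the final NUL. */
        if (o + len + 2 > out_size)
            return -1;

        if (col > 0) {
            if (width > 0 && col + 1 + len > width) {
                out[o++] = '\n';
                col = 0;
            } else {
                out[o++] = ' ';
                col++;
            }
        }
        memcpy(out + o, word, len);
        o += len;
        col += len;
    }
    out[o] = '\0';
    return 0;
}
```

Counting bytes rather than displayed characters keeps the sketch short. A version intended for UTF8 text would need to count decoded characters instead.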
Conversion of Unicode to ASCII is, in the general case, impossible. The
--ascii switch tells
epub2txt to perform some
common conversions, such as straight quotes for angled quotes.
It will also attempt to replace accented Latin characters with non-accented
equivalents, at least for commonly-used characters. However, there are
a huge number of characters in the Unicode set that cannot be rendered,
even approximately, in ASCII.
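A table-driven approach to these conversions might look like the sketch below. It is illustrative only: the name ascii_approx, the particular characters covered, and the use of 0 to mean "no approximation available" are all my assumptions rather than a description of what epub2txt does internally.

```c
/* Hypothetical sketch (not from the epub2txt source): return an ASCII
   approximation of a Unicode code point, or 0 if this small table has
   no approximation for it. Only a handful of common cases are covered. */
int ascii_approx(unsigned long cp)
{
    if (cp < 0x80)
        return (int)cp;                 /* already ASCII */

    switch (cp) {
    /* Angled single and double quotes become straight quotes. */
    case 0x2018: case 0x2019:
        return '\'';
    case 0x201C: case 0x201D:
        return '"';

    /* Common accented lower-case Latin letters lose their accents. */
    case 0x00E0: case 0x00E1: case 0x00E2:
    case 0x00E3: case 0x00E4: case 0x00E5:
        return 'a';
    case 0x00E7:
        return 'c';
    case 0x00E8: case 0x00E9: case 0x00EA: case 0x00EB:
        return 'e';
    case 0x00EC: case 0x00ED: case 0x00EE: case 0x00EF:
        return 'i';
    case 0x00F1:
        return 'n';
    case 0x00F2: case 0x00F3: case 0x00F4:
    case 0x00F5: case 0x00F6:
        return 'o';
    case 0x00F9: case 0x00FA: case 0x00FB: case 0x00FC:
        return 'u';

    default:
        return 0;                       /* nothing sensible in ASCII */
    }
}
```

The default case is where the general impossibility shows up: for most of Unicode there is simply no entry to add, so the caller has to decide whether to drop the character or print some placeholder.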
0.1.3, March 2016
Fixed a bug that caused epub2txt to fail when XML files contained a
0.1.2, September 2015
Fixed a bug that caused strings like
"%222022020," which might legitimately appear in URLs, to be treated as
text length specifiers.
0.1.1, April 2015
Fixed some bugs with integer sizes that caused problems on 64-bit systems.
First functional release
Please read the installation instructions before downloading. Note also
that only the source bundle is sure to be up-to-date; the binaries depend
on the availability of specific build platforms, and always lag the
source by a minor version or two.
Source code bundle for all platforms
Compiled binary for Cygwin on Windows
The latest source can also be checked out from GitHub.
epub2txt man page
Author and legal
epub2txt is maintained by Kevin Boone, and distributed under the terms
of the GNU General Public Licence, v2.0. Essentially, this means that you may
use this software as you wish, at your own risk, provided that the
original author continues to be acknowledged.
Please report bugs, etc., using the details on the