epub2txt reference

epub2txt - Extract text from EPUB documents


epub2txt [options] {files...}



epub2txt is a simple utility for extracting text from EPUB documents. It is mainly intended for reading EPUB e-books on systems that cannot run a graphical EPUB viewer, and favours speed over optimal rendering of complex layout.

The output is to stdout; if multiple files are specified, they are simply processed sequentially. Unless otherwise specified, the character encoding of the output is the same as for the EPUB source, which is invariably UTF-8. However, epub2txt can attempt to output plain ASCII if so instructed.



Converts Unicode characters in the EPUB document that have close ASCII equivalents to ASCII, and replaces any others with '?'. For example, the Unicode left single quote is rendered as an ASCII straight quote. This option is intended for use when feeding the output of epub2txt into another utility that cannot deal with UTF-8 encoding.
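
The ASCII conversion described above can be sketched in Python roughly as follows. This is only an illustration, not epub2txt's own code: the function name to_ascii and the contents of the CLOSE_EQUIVALENTS table are assumptions, since the set of characters the tool actually maps is not listed here.

```python
# Hypothetical table of "close ASCII equivalents" (an assumption for
# illustration; the real tool's table may differ).
CLOSE_EQUIVALENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
}


def to_ascii(text: str) -> str:
    """Map known characters to ASCII and replace all other non-ASCII with '?'."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                     # already ASCII
        elif ch in CLOSE_EQUIVALENTS:
            out.append(CLOSE_EQUIVALENTS[ch])  # close equivalent exists
        else:
            out.append("?")                    # no equivalent known
    return "".join(out)
```

With this sketch, a word wrapped in typographic single quotes comes out wrapped in straight quotes, and an accented letter that is not in the table becomes '?'.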

Set the level of debugging information, from 0 (none) to 4 (extremely detailed tracing).

If no output width is specified, then this option bypasses epub2txt's processing of whitespace. Normally whitespace is trimmed from the beginnings of lines, and multiple whitespace in the middle of a line is condensed to a single space. Bypassing whitespace trimming makes operation considerably faster on long documents, but is really only useful if the output of epub2txt is being further processed by something else that formats the text. This option is incompatible with -w.
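
As a rough sketch of the normal whitespace processing that this option bypasses, the Python below drops leading whitespace from a line and condenses runs of spaces and tabs to a single space. The function name trim_line is hypothetical, and the handling of trailing whitespace is an assumption, since it is not described above.

```python
import re


def trim_line(line: str) -> str:
    """Strip leading whitespace, then collapse runs of spaces/tabs to one space."""
    # Trailing whitespace is only collapsed, not removed (an assumption).
    return re.sub(r"[ \t]+", " ", line.lstrip())
```

Skipping a step like this for every line is where the time saving on long documents would come from.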

Write out the paragraph number every {count} paragraphs. The paragraph number is written in the form *** PARA NNN, to make it noticeable. --paras and --start provide a simple way to resume reading a document from a specific point.

Start output from paragraph {para} in the source document. The --paras option tells epub2txt to print the paragraph number at particular intervals; this paragraph number can then be used as the argument to --start to resume reading from a particular point. The value of --paras depends on the amount of text that will fit on the screen and, to some extent, the typical length of a paragraph in the source document. Too small a value makes the paragraph marks intrusive when reading; too large and it can be quite difficult to find the current paragraph number. Note that epub2txt specifies document position in paragraphs rather than lines or pages, because paragraphs are a feature of the source document, whilst lines and pages will vary according to the amount of text that fits on the screen.
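
The interaction between --paras and --start can be illustrated with the Python sketch below. It is not taken from epub2txt: the function name emit is hypothetical, and both the zero-based numbering and the placement of the marker before its paragraph are assumptions. Only the "*** PARA NNN" marker form comes from the description above.

```python
from typing import Iterable, Iterator


def emit(paragraphs: Iterable[str], start: int = 0, count: int = 0) -> Iterator[str]:
    """Yield paragraphs from number `start`, with a marker every `count` paragraphs."""
    for number, para in enumerate(paragraphs):  # zero-based numbering (assumed)
        if number < start:
            continue                            # skip everything before the resume point
        if count > 0 and number % count == 0:
            yield f"*** PARA {number}"          # marker precedes its paragraph (assumed)
        yield para
```

Because the markers carry paragraph numbers from the source document rather than line or page numbers, the same number remains valid as a resume point whatever the screen size.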

Format the output to fit into a specified width. If this option is omitted, or is set to zero, then the output is assumed to be of unlimited width. Otherwise, line breaks are inserted to keep the output within the specified width. For direct reading, it is usually helpful to set a width. For situations where the output is being processed by another application, it is nearly always better to omit this option, and let the target application handle line breaking or justification itself. This is because in a plain text document, there is no way for the target application to distinguish line breaks inserted by epub2txt for formatting from line breaks that were part of the document structure (e.g., between paragraphs).
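
The effect of a width setting can be approximated with Python's standard textwrap module, as in the sketch below. The function name format_paragraph is hypothetical, and textwrap is a stand-in for whatever line-breaking logic epub2txt uses internally.

```python
import textwrap


def format_paragraph(para: str, width: int = 0) -> str:
    """Wrap one paragraph to `width` columns; zero means unlimited width."""
    if width <= 0:
        return para                          # no line breaks inserted
    return textwrap.fill(para, width=width)  # break lines at word boundaries
```

Once wrapped, the inserted newlines are ordinary newline characters, which is why a downstream program cannot tell them apart from breaks that belong to the document structure.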

Displays version and copyright information.



epub2txt is maintained by Kevin Boone, and is open source under the terms of the GNU General Public Licence, version 2.0. There is no warranty of any kind.