eBook Tools
This is a collection of bash shell scripts for automated and semi-automated organization and management of large ebook collections. It contains the following tools:
- organize-ebooks.sh is used to automatically organize folders with potentially huge amounts of unorganized ebooks. This is done by renaming the files with proper names and moving them to other folders:
  - By default it searches the supplied ebook files for ISBNs, downloads the book metadata (author, title, series, publication date, etc.) from online sources like Goodreads, Amazon and Google Books and renames the files according to a specified template.
  - If no ISBN is found, the script can optionally search for the ebooks online by their title and author, which are extracted from the filename or file metadata.
  - Optionally an additional file that contains all the gathered ebook metadata can be saved together with the renamed book so it can later be used for additional verification, indexing or processing.
  - Most ebook types are supported: .epub, .mobi, .azw, .pdf, .djvu, .chm, .cbr, .cbz, .txt, .lit, .rtf, .doc, .docx, .pdb, .html, .fb2, .lrf, .odt, .prc and potentially others. Even compressed ebooks in arbitrary archive files are supported. For example a .zip, .rar or other archive file that contains the .pdf or .html chapters of an ebook can be organized without a problem.
  - Optical character recognition (OCR) can be automatically used for .pdf, .djvu and image files when no ISBNs were found in them by the fast and straightforward conversion to .txt. This is very useful for scanned ebooks that only contain images or were badly OCR-ed in the first place.
  - Files are checked for corruption (zero-filled files, broken pdfs, corrupt archive, etc.) and corrupt files can optionally be moved to another folder.
  - Non-ebook documents, pamphlets and pamphlet-like documents like saved webpages, short pdfs, etc. can also be detected and optionally moved to another folder.
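
The ISBN search described above only works if real ISBNs can be told apart from arbitrary digit runs, which is what the check digit is for. The following is a minimal sketch of that validation, not the code used by the scripts: the function names is_valid_isbn and find_valid_isbns, the candidate regular expression and the restriction of ISBN-13 to the 978/979 prefixes are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not taken from lib.sh.

# Return 0 when $1 is a valid ISBN-10 or ISBN-13 (hyphens and spaces allowed).
is_valid_isbn() {
    local isbn="${1//[- ]/}"   # drop hyphens and spaces
    local sum=0 i digit
    if [[ "$isbn" =~ ^[0-9]{9}[0-9Xx]$ ]]; then
        # ISBN-10: weights 10..1, the total must be divisible by 11.
        # A trailing 'X' stands for the value 10.
        for ((i = 0; i < 10; i++)); do
            digit="${isbn:i:1}"
            [[ "$digit" == [Xx] ]] && digit=10
            sum=$((sum + digit * (10 - i)))
        done
        ((sum % 11 == 0))
    elif [[ "$isbn" =~ ^97[89][0-9]{10}$ ]]; then
        # ISBN-13: weights alternate 1,3,1,3..., the total must be divisible by 10.
        for ((i = 0; i < 13; i++)); do
            digit="${isbn:i:1}"
            sum=$((sum + digit * (i % 2 == 0 ? 1 : 3)))
        done
        ((sum % 10 == 0))
    else
        return 1
    fi
}

# Read text on stdin and print each distinct valid ISBN, one per line.
# The candidate pattern is deliberately loose; the check digit does the filtering.
find_valid_isbns() {
    grep -oE '[0-9][0-9 -]{8,15}[0-9Xx]' | while read -r candidate; do
        is_valid_isbn "$candidate" && echo "${candidate//[- ]/}"
    done | sort -u
}
```

For example, the text of a book could be piped through the sketch with something like: cat book.txt | find_valid_isbns. Validating the check digit before any online lookup keeps page numbers, phone numbers and similar digit runs from being queried as if they were ISBNs.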
<p> <a rel="nofollow noopener" target="_blank" href="https://asciinema.org/a/147116"></a> </li> <li> <code>interactive-organizer.sh</code> can be used to interactively and manually organize ebook files quickly. A good use case is the organization of the files that could not be automatically organized by the <code>organize-ebooks.sh</code> script. It can also be used to semi-automatically verify the files organized by the above script and potentially reorganize some of them:</p> <ul dir="auto"> <li> If <code>organize-ebooks.sh</code> was called with <code>--keep-metadata</code>, the interactive organizer compares the old filename with the new one and shows suspicious differences between the two. Wrongly renamed files can be interactively renamed with this script. </li> <li> There is a quick mode that skips files with names that contain all of the original filename’s tokens. Differences due to <a rel="nofollow noopener" target="_blank" href="https://en.wikipedia.org/wiki/Diacritic">diacritical marks</a> and truncated words are handled intelligently. A list of allowed differences can be configured and interactively updated while organizing the books. </li> <li> The script can restore files to their original location or move them to one of many different pre-configurable output folders. </li> <li> Ebooks can be converted to <code>.txt</code> and shown with <code>less</code> directly in the current terminal or they can be opened with an external viewer without exiting from the interactive organization. </li> <li> Books can be semi-automatically renamed by looking up their metadata (by ISBN or title) online. </li> </ul> </li> <li> <code>find-isbns.sh</code> tries to find <a rel="nofollow noopener" target="_blank" href="https://en.wikipedia.org/wiki/International_Standard_Book_Number#Check_digits">valid ISBNs</a> inside a file or in <code>stdin</code> if no file was specified. 
Searching for ISBNs in files uses progressively more resource-intensive methods until some ISBNs are found, see the documentation <a rel="nofollow noopener" target="_blank" href="#searching-for-isbns-in-files">below</a> for more details. </li> <li> <code>convert-to-txt.sh</code> converts the supplied file to a text file. It can optionally also use OCR for <code>.pdf</code>, <code>.djvu</code> and image files. </li> <li> <code>rename-calibre-library.sh</code> traverses a calibre library folder and renames all the book files in it by reading their metadata from calibre’s <code>metadata.opf</code> files. </li> <li> <code>split-into-folders.sh</code> splits the supplied ebook files (and the accompanying metadata files if present) into folders with consecutive names that each contain the specified number of files. </li></ul> <p> All of the tools use a library file <code>lib.sh</code> that has useful functions for building other ebook management scripts. More details for the different script options and parameters can be found in the <a rel="nofollow noopener" target="_blank" href="#usage-options-and-configuration">Usage, options and configuration</a> section. </p> <h1 dir="auto"> <a rel="nofollow noopener" target="_blank" id="user-content-installation-and-dependencies" class="anchor" aria-hidden="true" href="#installation-and-dependencies"></a>Installation and dependencies </h1> <p> There are two ways you can install and use the tools in this repository – <a rel="nofollow noopener" target="_blank" href="#shell-scripts">directly</a> or via <a rel="nofollow noopener" target="_blank" href="#docker">docker images</a>.<br /> Since all of the tools are shell scripts, you should be able to use them directly from source in most up-to-date GNU/Linux distributions, as long as you have the needed dependencies installed. 
They should also be usable in the <a rel="nofollow noopener" target="_blank" href="https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux">Windows Subsystem for Linux</a> and on other *nix systems like OS X and *BSD, provided you have the <strong>GNU</strong> versions of the dependencies installed.<br /> However, since non-Linux systems are officially unsupported and may have unexpected issues, <a rel="nofollow noopener" target="_blank" href="https://en.wikipedia.org/wiki/Docker_%28software%29">Docker</a> containers are the preferred way to use the scripts on those systems. The Docker images may also be easier to use than the bare scripts on non-GNU Linux distributions or on older Linux distributions like some LTS releases. </p> <h2 dir="auto"> <a rel="nofollow noopener" target="_blank" id="user-content-shell-scripts" class="anchor" aria-hidden="true" href="#shell-scripts"></a>Shell scripts </h2> <p> To install and use the bare shell scripts, follow these steps: </p> <ol dir="auto"> <li> Install the dependencies below. </li> <li> Make sure that your system has a <a rel="nofollow noopener" target="_blank" href="https://www.shellhacks.com/linux-define-locale-language-settings/">UTF-8 locale</a>. </li> <li> Clone the repository or download a release archive and extract it. </li> <li> For convenience, you may want to add the scripts folder to the <code>PATH</code> environment variable. </li> </ol> <p> You need recent versions of: </p> <ul dir="auto"> <li> <code>file</code>, <code>less</code>, <code>bash</code> 4.3+ and <strong>GNU</strong> <code>coreutils</code>, <code>awk</code>, <code>sed</code> and <code>grep</code>. </li> <li> <a rel="nofollow noopener" target="_blank" href="https://calibre-ebook.com/">calibre</a> for fetching metadata from online sources, conversion to txt (for ISBN searching) and ebook metadata extraction. 
Versions <strong>2.84</strong> and above are preferred because they let you specify which online sources the metadata is fetched from. For earlier versions you have to set <code>ISBN_METADATA_FETCH_ORDER</code> and <code>ORGANIZE_WITHOUT_ISBN_SOURCES</code> to empty strings. </li> <li> <a rel="nofollow noopener" target="_blank" href="https://sourceforge.net/projects/p7zip/">p7zip</a> for ISBN searching in ebooks that are in archives. </li> <li> Tesseract for running OCR on books. Version 4 gives better results even though it’s still in alpha. OCR is disabled by default and another engine can be configured if preferred. </li> <li> Optionally <a rel="nofollow noopener" target="_blank" href="https://poppler.freedesktop.org">poppler</a>, <a rel="nofollow noopener" target="_blank" href="http://www.wagner.pp.ru/~vitus/software/catdoc/">catdoc</a> and <a rel="nofollow noopener" target="_blank" href="http://djvu.sourceforge.net/">DjVuLibre</a> can be installed to convert <code>.pdf</code>, <code>.doc</code> and <code>.djvu</code> files respectively to <code>.txt</code> faster than calibre does. </li> <li> Optionally the <a rel="nofollow noopener" target="_blank" href="https://www.mobileread.com/forums/showthread.php?t=130638">Goodreads</a> and WorldCat xISBN calibre plugins can be installed for better metadata fetching. </li> </ul> <p> The scripts are only tested on Linux, though they should work on any *nix system that has the needed dependencies. On Arch Linux you can install everything needed with this command: </p> <pre>pacman -S file less bash coreutils gawk sed grep calibre p7zip tesseract tesseract-data-eng python2-lxml poppler catdoc djvulibre</pre> <p> Note: you can probably get much better OCR results by using the unstable 4.0 version of Tesseract. 
It is available in the <a rel="nofollow noopener" target="_blank" href="https://aur.archlinux.org/packages/tesseract-git/">AUR</a>, or you can easily build such a package yourself.<br /> Here is how to install the packages on Debian (and Debian-based distributions like Ubuntu): </p> <pre>apt-get install file less bash coreutils gawk sed grep calibre p7zip-full tesseract-ocr tesseract-ocr-osd tesseract-ocr-eng python-lxml poppler-utils catdoc djvulibre-bin</pre> <p> Keep in mind that many Debian-based distributions do not have up-to-date packages and the scripts work best when calibre’s version is at least 2.84. For earlier versions you have to set <code>ISBN_METADATA_FETCH_ORDER</code> and <code>ORGANIZE_WITHOUT_ISBN_SOURCES</code> to empty strings. </p> <h2 dir="auto"> <a rel="nofollow noopener" target="_blank" id="user-content-docker" class="anchor" aria-hidden="true" href="#docker"></a>Docker </h2> <p> The Docker image includes all of the needed dependencies, even the extra calibre plugins. There is an automatically built <a rel="nofollow noopener" target="_blank" href="https://hub.docker.com/r/ebooktools/scripts/">Docker image</a> on Docker Hub. You can pull it locally with <code>docker pull ebooktools/scripts</code>. You can also easily build the Docker image yourself: simply clone this repository (or download the latest release archive and extract it) and then run <code>docker build -t ebooktools/scripts:latest .</code> in the folder.<br /> Here are some Docker-specific usage details: </p> <ul dir="auto"> <li> You can start a Docker container with all the ebook tools by running <code>docker run -it -v /some/host/folder:/unorganized-books ebooktools/scripts:latest</code>. This will run a bash… </li> </ul>
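<p> The <code>docker run</code> command above can be wrapped in a small shell function so that the host folder does not have to be typed as an absolute path each time. This is only a sketch: the function name <code>ebooktools_shell</code> and the <code>EBOOKTOOLS_IMAGE</code> variable are hypothetical and not part of this repository. Only the image name and the <code>/unorganized-books</code> mount point come from the command above. </p>

```shell
#!/usr/bin/env bash
# Hypothetical convenience wrapper; not shipped with the ebook tools.
# Usage: ebooktools_shell /path/to/unorganized/books
ebooktools_shell() {
    local host_dir
    # Resolve the argument to an absolute path: docker treats a non-absolute
    # name given to -v as a named volume rather than as a host folder.
    host_dir="$(cd "${1:?usage: ebooktools_shell /host/folder}" && pwd)" || return 1
    # EBOOKTOOLS_IMAGE is an assumed override for locally built images.
    docker run -it \
        -v "$host_dir:/unorganized-books" \
        "${EBOOKTOOLS_IMAGE:-ebooktools/scripts:latest}"
}
```

<p> Calling <code>ebooktools_shell ./inbox</code> would then start the container with the local <code>inbox</code> folder mounted at <code>/unorganized-books</code>, assuming that folder exists. </p>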