htmls-to-datasette

htmls-to-datasette is a tool that indexes HTML files into a SQLite database so they can be searched and
viewed later. This can be useful for web archival and web clipping.
The database it creates is designed to be served with Datasette, which also lets you read the indexed
files.
I created this tool to support my own workflow:

1. Have a browser with the SingleFile extension installed.
2. When there is an interesting blog post or article, save the full web page into a single HTML file using SingleFile.
3. The .html file created in the downloads folder is moved to a common repository (via a cron job).
4. This common repository is synced to my main server (I use Syncthing for this).
5. On my personal server, all new HTML files are moved to the serving folder and this indexer is called to populate
   the search database.
6. Datasette, with a specific configuration, allows searching these files and reading them online.
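
The step that moves saved pages out of the downloads folder could be handled by a small script run from cron. The following is only a sketch of that idea, not the script I actually use: the function name `move_html_files` and both folder arguments are hypothetical, and it skips a file when one with the same name already exists in the target.

```python
from __future__ import annotations

import shutil
from pathlib import Path

# Extensions treated as saved web pages (an assumption for this sketch).
HTML_SUFFIXES = {".html", ".htm"}


def move_html_files(downloads: Path, repository: Path) -> list[Path]:
    """Move every .html/.htm file from `downloads` into `repository`.

    Returns the new paths of the files that were moved. Files whose name
    already exists in `repository` are left where they are.
    """
    repository.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(downloads.iterdir()):
        if not path.is_file() or path.suffix.lower() not in HTML_SUFFIXES:
            continue
        target = repository / path.name
        if target.exists():
            continue  # do not overwrite an earlier copy
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

A cron entry would then call a script that invokes `move_html_files` with the real downloads and repository paths.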

The indexing tool can optionally store the HTML contents in the database itself, to be served from there. Otherwise,
the files are served from the location they were indexed from.
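
To illustrate the difference between the two modes, here is a minimal sketch using Python's built-in sqlite3 module. The `pages` table, its columns, and the function names are hypothetical and chosen only for this example; they are not the schema htmls-to-datasette actually creates.

```python
import sqlite3
from pathlib import Path


def create_store(conn: sqlite3.Connection) -> None:
    # Hypothetical table: `content` stays NULL when the file is not stored.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (path TEXT PRIMARY KEY, content BLOB)"
    )


def index_file(conn: sqlite3.Connection, path: Path, store_binary: bool) -> None:
    """Record a file, copying its bytes into the database only if requested."""
    content = path.read_bytes() if store_binary else None
    conn.execute(
        "INSERT OR REPLACE INTO pages (path, content) VALUES (?, ?)",
        (str(path), content),
    )


def load_html(conn: sqlite3.Connection, path: Path) -> bytes:
    """Return the page from the database if stored, else from its original location."""
    row = conn.execute(
        "SELECT content FROM pages WHERE path = ?", (str(path),)
    ).fetchone()
    if row is None:
        raise KeyError(str(path))
    if row[0] is not None:
        return row[0]
    # Not stored in the database: the file must still exist on disk.
    return path.read_bytes()
```

With the contents stored, the database file alone is enough to serve a page; without them, `load_html` fails as soon as the original file is moved or deleted.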

Setup

Standard install

pip install htmls-to-datasette

You can now run the command. Use --help to see help for a specific command.

htmls-to-datasette --help
htmls-to-datasette index --help

Development install

This project uses Poetry to make it easier to set up the dependencies it needs to run.
Installation steps for Poetry are on its website, but in
most cases this command line will work:

curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -

Note that you should exercise caution when running something directly from the Internet.

Install dependencies:

poetry install

Run

You can put poetry run in front of htmls-to-datasette so that it uses the virtual environment you just
created.

poetry run htmls-to-datasette [options]

Build an installable package

poetry build # The result will be in the dist directory

I use pipx to install packages in isolated environments. You can install this package
from the dist/ directory in whichever way you prefer, or you can
install pipx and use it.
With pipx, the installation would be similar to:

pipx install dist/htmls-to-datasette-0.1.2.tar.gz

Usage

htmls-to-datasette index [OPTIONS] [INPUT_DIRS]... will create a database named htmlstore.db (by default).

Example

Get into the server directory:

cd server

Because this example requires Datasette to run, you have to get it using Poetry:

poetry init

Now index the example file using htmls-to-datasette:

htmls-to-datasette index input

All files contained in input (.html and .htm) will be indexed and a full-text search index created. Whenever
there are new files to index, this command can be run again in the same way.
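
For readers curious about what a full-text search index over HTML files involves, the sketch below extracts the visible text with Python's html.parser and stores it in a SQLite FTS5 table. It is an illustration under assumptions, not the tool's actual implementation: the table name `pages_fts`, the helper names, and the use of FTS5 (which must be available in your SQLite build) are all hypothetical choices. It also does not handle re-indexing files that are already present.

```python
from __future__ import annotations

import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def build_index(conn: sqlite3.Connection, documents: dict[str, str]) -> None:
    """Index {path: html} pairs; only the extracted text is searchable."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages_fts "
        "USING fts5(path UNINDEXED, body)"
    )
    conn.executemany(
        "INSERT INTO pages_fts (path, body) VALUES (?, ?)",
        [(path, html_to_text(html)) for path, html in documents.items()],
    )


def search(conn: sqlite3.Connection, query: str) -> list[str]:
    """Return the paths of matching documents, best match first."""
    rows = conn.execute(
        "SELECT path FROM pages_fts WHERE pages_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]
```

Datasette can expose a table like this for searching; the metadata and plugins used in this example directory take care of that for the real database.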
And now run the Datasette server:

poetry run datasette serve htmlstore.db -m metadata-files.json --plugins-dir=plugins

You’ll see the address to point your browser to on the screen. There is also a shortcut that makes it easier to perform a
full-text search. It should be reachable at http://127.0.0.1:8001/htmlstore/search; just put the query in the ‘q’ parameter
to search over the indexed HTML files. Clicking on an HTML file name will load its contents.
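
As a small illustration of filling in the ‘q’ parameter, the snippet below builds a search URL with Python's standard library. The helper name `search_url` is hypothetical, and the base URL assumes Datasette is running on the default address shown above.

```python
from urllib.parse import urlencode

# Assumes the local address used in this example; adjust it to your server.
BASE_URL = "http://127.0.0.1:8001/htmlstore/search"


def search_url(query: str, base_url: str = BASE_URL) -> str:
    """Return the search shortcut URL with `query` encoded in the 'q' parameter."""
    return f"{base_url}?{urlencode({'q': query})}"


print(search_url("web archival"))
# http://127.0.0.1:8001/htmlstore/search?q=web+archival
```

Opening the printed URL in a browser should show the matches for that query.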
For this to work, the server requires the files to be at their original location (relative in this case). So if the input
folder is moved away or not accessible, the files will still be searchable but their contents will not be available.
There is an additional example that stores these files in the SQLite database itself. This has the advantage that
everything needed for serving and searching the content is contained in one file.

# You should be in the server directory
rm htmlstore.db   # Remove the previous example's database
htmls-to-datasette index input --store-binary  # Index files and store their contents

# Now run Datasette. Note that we need a different metadata file, as the contents need to be served
# in a different way (from the DB itself).
poetry run datasette serve htmlstore.db -m metadata-binary.json --plugins-dir=plugins

TODO

Clear content when extracting files.
Better documentation.