About

häu·fig, 1. frequent, common

Quick-and-dirty tool to generate a lemmatized vocabulary frequency list from an EPUB ebook.

Setup

Docker - Recommended

Using the Docker development container is the most straightforward way to run Haufig until there is a turn-key solution.

Clone this repository
Install Docker
Install Visual Studio Code
Follow Microsoft's installation instructions to configure VS Code for development within a Docker container.
Start VS Code, then run the Remote-Containers: Open Folder in Container command and open the project directory.
- For more detailed instructions see the section titled Quick start: Open an existing folder in a container in the Microsoft documentation.
In .devcontainer/devcontainer.json, change the "SPACY_MODEL" value from "de_core_news_sm" to the spaCy model you'd like to use for your language.

Standalone

You will need:

A clone of this repository
A working Python 3.8 install
A spaCy install with the language model(s) you want to use
Ability to build and run .NET 5.0 projects
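Before the first standalone run, it can help to confirm that these pieces are actually present. The following is a minimal sketch and not part of Haufig: the function name `check_prerequisites` is hypothetical, it assumes the spaCy model was installed as a regular Python package (which is how `python -m spacy download` installs models), and it only checks that a `dotnet` executable is on the PATH, not that it is version 5.0.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(model="de_core_news_sm"):
    """Return a dict mapping each prerequisite to True (found) or False (missing)."""
    return {
        # The README asks for Python 3.8; newer versions are assumed to be acceptable.
        "python>=3.8": sys.version_info >= (3, 8),
        # find_spec locates a package without importing it.
        "spacy": importlib.util.find_spec("spacy") is not None,
        # Assumption: the model is installed as a package named after the model.
        f"model:{model}": importlib.util.find_spec(model) is not None,
        # `dotnet run` needs the .NET CLI on the PATH; the version is not checked here.
        "dotnet": shutil.which("dotnet") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

Pass a different model name, for example `check_prerequisites("fr_core_news_sm")`, to check the model you intend to give to `--model`.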

Usage

From within the project directory:

dotnet run --project ./src/Haufig.Cli/Haufig.Cli.fsproj --books [<book>...] [--model <model>] [--output-dir <path>] [--book-csvs]

OPTIONS:

    --books [<book>...]   Space-separated list of .epub files and/or directories to search for .epub files
    --model <model>       Name of the spaCy language model to use
    --output-dir <path>   Directory where the output CSV(s) will be written
    --book-csvs           Output individual CSVs for each book in addition to the merged output CSV
    --help                Display this list of options
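Because `--books` accepts both files and directories, it can be useful to preview which .epub files a given argument list might pick up. The sketch below is not Haufig's implementation: the helper name `collect_epubs` is hypothetical, and searching directories recursively is an assumption, since the option description does not say how deep the search goes.

```python
from pathlib import Path


def collect_epubs(inputs):
    """Expand a mix of .epub files and directories into a sorted list of .epub paths."""
    found = set()
    for raw in inputs:
        path = Path(raw)
        if path.is_dir():
            # Assumption: directories are searched recursively.
            found.update(p for p in path.rglob("*.epub") if p.is_file())
        elif path.is_file() and path.suffix.lower() == ".epub":
            found.add(path)
        # Anything else (missing paths, other file types) is silently skipped.
    return sorted(found)


if __name__ == "__main__":
    for book in collect_epubs(["ebooks/Der Tor und der Tod.epub", "ebooks/de"]):
        print(book)
```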

Examples:

$> dotnet run --project ./src/Haufig.Cli/Haufig.Cli.fsproj --books "ebooks/Der Tor und der Tod by Hugo von Hofmannsthal.epub" --output-dir "outputs/gutenberg/de" --model de_core_news_sm

$> dotnet run --project ./src/Haufig.Cli/Haufig.Cli.fsproj --books "ebooks/Der Tor und der Tod.epub" "ebooks/Sidsel Langröckchen.epub" "ebooks/de" --output-dir "outputs/gutenberg/de" --model de_core_news_sm

$> cat "outputs/gutenberg/de/results.csv" | more
count,lemma,part of speech
2270,ich,PRON
2146,der,DET
1446,und,CONJ
775,sein,AUX
553,sich,PRON
451,der,PRON
441,in,ADP
393,haben,AUX
-- More  --
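Note that the same lemma can appear on several rows when it was tagged with different parts of speech ("der" is listed as both DET and PRON above). If you want one total per lemma, the CSV can be post-processed in a few lines. This is a minimal sketch that is not part of Haufig: the function name `totals_by_lemma` is hypothetical, and the column names are taken from the header row shown above.

```python
import csv
import io
from collections import Counter

# First rows of the sample output above, inlined so the sketch runs on its own.
SAMPLE = """count,lemma,part of speech
2270,ich,PRON
2146,der,DET
1446,und,CONJ
775,sein,AUX
553,sich,PRON
451,der,PRON
441,in,ADP
393,haben,AUX
"""


def totals_by_lemma(csv_file):
    """Sum counts per lemma, merging rows that differ only in part of speech."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[row["lemma"]] += int(row["count"])
    return totals


totals = totals_by_lemma(io.StringIO(SAMPLE))
for lemma, count in totals.most_common(3):
    print(lemma, count)
# der 2597
# ich 2270
# und 1446
```

To run it on a real output file, pass an open file instead of the inlined sample, for example `open(path, newline="", encoding="utf-8")`. The UTF-8 encoding is an assumption about how the CSV is written.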

Caveats

This has barely been tested.
All sections of the ebook are processed, including the title page, copyright information, dedication, etc.
NLP is an imperfect science, so some words may not be lemmatized properly.
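To get a feel for how much front matter will end up in the counts, you can look inside the book first. An EPUB file is a ZIP archive, so Python's standard library can list the documents it contains. The sketch below is independent of Haufig: the helper name `list_content_documents` is hypothetical, and filtering by file extension is a simplifying assumption (the book's actual reading order is defined in its package document, which this sketch does not parse).

```python
import zipfile


def list_content_documents(epub_path):
    """Return the (X)HTML files inside an EPUB, in archive order."""
    with zipfile.ZipFile(epub_path) as archive:
        return [
            name
            for name in archive.namelist()
            # Assumption: content documents can be recognized by their extension.
            if name.lower().endswith((".xhtml", ".html", ".htm"))
        ]
```

Calling `list_content_documents("ebooks/Der Tor und der Tod.epub")` returns the internal file names, which often hint at title, copyright, and dedication pages.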
