A highly flexible, scalable, fault-tolerant document ingestion system designed for search.
Frequently, search projects start by feeding a few documents manually to a search engine, often via the "just for testing" built-in processing features of Solr such as SolrCell or post.jar. These features are documented and included to help users get a feel for what they can do with Solr with a minimum of setup pain.
This is good, and that's how it should be for first explorations. Unfortunately, it's also a potential trap. Large-scale ingestion of documents for search is non-trivial. Many projects outgrow these simple tools and have to throw away their early exploratory work. Nobody likes setting aside valuable work, and it's natural to resist, but the longer one clings to an insufficient tool, the bigger, more difficult, and more expensive the migration is.
Common problems are:
- It works "ok" for a small test corpus and then becomes unstable on a larger production corpus.
- The code written to feed documents into such interfaces winds up (hopefully) re-implementing standard solutions to problems that other search engineers have solved many times over the last 20 years.
- No way to recover if indexing fails or is disrupted partway through; one is forced to start again from the beginning.
- If failure is related to the size of a growing corpus, failures become increasingly common, and eventually, the search index cannot be reindexed or upgraded at all.
- Leveraging the power of modern multicore machines requires developers skilled at threading and concurrency, and the resulting bugs can be very expensive to troubleshoot, fix, and test.
- Reliance on outdated, unmaintained, or poorly maintained features such as the Data Import Handler (which has been removed from Solr 9.0+). Such features see little use at the major companies where committers often work, and consequently receive less attention and support.
JesterJ makes it super easy to start with a robust, full-featured indexing infrastructure, so that you don't have to re-invent the wheel, and you don't have to throw away your early work.
The key aspects for achieving this are simplicity, robustness, flexibility, and scalability:
- A variety of re-usable processing components are provided (flexibility)
- Scanners (active connectors) for database and filesystem data sources (simplicity)
- Custom processors only require a 4-method interface; see the sketch after this list (simplicity)
- Specialized classloading allows any version of a library in your custom code (flexibility, simplicity)
- Simplified startup: `java -jar jesterj.jar <id> <secret>` (simplicity)
- Built-in embedded Cassandra for performant persistent storage (simplicity)
- Optional auto-detection of changes to documents (flexibility, simplicity)
- Automatic fault-tolerant restart skipping previously seen documents (robustness, scalability)
- Multi-threaded processing to leverage modern machines with large numbers of cores (scalability)
- Explicit and direct control of threading, making it easy to assign more threads to heavy steps (scalability)
- Single system handling multiple data sources (flexibility, scalability, simplicity)
- Pre-baked batching of documents for efficient transmission to the search engine (scalability, simplicity)
- Directed acyclic graph (DAG) capable processing model, and graphical visualization (flexibility, simplicity)
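To give a sense of scale for the custom-processor item above, here is a minimal sketch of what one might look like. It is an illustration under assumptions: the interface shown approximates JesterJ's DocumentProcessor, the Document accessors and method names should be checked against the wiki, and the remaining methods of the 4-method interface are omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: the interface approximates JesterJ's
// DocumentProcessor, and the Document accessors are assumptions.
// Check the wiki for the exact four-method contract.
public class TitleUpperCaseProcessor implements DocumentProcessor {

  // Identifies this processor instance in the plan and in logs.
  @Override
  public String getName() {
    return "title-upper-case";
  }

  // Transforms one document. Returning an array lets a processor drop a
  // document (empty array) or split it into several child documents.
  @Override
  public Document[] processDocument(Document document) {
    // Fields are multi-valued; upper-case every "title" value.
    List<String> titles = new ArrayList<>(document.get("title"));
    document.removeAll("title");
    for (String t : titles) {
      document.put("title", t.toUpperCase(Locale.ROOT));
    }
    return new Document[]{document};
  }
}
```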
DAG-structured processing is a key feature that is not provided by other tools. Most other tools require a linear pipeline structure, which can become limiting. As time passes, features and enhancements often add complexity. Multiple data sources are also a common dimension for growth. With other systems, you wind up deploying a system per data source. JesterJ is designed to handle complex indexing scenarios.
Consider the following hypothetical indexing workflow, where the system has evolved from a simple linear ingestion into a single index:
- The source data format changed, effectively creating a new data source (old data may need reindexing)
- An external system needed to know that the document was received
- Product features required a faster, optimized line-item-only search index
- New features were added to the product that required block-join indexing, but old features couldn't be migrated, so a new index was required.
- Two new systems also wanted to be notified
In other tools, this means six indexing processes (two sources times three indexes), all of which need to send messages and none of which are coordinated if one fails. In JesterJ, it is all one coherent system.
JesterJ handles such scenarios with a single centralized processing plan, and there is no need to deploy new indexing infrastructure. Furthermore, JesterJ ensures that if the system is unplugged partway through indexing, you won't get a second "order received" message for anything it processed previously (fault tolerance). The default mode for JesterJ is to ensure at-most-once delivery for steps that are not marked safe or idempotent. Safe steps have no external effects, and idempotent steps may be repeated en route to the final processing endpoint.
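To make that concrete, here is a rough sketch of how the scenario above might be wired as a single DAG plan. The builder style mirrors JesterJ's plan API, but every class and method name below is an assumption for illustration; the wiki documents the real API.

```java
// Hypothetical wiring of the scenario above: two sources converge on a
// shared step that fans out to notifications and three indexes. Class
// and method names approximate JesterJ's plan builders but are
// assumptions; the wiki documents the real API.
public class OrdersPlan {

  // Hypothetical helper: a named step with its processor config omitted.
  static StepImpl.Builder step(String name) {
    return new StepImpl.Builder().named(name);
  }

  public Plan buildPlan() {
    return new PlanImpl.Builder()
        .named("orders")
        .withIdField("id")
        .addStep(step("old-format-scanner"))            // original source
        .addStep(step("new-format-scanner"))            // new source
        // Both sources converge on one normalization step...
        .addStep(step("normalize"), "old-format-scanner", "new-format-scanner")
        // ...which fans out to external notification and three indexes.
        .addStep(step("notify-received"), "normalize")  // external effect: at most once
        .addStep(step("main-index"), "normalize")
        .addStep(step("line-item-index"), "normalize")
        .addStep(step("block-join-index"), "normalize")
        .build();
  }
}
```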
The best place to start learning more is the documentation in the wiki.
Current release: 1.0. This is the best version to use, and should be fully functional.
NOTE: The current release targets any design and load that a single machine can service. JesterJ is explicitly designed to spread the load across many machines in future releases. You can design your plan with duplicates of your slowest step to alleviate bottlenecks; currently, each duplicate implies an additional thread working on that step, and in future versions the duplicates will also be distributed across machines.
Presently, only JDK 11 has been tested regularly. Unit tests have passed on JDK 17, but the initial system startup and custom class loading are the most JDK-sensitive parts, so we welcome feedback on experiences with more recent JDK versions. Any distribution of JDK 11 should work. Support for Java 17 and future LTS versions is among our highest priorities for future releases. Building with the latest uno-jar version may be sufficient, but this is not yet certified (see nsoft/uno-jar#37).
Discuss features, ask questions, etc., on Discord: https://discord.gg/RmdTYvpXr9
This release includes the following features:
- Ability to visualize the structure of your plan (.dot or .png format; example from unit tests here)
- Simple filesystem scanner for locally mounted drives (replacement for post.jar)
- JDBC scanner (replacement for Data Import Handler!)
- Scanners can remember what documents they've seen (or not, boolean flag)
- Scanners can recognize updated content (or not, boolean flag)
- Send to Solr processor with tunable batch sizes
- Tika processor to extract content from Word/PDF/XML/HTML, etc. (replacement for SolrCell!)
- Stax extract processor for dissecting XML documents directly.
- Copy field processor to rename source fields to the desired index field
- Regexp replace processor to edit field content, or drop fields that don't match
- Split field processor to split delimited values for multi-value fields
- Drop field processor to get rid of annoying excess fields.
- Field template processor for composing field content using a Velocity template
- URL encode processor to encode the value of a field and make it safe for use in URLs
- Fetch URL processor for acquiring or enhancing content by contacting other systems
- Log and drop processor for when you identify an invalid document
- Date Reformat processor, because dates, formatting... always. (sigh)
- Human Readable File Size processor
- Pre-Analyze processor to move Solr analysis workload out of Solr (just give it your schema.xml!)
- Embedded Cassandra server (no need to install Cassandra yourself!)
- Cassandra config and data location configurable; defaults to `~/.jj/cassandra`
- Support for fault tolerance, writing status change events to the embedded Cassandra server
- Support for user-written document processors. (see documentation)
- 70% test coverage (JaCoCo)
- Simple, single Java file to configure everything; non-Java programmers need only follow a simple example (for use cases not requiring custom code)
- If you DO need custom code, that code can be packaged as an uno-jar to provide all required dependencies and escape from any library versions that JesterJ uses! You only have to deal with your OWN jar hell, not ours! Of course, you can also rely on whatever we already provide. The classloaders for custom code prefer your uno-jar and then default to whatever JesterJ has available on its classpath.
- Runnable example to execute a plan that scans a filesystem and indexes the documents in Solr; see the sketch below
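For that runnable example, a single-file configuration might look roughly like the sketch below. Class names such as SimpleFileScanner, TikaProcessor, and SendToSolrCloudProcessor follow the feature list above, but the specific builders, options, and flags shown are assumptions; the shipped example and the wiki have the real code.

```java
// Rough sketch of a single-file plan: scan a directory, extract text
// with Tika, and send batches to Solr. Builder and option names below
// are assumptions approximating the shipped example; check the wiki
// and the example code for the real API.
public class FileSystemToSolrPlan {

  public Plan buildPlan() {
    StepImpl.Builder scan = new StepImpl.Builder().named("scan-files")
        .withProcessor(new SimpleFileScanner.Builder()
            .named("doc-dir")
            .withRoot(new java.io.File("/data/docs")) // assumed option
            .rememberScannedIds(true)                 // assumed flag: skip previously seen docs
            .detectChangedContent(true));             // assumed flag: re-index updated docs

    StepImpl.Builder extract = new StepImpl.Builder().named("extract")
        .withProcessor(new TikaProcessor.Builder().named("tika"));

    StepImpl.Builder send = new StepImpl.Builder().named("send-to-solr")
        .withProcessor(new SendToSolrCloudProcessor.Builder()
            .named("solr")
            .withZookeeper("localhost:9983")          // assumed connection option
            .sendingBatchesOf(100));                  // assumed batch-size option

    return new PlanImpl.Builder()
        .named("fs-to-solr")
        .withIdField("id")
        .addStep(scan)
        .addStep(extract, "scan-files")
        .addStep(send, "extract")
        .build();
  }
}
```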
Release 1.0 is intended to be usable for small to medium-sized projects (tens of millions of documents or low hundreds of millions of documents with some patience).
The best guess at any time of what will be in future releases is given by the milestone filters on our issues page.