This repository provides a generalized library to train, test, and use machine learning models. Specifically it:
- Wraps Weka 3.8.
- Automates any combination of classifiers and features.
- Sorts and prints results in many formats and levels of detail.
- Generates Excel spreadsheet files of multiple run results.
- Provides two pass cross validation.
- Integrates with the dataset library.
In your project.clj file, add the library as a dependency.
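The snippet below is a hedged sketch only: the group/artifact and version are placeholders, so use the exact coordinates published for this library (for example on its Clojars page).

(defproject example/sa-app "0.1.0-SNAPSHOT"
  ;; placeholder coordinates -- substitute the library's published
  ;; group/artifact and the latest released version
  :dependencies [[org.clojure/clojure "1.8.0"]
                 [com.zensols.ml/model "x.y.z"]])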
API documentation is available.
See the example repo that illustrates how to use this library and contains the code from where these examples originate. It's highly recommended to clone it and follow along as you peruse this README.
To create, validate, test, and utilize a model you must do the following:
- Create the corpus
- Create features
- Create the model configuration
- Create the model
- Evaluate the model
- Use the model
- Test the model
- Automate testing and check for overfitting
Note that this example (like clj-ml-dataset) uses natural language processing, but the library was written to be general purpose, so non-NLP projects can use it as well.
Before we can do anything, we need an annotated corpus since we'll be using supervised learning methods. To do that, use the machine learning dataset library to pre-parse all utterances in the annotated corpus (follow the readme and create the zensols.example.anon-db namespace). You'll also need to start a Docker instance for the Elasticsearch server as detailed in the docs.
First we have to generate the features that will be used in our model and train
our classifier. We'll generate our features from details that are parsed from
English utterances for our example so we'll use the
NLP library to parse and generate
those features from the pre-parsed utterances stored in Elasticsearch using the
clj-ml-dataset
library:
(ns zensols.example.sa-feature
(:require [zensols.nlparse.parse :as p]
[zensols.nlparse.feature :as fe]
[zensols.example.anon-db :as adb]
[zensols.model.execute-classifier :refer (with-model-conf)]))
(defn create-features
([panon]
(create-features panon nil))
([panon context]
(let [tokens (p/tokens panon)]
(merge (fe/verb-features (->> panon :sents first))
(fe/token-features panon tokens)
(fe/pos-tag-features tokens)
(fe/dictionary-features tokens)
(fe/tree-features panon)
(fe/srl-features tokens)))))
(defn create-feature-sets []
(->> (adb/anons)
(map #(merge {:sa (:class-label %)
:utterance (->> % :annotation :text)}
(create-features (:annotation %))))))
(defn feature-metas []
(concat (fe/verb-feature-metas)
(fe/token-feature-metas)
(fe/pos-tag-feature-metas)
(fe/dictionary-feature-metas)
(fe/tree-feature-metas)
(fe/srl-feature-metas)))
(defn- class-feature-meta []
[:sa ["answer" "question" "expressive"]])
In this example we call adb/anons
to return the parsed corpus data (see the
annotation library documentation for how to generate the corpus cache).
Next we create the model configuration (not the model yet). The configuration gives the framework what it needs to create the feature set and to generate the weka.core.Instances object Weka uses to create, test, and utilize the model.
(defn create-model-config []
{:name "speech-act"
:create-feature-sets-fn create-feature-sets
:create-features-fn create-features
:feature-metas-fn feature-metas
:class-feature-meta-fn class-feature-meta
:model-return-keys #{:label :distributions :features}})
The model configuration is a map that refers to functions we already created and some other metadata.
Next we define our features and classifiers.
After the namespace declaration we define feature-sets-set, which is a two level hierarchy of features that have the same names as those given in the feature-metas function. The levels are:
- Feature metadata sets set: a list of lists; each list is iterated over while cross validating to find the feature set that best fits the model.
- Feature metadata set: the list of features used to create a model for the current feature metadata set iteration.
We create a classifiers binding to store what genres of classifiers we want to use. See the classifiers dynamic binding for more information.
(ns zensols.example.sa-eval
(:require [zensols.model.execute-classifier :refer (with-model-conf)]
[zensols.model.eval-classifier :as ec])
(:require [zensols.example.sa-feature :as sf]))
(defn feature-sets-set []
{:set-1 '((token-count))
:set-2 '((token-count
pos-tag-ratio-verb
pos-tag-ratio-adverb
pos-tag-ratio-noun
pos-tag-ratio-adjective))
:set-3 '((token-count stopword-count)
(token-count
pos-tag-ratio-noun
pos-tag-ratio-wh
pos-first-tag
stopword-count))})
(def classifiers [:zeror :fast])
Next we add an atom to store the weka.core.Instances
object so we can speed
up our feature/classifier configuration without having to regenerate feature
sets for each model testing iteration.
(def cross-fold-instances-inst (atom nil))
Finally we extend the model configuration with the Instances
atom and our
feature metadata sets.
(defn- create-model-config []
(merge (sf/create-model-config)
{:cross-fold-instances-inst cross-fold-instances-inst
:feature-sets-set (feature-sets-set)}))
While this step isn't necessary, you'll want to do it to see how well the model performs and optimize it by changing the feature set or swapping and/or tweaking the classifier.
Technically speaking, the actual in memory model is not yet created, but we have now set up everything the framework needs to use it.
Let's start by writing out an ARFF file:
(with-model-conf (create-model-config)
(ec/write-arff))
By default the system uses cross validation. To train the model with the training data and then test on the test set, you must bind *default-set-type* to :train-test.
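For example, a minimal sketch of that binding; *default-set-type* is assumed here to be resolvable through the eval-classifier alias (ec), so check the API documentation for its actual home namespace:

;; train on the training split and test on the test split instead of
;; cross validating (namespace of *default-set-type* assumed to be ec)
(binding [ec/*default-set-type* :train-test]
  (with-model-conf (create-model-config)
    (ec/terse-results classifiers :set-3 :only-stats? true)))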
Note that we need to wrap everything in a with-model-conf
, which is the way
the framework receives our model configuration. In practice you'll wrap more
than just one statement and do several things in the lexical context of a
with-model-conf
.
Now let's invoke a cross validation and get just an F-measure score:
(with-model-conf (create-model-config)
(ec/terse-results classifiers :set-3 :only-stats? true))
This performs a ten fold cross validation using two feature sets:
- token-count, stopword-count
- token-count, pos-tag-ratio-noun, pos-tag-ratio-wh, pos-first-tag, stopword-count
For both feature sets it tests with the zeror and fast classifiers. The first, zeror, is a majority rule classifier usually used to generate a baseline against which to gauge relative performance gains. The fast genre is a group of classifiers that train quickly. See the classifiers dynamic binding for more information on classifier genres.
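As a small sketch, you can also run the baseline and the fast genre as separate evaluations over the same feature metadata set to make the comparison explicit (this just reuses the terse-results call shown above):

;; evaluate the majority-rule baseline and the fast genre separately so the
;; relative performance gain is easy to read off
(with-model-conf (create-model-config)
  (doall
   (map #(ec/terse-results [%] :set-3 :only-stats? true)
        [:zeror :fast])))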
You might have a large dataset and choose to use classifiers that take a long time to train. This is all multiplied by the feature metadata set cardinality, which drastically compounds the run time. In these situations you might want to leave it running for a while and generate a spreadsheet report as output. This report contains the feature sets, the classifiers used, and performance metrics.
(with-model-conf (create-model-config)
(ec/eval-and-write classifiers :huge-meta-set))
Once you're happy with the performance of your model you can save it and use it in the same or different JVM instance.
(with-model-conf (create-model-config)
(->> (ec/create-model classifiers :set-best)
ec/train-model
ec/write-model))
This creates a binary model file in the directory you've configured as the model output directory. For more information on how to configure it, see the code example in the example project repository.
The information encoded in this file includes:
- The trained classifier
- The features of the model
- Performance metrics like F-measure, recall, precision, predictions
- The context created with the model configuration's :context-fn function (see the sketch below)
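As a hedged sketch of the last item: the :context-fn entry below and the map it returns are hypothetical additions for illustration. The framework stores whatever the function returns with the model, and it is presumably what the 2-arity create-features receives as its context argument.

;; hypothetical variant of the sa-feature create-model-config shown earlier
;; with a :context-fn added; the context map contents are made up
(defn create-model-config []
  {:name "speech-act"
   :create-feature-sets-fn create-feature-sets
   :create-features-fn create-features
   :feature-metas-fn feature-metas
   :class-feature-meta-fn class-feature-meta
   :model-return-keys #{:label :distributions :features}
   :context-fn (fn [] {:corpus-version "2016-07-25"})})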
First let's create a namespace to work with our new model and a function to create that model:
(ns zensols.example.sa-model
(:require [zensols.model.execute-classifier :as exc :refer (with-model-conf)]
[zensols.nlparse.parse :as p])
(:require [zensols.example.sa-feature :as sf]))
(def model-inst (atom nil))
(defn- model []
(swap! model-inst
#(or %
(with-model-conf (sf/create-model-config)
(exc/prime-model (exc/read-model))))))
Since the details of the previous model are encoded in binary, you won't be able to look at the file to make sense of it. However, you can output the contents of the model (to the REPL and a file respectively), including everything mentioned in the previous section:
(exc/print-model-info (model))
(exc/dump-model-info (model))
which yields:
instances-total: 382.0
instances-correct: 366.0
instances-incorrct: 16.0
name: speech-act
create-time: Mon Jul 25 12:19:24 CDT 2016
accuracy: 95.81151832460733
wprecision: 0.9585456987726836
wrecall: 0.9581151832460733
wfmeasure: 0.9580599320074707
features:
(:token-count :pos-tag-ratio-noun :pos-tag-ratio-wh :pos-first-tag :pos-last-tag :stopword-count)
classifier: ...
Finally, we can parse an utterance and use its features to classify our speech act:
(->> (p/parse "when are we getting there")
(exc/classify (model))
pprint)
which yields:
{:features-set
{:pos-last-tag "RB",
...
:pos-tag-ratio-wh 0},
:label "question",
:distributions
{"answer" 0.033959200182016515,
"question" 0.9306213420827363,
"expressive" 0.03541945773524721}}
This gives us all the results we asked for in the :model-return-keys of our create-model-config function in our feature namespace, which are:
- :features: the single instance features given to the model for classification, which for us were generated by the create-features function.
- :label: the class label for our classification, which in this case is correct for the utterance when are we getting there.
- :distributions: the probability distribution over the class labels, which in our case is quite uneven, suggesting a high degree of confidence.
The features are available since there are situations where you might want to do something with a feature after classification. More specifically, you could even generate features as data not used by the model, like an ID for an NER tag.
The probability distribution could be handy for cases where you don't want the first choice, would like to use the distribution itself as a feature in another model, or want to get an idea of how specific classifications perform.
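For instance, a minimal sketch of pulling the runner-up label out of the :distributions map returned above:

;; given a classification result map like the one shown earlier, return the
;; label with the second highest probability
(defn runner-up-label [result]
  (->> (:distributions result)
       (sort-by val >)
       second
       first))

;; with the distribution shown above this returns "expressive"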
Now we can create a function that's friendly to clients of our new library:
(defn classify-utterance [utterance]
  (->> (p/parse utterance)
       (exc/classify (model))
       :label))
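Using it at the REPL with the utterance from before:

(classify-utterance "when are we getting there")
;;=> "question"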
In our example we performed with a weighted F-measure of 0.96, which seems pretty unbelievable. Another way to confirm we have a good model is to divide the dataset into a training and test set. For this example, let's split it right down the middle and retrain:
user=> (adb/divide-by-set 0.5)
user=> (reset! cross-fold-instances-inst nil) ; invalidate the instances cache
user=> (with-model-conf (create-model-config)
         (->> (ec/create-model classifiers :set-best)
              ec/train-model
              ec/write-model))
user=> (reset! model-inst nil) ; invalidate the cached model
user=> (exc/print-model-info (model))
yields:
instances-total: 239.0
instances-correct: 225.0
instances-incorrct: 14.0
name: speech-act
create-time: Mon Jul 25 12:51:18 CDT 2016
accuracy: 94.14225941422595
wprecision: 0.9413853837630445
wrecall: 0.9414225941422594
wfmeasure: 0.9413703416300443
which is still very good and still hard to believe. However, now we have a better way to prove out the model: run it on the data we left out, which is the test data. We'll write code to invoke the model classifier on the test data:
(ns zensols.example.sa-model
  (:require [clojure.tools.logging :as log] ; for log/debugf below
            [clj-excel.core :as excel])
  (:require [zensols.actioncli.dynamic :refer (dyn-init-var) :as dyn]
            [zensols.actioncli.log4j2 :as lu]
            [zensols.actioncli.resource :as res]
            [zensols.util.spreadsheet :as ss]
            [zensols.model.execute-classifier :as exc :refer (with-model-conf)]
            [zensols.example.anon-db :as adb]))
(def preds-inst (atom nil))
(defn- test-annotation [anon-rec]
(let [{anon :instance label :class-label} anon-rec
sent (:text anon)
pred (classify-utterance sent)]
(log/debugf "label: %s, prediction: %s" label pred)
{:label label
:sent sent
:prediction pred
:correct? (= label pred)}))
(defn- predict-test-set []
(swap! preds-inst
#(or %
(let [anons (adb/anons :set-type :test)
results (map test-annotation anons)
preds (map :correct? results)]
{:correct (filter true? preds)
:incorrect (filter false? preds)
:predictions preds
:results results}))))
(defn- create-prediction-report []
(letfn [(data-sheet [anons]
(->> anons
(map (fn [anon]
[(:class-label anon) (->> anon :instance :text)]))
(cons ["Label" "Utterance"])))]
(let [out-file (res/resource-path :analysis-report "sa-predictions.xls")]
(-> (excel/build-workbook
(excel/workbook-hssf)
{"Predictions on test data"
(->> (predict-test-set)
:results
(map (fn [res]
(let [{:keys [label sent prediction correct?]} res]
[correct? label prediction sent])))
(cons ["Is Correct" "Gold Label" "Prediction" "Utterance"])
(ss/headerize))
"Training" (data-sheet (adb/anons))
"Test" (data-sheet (adb/anons :set-type :test))})
(ss/autosize-columns)
(excel/save out-file)))))
(create-prediction-report)
Invoking this code creates a report on the desktop with the test data and its predictions on the first sheet and the dataset by its set type (training and testing) on the next two tabs. We still correctly classify 230 of the 238, giving 96% accuracy.
There is an easier way to test and train our model using the clj-ml-dataset library, but first we have to make a few changes. The create-feature-sets function we wrote earlier needs to accept parameters that are passed through to the data set library so it can re-create the training and testing sets:
(defn create-feature-sets [& adb-keys]
(->> (apply adb/anons adb-keys)
(map #(merge {:sa (:class-label %)
:utterance (->> % :instance :text)}
(create-features (:instance %))))))
The adb-keys
are the keys that eventually get passed to the
instances function.
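For example, a hedged sketch of building feature sets for one split only; passing :set-type :train is an assumption that mirrors the (adb/anons :set-type :test) call used in the prediction code above:

;; build features for the training split only; the keys are passed straight
;; through to adb/anons
(create-feature-sets :set-type :train)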
In our evaluation code we need to create a new atom to cache the results of the testing and training instances:
(dyn-init-var *ns* 'test-train-instances-inst (atom nil))
This atom needs to be added to the model configuration. We also need to tell the framework how to repartition the training and testing data sets and clear the train/test atom that caches the instances:
(defn- create-model-config []
(letfn [(divide-by-set [divide-ratio]
(adb/divide-by-set divide-ratio :shuffle? false)
(reset! test-train-instances-inst nil))]
(merge (sf/create-model-config)
{:cross-fold-instances-inst cross-fold-instances-inst
:test-train-instances-inst test-train-instances-inst
:feature-sets-set (feature-sets-set)
:divide-by-set divide-by-set})))
The divide-by-set function defined above creates a new division of testing and training data, and in our case will incrementally move instances from the training data to the testing data. With :shuffle? false we do not shuffle the data set before making the new split, so we are effectively re-partitioning by moving the train/test data set demarcation line.
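For example, repartitioning by hand (outside the framework) uses the same two calls the :divide-by-set function wraps:

;; make 75% of the corpus training data while keeping corpus order (no
;; shuffle), then clear the cached train/test instances so they rebuild
(adb/divide-by-set 0.75 :shuffle? false)
(reset! test-train-instances-inst nil)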
Now we're ready to call the framework to train the classifier on the training instances and then test the trained classifier on the test instances:
(binding [cl/*rand-fn* (fn [] (java.util.Random. 1))]
(with-model-conf (create-model-config)
(->> (ec/train-test-series
[:j48] :set-best {:start 0.1 :stop 1 :step 0.05})
ec/write-csv-train-test-series)))
In this example, the cl/*rand-fn* binding tells the framework to use 1 as the seed so the ordering of the instances across the training/testing data is always the same, which means running the same tests (including cross validation) doesn't change our outcomes.
The ec/write-csv-train-test-series
writes the result outcomes to a CSV file,
which we can then use to find the elbow or point where we start to overfit
the model. The R
code to do this and the results are in the
example project repository. This code creates a graph of the F-measure as a function of the number of training instances.
In the graph we see that the F-measure is just below 0.4 at 48 training instances; it then balloons to above 0.9 at 72 instances, so the classifier (a J48 decision tree for this example) learns quickly. However, we see the first drop at 120 training instances (the red portion), which is the aforementioned elbow where we typically see the classifier start to overtrain.
To build from source, do the following:
- Install Leiningen (this is just a script)
- Install GNU make
- Install Git
- Download the source:
git clone https://github.com/clj-mkproj && cd clj-mkproj
- Download the make include files:
mkdir ../clj-zenbuild && wget -O - https://api.github.com/repos/plandes/clj-zenbuild/tarball | tar zxfv - -C ../clj-zenbuild --strip-components 1
- Build the distribution binaries:
make dist
Note that you can also build a single jar file with all the dependencies with: make uber
I am suspicious there is a bug with the two pass functionality as I've
recently worked with a data set that gave very different performance results
using a test/train split. While I'm not sure, I suspect the bug is somewhere
in the Clojure -> Weka -> Clojure flow. Currently, two pass works by
overloading the Instances
class and another that uses the overridden class,
and subsequently doesn't copy that class. This is invoked from the
zensols.model.weka/clone-instances
function that uses the model framework to
create separate train and test folds.
I've been over and over this code and can't find the bug. If there are any Weka wizards out there who have some time and can help out, I'd really appreciate it!
An extensive changelog is available here.
MIT License
Copyright (c) 2016 - 2018 Paul Landes
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.