Extended documentation can be found on the website: https://majkamichal.github.io/naivebayes/
The `naivebayes` package provides an efficient implementation of the popular Naïve Bayes classifier in R. It was developed, and is now maintained, based on three principles: it should be efficient, user friendly and written in Base R. The last principle implies no dependencies; it does not, however, come at the cost of efficiency, as many functions from the Base R distribution use highly efficient routines programmed in lower-level languages such as C or FORTRAN. In fact, the `naivebayes` package uses only such functions for resource-intensive calculations.
The general function `naive_bayes()` detects the class of each feature in the dataset and, depending on the user's choices, assumes a possibly different distribution for each feature (the underlying factorization is sketched after the list below). It currently supports the following class-conditional distributions:
- categorical distribution for discrete features
- Poisson distribution for non-negative integers
- Gaussian distribution for continuous features
- non-parametrically estimated densities via Kernel Density Estimation for continuous features
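For context, these per-feature distributions plug into the standard Naive Bayes factorization (the formula below is added for illustration and is not part of the original text): given the class prior $\mathbb{P}(C = c)$ and class-conditional densities or probability mass functions $f(x_i \mid c)$ chosen from the list above, the classifier uses the posterior

$$
\mathbb{P}(C = c \mid x_1, \ldots, x_p) \;\propto\; \mathbb{P}(C = c) \prod_{i=1}^{p} f(x_i \mid c).
$$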
In addition, specialized functions are available which implement:

- Bernoulli Naive Bayes via `bernoulli_naive_bayes()`
- Multinomial Naive Bayes via `multinomial_naive_bayes()`
- Poisson Naive Bayes via `poisson_naive_bayes()`
- Gaussian Naive Bayes via `gaussian_naive_bayes()`
- Non-Parametric Naive Bayes via `nonparametric_naive_bayes()`
They are implemented using linear algebra operations, which makes them efficient on dense matrices. They can also take advantage of sparse matrices to boost performance further, as the sketch below illustrates. A few helper functions are also provided to improve the user experience.
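For instance, a minimal sketch (not taken from the original documentation; it assumes the specialized functions accept sparse matrices of class "dgCMatrix" from the Matrix package):

library(naivebayes)
library(Matrix)
set.seed(1)
# Sparse count matrix with 50 rows and 4 "word" features
X_sparse <- Matrix(rpois(50 * 4, lambda = 1), ncol = 4, sparse = TRUE,
                   dimnames = list(NULL, paste0("word", 1:4)))
y <- factor(sample(c("classA", "classB"), 50, replace = TRUE))
# Multinomial Naive Bayes fitted directly on the sparse matrix
mnb <- multinomial_naive_bayes(x = X_sparse, y = y, laplace = 1)
head(predict(mnb, newdata = X_sparse, type = "prob"))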
The general `naive_bayes()` function is also available through the excellent `caret` package.
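For example (a brief sketch; it assumes `caret` is installed and uses caret's built-in model code `method = "naive_bayes"`, which wraps this package):

library(caret)
# 5-fold cross-validated Naive Bayes on the iris data via caret
fit <- train(Species ~ ., data = iris, method = "naive_bayes",
             trControl = trainControl(method = "cv", number = 5))
fit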
Just like many other R packages, `naivebayes` can be installed from the CRAN repository by simply executing the following line in the console:
install.packages("naivebayes")
# Or install the development version from GitHub:
devtools::install_github("majkamichal/naivebayes")
The `naivebayes` package provides a user-friendly implementation of the Naïve Bayes algorithm via a formula interface and via the classical combination of a matrix/data.frame containing the features and a vector with the class labels. All functions can recognize missing values, give an informative warning and, more importantly, they know how to handle them.
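A quick sketch of that behaviour (an illustrative example based on the statement above, not code from the original README):

library(naivebayes)
set.seed(1)
df_na <- data.frame(class = sample(c("a", "b"), 30, replace = TRUE),
                    x = rnorm(30))
df_na$x[1] <- NA
# The model is still fitted; a warning points at the missing value in 'x'
nb_na <- naive_bayes(class ~ ., data = df_na)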
In the following, the basic usage of the main function `naive_bayes()` is demonstrated. Examples with the specialized Naive Bayes classifiers can be found in the extended documentation: https://majkamichal.github.io/naivebayes/
library(naivebayes)
# Simulate example data
n <- 100
set.seed(1)
data <- data.frame(class = sample(c("classA", "classB"), n, TRUE),
                   bern = sample(LETTERS[1:2], n, TRUE),
                   cat = sample(letters[1:3], n, TRUE),
                   logical = sample(c(TRUE, FALSE), n, TRUE),
                   norm = rnorm(n),
                   count = rpois(n, lambda = c(5, 15)))
train <- data[1:95, ]
test <- data[96:100, -1]
nb <- naive_bayes(class ~ ., train)
summary(nb)
#>
#> ================================ Naive Bayes =================================
#>
#> - Call: naive_bayes.formula(formula = class ~ ., data = train)
#> - Laplace: 0
#> - Classes: 2
#> - Samples: 95
#> - Features: 5
#> - Conditional distributions:
#> - Bernoulli: 2
#> - Categorical: 1
#> - Gaussian: 2
#> - Prior probabilities:
#> - classA: 0.5263
#> - classB: 0.4737
#>
#> ------------------------------------------------------------------------------
# Classification
predict(nb, test, type = "class")
#> [1] classB classA classA classA classA
#> Levels: classA classB
nb %class% test
#> [1] classB classA classA classA classA
#> Levels: classA classB
# Posterior probabilities
predict(nb, test, type = "prob")
#> classA classB
#> [1,] 0.4998488 0.5001512
#> [2,] 0.5934597 0.4065403
#> [3,] 0.6492845 0.3507155
#> [4,] 0.5813621 0.4186379
#> [5,] 0.5087005 0.4912995
nb %prob% test
#> classA classB
#> [1,] 0.4998488 0.5001512
#> [2,] 0.5934597 0.4065403
#> [3,] 0.6492845 0.3507155
#> [4,] 0.5813621 0.4186379
#> [5,] 0.5087005 0.4912995
# Helper functions
tables(nb, 1)
#>
#> ------------------------------------------------------------------------------
#> ::: bern (Bernoulli)
#> ------------------------------------------------------------------------------
#>
#> bern classA classB
#> A 0.4400000 0.4888889
#> B 0.5600000 0.5111111
#>
#> ------------------------------------------------------------------------------
get_cond_dist(nb)
#> bern cat logical norm count
#> "Bernoulli" "Categorical" "Bernoulli" "Gaussian" "Gaussian"
# Note: all "numeric" (integer, double) variables are modelled
# with Gaussian distribution by default.
X <- train[-1]
class <- train$class
nb2 <- naive_bayes(x = X, y = class)
nb2 %prob% test
#> classA classB
#> [1,] 0.4998488 0.5001512
#> [2,] 0.5934597 0.4065403
#> [3,] 0.6492845 0.3507155
#> [4,] 0.5813621 0.4186379
#> [5,] 0.5087005 0.4912995
Kernel density estimation can be used to estimate the class-conditional densities of continuous features. It has to be explicitly requested via the parameter `usekernel = TRUE`; otherwise a Gaussian distribution is assumed. The estimation is performed with the built-in R function `density()`. By default, a Gaussian smoothing kernel and Silverman's rule of thumb as the bandwidth selector are used:
nb_kde <- naive_bayes(class ~ ., train, usekernel = TRUE)
summary(nb_kde)
#>
#> ================================ Naive Bayes =================================
#>
#> - Call: naive_bayes.formula(formula = class ~ ., data = train, usekernel = TRUE)
#> - Laplace: 0
#> - Classes: 2
#> - Samples: 95
#> - Features: 5
#> - Conditional distributions:
#> - Bernoulli: 2
#> - Categorical: 1
#> - KDE: 2
#> - Prior probabilities:
#> - classA: 0.5263
#> - classB: 0.4737
#>
#> ------------------------------------------------------------------------------
get_cond_dist(nb_kde)
#> bern cat logical norm count
#> "Bernoulli" "Categorical" "Bernoulli" "KDE" "KDE"
nb_kde %prob% test
#> classA classB
#> [1,] 0.6252811 0.3747189
#> [2,] 0.5441986 0.4558014
#> [3,] 0.6515139 0.3484861
#> [4,] 0.6661044 0.3338956
#> [5,] 0.6736159 0.3263841
# Class conditional densities
plot(nb_kde, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
# Marginal densities
plot(nb_kde, "norm", arg.num = list(legend.cex = 0.9), prob = "marginal")
In general, there are 7 different smoothing kernels available: `gaussian`, `epanechnikov`, `rectangular`, `triangular`, `biweight`, `cosine` and `optcosine`. They can be specified in `naive_bayes()` via the additional parameter `kernel`; the Gaussian kernel is the default. Please see `density()` and `bw.nrd()` for further details.
# Change Gaussian kernel to biweight kernel
nb_kde_biweight <- naive_bayes(class ~ ., train, usekernel = TRUE,
                               kernel = "biweight")
nb_kde_biweight %prob% test
#> classA classB
#> [1,] 0.6237152 0.3762848
#> [2,] 0.5588270 0.4411730
#> [3,] 0.6594737 0.3405263
#> [4,] 0.6650295 0.3349705
#> [5,] 0.6631951 0.3368049
plot(nb_kde_biweight, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
The `density()` function offers 5 different bandwidth selectors, which can be specified via the `bw` parameter:

- `nrd0` (Silverman's rule-of-thumb)
- `nrd` (variation of the rule-of-thumb)
- `ucv` (unbiased cross-validation)
- `bcv` (biased cross-validation)
- `SJ` (Sheather & Jones method)
nb_kde_SJ <- naive_bayes(class ~ ., train, usekernel = TRUE,
                         bw = "SJ")
nb_kde_SJ %prob% test
#> classA classB
#> [1,] 0.7279209 0.2720791
#> [2,] 0.4858273 0.5141727
#> [3,] 0.7004134 0.2995866
#> [4,] 0.7005704 0.2994296
#> [5,] 0.7089626 0.2910374
plot(nb_kde_SJ, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
The parameter `adjust` allows rescaling of the estimated bandwidth and thus introduces more flexibility into the estimation process. The default value of 1 means no rescaling; for values below 1 the density becomes “wigglier”, and for values above 1 it tends to be “smoother”:
nb_kde_adjust <- naive_bayes(class ~ ., train, usekernel = TRUE,
                             adjust = 0.5)
nb_kde_adjust %prob% test
#> classA classB
#> [1,] 0.6636171 0.3363829
#> [2,] 0.4784302 0.5215698
#> [3,] 0.6442293 0.3557707
#> [4,] 0.6745416 0.3254584
#> [5,] 0.7533994 0.2466006
plot(nb_kde_adjust, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
Class-conditional distributions of non-negative integer predictors can be modelled with the Poisson distribution. This can be achieved by setting `usepoisson = TRUE` in the `naive_bayes()` function and by making sure that the variables representing counts in the dataset are of class `integer`.
is.integer(train$count)
#> [1] TRUE
nb_pois <- naive_bayes(class ~ ., train, usepoisson = TRUE)
summary(nb_pois)
#>
#> ================================ Naive Bayes =================================
#>
#> - Call: naive_bayes.formula(formula = class ~ ., data = train, usepoisson = TRUE)
#> - Laplace: 0
#> - Classes: 2
#> - Samples: 95
#> - Features: 5
#> - Conditional distributions:
#> - Bernoulli: 2
#> - Categorical: 1
#> - Poisson: 1
#> - Gaussian: 1
#> - Prior probabilities:
#> - classA: 0.5263
#> - classB: 0.4737
#>
#> ------------------------------------------------------------------------------
get_cond_dist(nb_pois)
#> bern cat logical norm count
#> "Bernoulli" "Categorical" "Bernoulli" "Gaussian" "Poisson"
nb_pois %prob% test
#> classA classB
#> [1,] 0.4815380 0.5184620
#> [2,] 0.4192209 0.5807791
#> [3,] 0.6882270 0.3117730
#> [4,] 0.4794415 0.5205585
#> [5,] 0.5209152 0.4790848
# Class conditional distributions
plot(nb_pois, "count", prob = "conditional")
# Marginal distributions
plot(nb_pois, "count", prob = "marginal")
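Both options can also be combined in a single call (a brief sketch; that KDE is then used for double-valued features and Poisson for integer ones follows from the two sections above, but is an assumption rather than original README output):

# Sketch: integer features are modelled as Poisson, remaining numeric
# features via kernel density estimation
nb_mixed <- naive_bayes(class ~ ., train, usekernel = TRUE, usepoisson = TRUE)
get_cond_dist(nb_mixed)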