
Overview

Netlib is a machine-learning framework research project written almost entirely in Swift with a small amount of C code to interface with cuDNN, Cuda, and other legacy system libraries. The project goals are:

  • Simplify the model development process so engineers are able to successfully create and deploy models directly in their applications
  • Discover methods to directly leverage standard application libraries and development tools (Cocoa, Xcode, etc.)
  • Make it possible to create reusable learned functions that can be stored in a database and later used “as is” or aggregated to form more complex models.
  • Develop a faster execution model to reduce training times and speed up hypothesis testing
  • Create an architecture that can support all data types and allow them to be combined into mixed models (image, audio, motion, etc.)
  • Create valuable new IP, and identify opportunities for additional IP exploration

Swiftness

Netlib, the Swift language, and its tools have been developed concurrently. Several design decisions made along the way are not idiomatic Swift; they work around compiler crashes, LLDB crashes, run-time library crashes, unimplemented Foundation features on Linux, and performance problems. Most of these problems have been related to generics, protocols, and protocol extensions. Implementing objects as Swift classes and using inheritance provided a workaround for most of them.


Installation

The following are the Linux setup instructions.

  1. Install the latest Swift 5.0 development tools. The 1/13/19 master snapshot works; some other builds have compiler bugs.
  2. Install Cuda 10.0 for Ubuntu 16.04
  • First make sure your graphics card driver is up to date!
  • Then install
sudo dpkg -i cuda-repo-ubuntu1604-10-0-local-10.0.130-410.48_1.0-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda
  3. Install cuDNN 7.4
  • First visit the NVIDIA cuDNN download site and register
  • Download "cuDNN v7.4 Library for Linux" for Cuda 10.0
  • Then install
sudo tar -xzf cudnn-10.0-linux-x64-v7.4.1.5.tgz -C /usr/local
rm cudnn-10.0-linux-x64-v7.4.1.5.tgz
sudo ldconfig
  4. Install Netlib dependencies
sudo apt-get install git libzip-dev libpng-dev libjpeg-dev liblmdb-dev libblocksruntime0 curl
  5. Clone the repository
git clone https://github.com/ewconnell/Netlib.git
  6. Optionally install the CLion IDE with the Swift plugin. Netlib on Linux uses CMake to drive the Swift Package Manager build process, and CLion makes it easy to edit, build, and debug.

  7. Test the installation
    The trainXmlModel test program performs a training run and displays the results. The diagnosticExample performs 5 training runs and displays the mean training time. The release builds are much faster!

import Foundation
import Netlib

do {
  let fileName = "samples/mnist/mnistClassifierSolver"
  guard let path = Bundle.main.path(forResource: fileName, ofType: "xml") else {
    fatalError("Resource not found: \(fileName)")
  }

  let model = try Model(contentsOf: URL(fileURLWithPath: path))
  try model.train()

} catch {
  print(String(describing: error))
}

Building this project currently requires combining CMake/make with the Swift Package Manager build system. Eventually SPM should be able to do the whole job.

  • In this example we'll use ~/Documents/Netlib as the repository location
  • For a release build
cd ~/Documents/Netlib
mkdir release && cd release
cmake -DCMAKE_BUILD_TYPE=Release -G "CodeBlocks - Unix Makefiles" ..
make trainXmlModel -j12
  • For a debug build
cd ~/Documents/Netlib
mkdir debug && cd debug
cmake -DCMAKE_BUILD_TYPE=Debug -G "CodeBlocks - Unix Makefiles" ..
make trainXmlModel -j12

The Swift Package Manager puts the binaries in the ".build" directory. To run the release example

~/Documents/Netlib/.build/release/trainXmlModel

To run the debug example

~/Documents/Netlib/.build/debug/trainXmlModel
  • Note: the diagnosticExample performs 5 runs and displays the mean training time. The first run is excluded to remove the roughly 2 seconds Cuda needs to initialize. The first run is also excluded in the Python benchmark tests for Caffe2 and TensorFlow.

Initial Performance

Netlib, Caffe2, and TensorFlow all use NVIDIA Cuda and cuDNN to schedule work on the GPU. Caffe2 and TensorFlow are not as efficient as they could be, and require a significant amount of CPU time to schedule work on the GPU. As GPU performance increases, a significant GPU utilization gap forms due to framework overhead on the CPU.

Image and video applications have large feature vectors, but I believe there are also a vast number of applications that have relatively small feature vectors. In any case a more efficient framework will have lower host requirements and lower power consumption. This will become increasingly important as faster GPUs emerge. A more efficient framework will lessen the need to upgrade the host to keep pace.

Below is a performance comparison for MNIST training, with Netlib showing a 5X to 14X performance advantage.

Test system:

  • Ubuntu 16.04
  • Xeon 3.4 GHz, 6 cores
  • 32 GB RAM
  • NVIDIA Pascal Titan Xp
  • 512 GB M.2 SSD

All of the test cases are evaluating the same model architecture.

  • training samples: 60,000
  • validation test samples: 10,000
  • epochs: 10
  • training iterations/epoch: 1000
  • training batch size: 60
  • validation pass/epoch: 1
  • validation test batch size: 1000
Framework   Avg GPU utilization   Mean training time   GPU memory   Slower than Netlib
Swift4TF    15%                   86.2 sec             8 GB         13.7x
KerasTF     23%                   56.3 sec             all 12 GB    8.9x
Caffe2      45%                   31.5 sec             —            5.0x
Netlib      97%                   6.3 sec              640 MB       —

Times vary slightly from run to run. Netlib times range from 5.9 - 6.8 seconds.

Result Accuracy

There is an inconsistency with training accuracy. I suspect it's related to a variation in the SGD update method and needs to be investigated further.

Framework    Validation set accuracy
Caffe1       94.8%
Netlib       96.5%
Caffe2       98.9%
CNTK 2.2     98.9%
TensorFlow   98.9%
Keras/TF     98.9%

Language Choice

The Swift language was chosen for

  • Host portability
  • Compact language representation
  • High performance code generation
  • Interactive playground environment for scripting, similar to Python
  • Unlike Python or C++, it will allow models to be directly connected to UI code

Swift vs Python

Caffe2 and TensorFlow rely on Python to provide an environment for testing and experimentation. Python and all of the required support packages have disadvantages:

  • Bigger and slower than compiled code
  • Require the user to learn another language, non-standard app libraries, and tools
  • Require the user to install and configure many dependent packages
  • Cannot be directly integrated with the application

At the moment, Core ML addresses many of these problems by splitting apart training and deployment. This is a reasonable intermediate step until Core ML can offer its own training support. In the meantime, applications that could otherwise do online training locally are precluded.

Since Netlib is pure Swift, experimentation can be done using Swift playgrounds, and apps can be built that leverage all of the iOS/macOS libraries for visualization and custom UI. New insights and productivity can be achieved when first-rate UI applications are built that take full advantage of app libraries. I don't believe Python can ever effectively be used in this way.


Creating Models

Netlib models can be created and trained entirely in Swift code or through an XML definition. A model is a directed acyclic graph composed of computational elements. The Model and Function elements are containers that can be arbitrarily nested and aggregated using URL references.

Reusability

By allowing models to be defined through XML data, they can be created and run without any special tools. Additionally they can be stored in a database as separately trained reusable pieces and dynamically aggregated to form more complex models. A repository of reusable expertly designed functions (model fragments) could enable drag and drop creation of complex models.

Examples

Below is an XML example of an MNIST classifier function. Element inputs are implied by collection order if not otherwise explicitly specified. See Containment and Connections for more detail.

<?xml version="1.0" encoding="utf-8"?>
<Function>
  <items>
    <Softmax labelsInput=".labelsInput"/>
    <FullyConnected outputChannels="10"/>
    <Activation/>
    <FullyConnected outputChannels="500"/>
    <Pooling/>
    <Convolution filterSize="5" outputChannels="50"/>
    <Pooling/>
    <Convolution filterSize="5" outputChannels="20"/>
  </items>
</Function>

To load and train a model from XML

let fileName = "samples/mnist/mnistClassifierSolver"
guard let path = Bundle.main.path(forResource: fileName, ofType: "xml") else { exit(1) }

let model = try Model(contentsOf: URL(fileURLWithPath: path), logLevel: .status)
try model.train()

The same function can also easily be defined in code.

import Foundation
import Netlib

let mnistClassifier = Function {
  $0.name = "mnistClassifier"
  $0.labelsInput = "trainingDb.labels"
  $0.items.append([
    Softmax { $0.labelsInput = ".labelsInput" },
    FullyConnected { $0.outputChannels = 10 },
    Activation(),
    FullyConnected { $0.outputChannels = 500 },
    Pooling(),
    Convolution { $0.outputChannels = 50; $0.filterSize = [5] },
    Pooling(),
    Convolution { $0.outputChannels = 20; $0.filterSize = [5] },
  ])
}

Model Elements and Containers

Model elements have named inputs and outputs and are used to perform actions. Most elements are filters with a single input and output data connector. Source elements like Database have no inputs, but define data and labels outputs.

The ModelElementContainer class is the base for the Model and Function classes. It is used for namespace organization, the Defaults collection, and abstraction of inputs and outputs to aid reusability. A ModelElementContainer is conceptually used like any other element. During the setup process, connection names are resolved and elements are directly connected as a flat computational graph.

So the following two definitions resolve to the same flat graph:

<Function>
  <items>
    <Function>
      <items>
        <Softmax labelsInput=".labelsInput"/>
        <FullyConnected outputChannels="10"/>
        <Activation/>
        <FullyConnected outputChannels="500"/>
      </items>
    </Function>

    <Function>
      <items>
        <Pooling/>
        <Convolution filterSize="5" outputChannels="50"/>
        <Pooling/>
        <Convolution filterSize="5" outputChannels="20"/>
      </items>
    </Function>
  </items>
</Function>
<Function>
  <items>
    <Softmax labelsInput=".labelsInput"/>
    <FullyConnected outputChannels="10"/>
    <Activation/>
    <FullyConnected outputChannels="500"/>
    <Pooling/>
    <Convolution filterSize="5" outputChannels="50"/>
    <Pooling/>
    <Convolution filterSize="5" outputChannels="20"/>
  </items>
</Function>

Connections

Connections between elements can be:

  • Explicit - an input source is defined by elementName.connectorName. If the element name is omitted, then the element container is implied.
<Pooling input="conv1.data"/>
<Convolution name="conv1" filterSize="5" outputChannels="20"/>
  • Partially explicit - the input element is specified but the data connector is implied.
<Pooling input="conv1"/>
<Convolution name="conv1" filterSize="5" outputChannels="20"/>
  • Implied – if the input is not specified then the output of the next element is the implied source.
<Pooling/>
<Convolution filterSize="5" outputChannels="20"/>
  • Deferred – elements can connect to container connectors, which won’t be resolved until setup time. This allows a function to be reused. The Softmax labelsInput is connected to the Function (container) labelsInput connector.
<Function>
  <items>
    <Softmax labelsInput=".labelsInput"/>
    <FullyConnected outputChannels="10"/>
  </items>
</Function>

Elements are evaluated as a whole for forward and backward passes in order of dependency. Each forward pass has a unique ID associated with the request to protect against multiple evaluation of the same element. During setup an exception will be thrown if a dependency cycle is detected.

When an element is evaluated, all inputs are assured to be available and all outputs must be produced.
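As an illustration of this evaluation scheme, here is a minimal sketch (not Netlib's actual implementation; the GraphElement type and its member names are hypothetical) of how a per-pass ID prevents an element from being evaluated twice in the same forward pass:

// Sketch only: per-pass memoization during graph evaluation.
// GraphElement, inputs, and lastForwardPassId are hypothetical names.
final class GraphElement {
  var inputs = [GraphElement]()
  private var lastForwardPassId = -1

  func forward(passId: Int) {
    // Skip the work if this element was already evaluated in the current pass.
    guard passId != lastForwardPassId else { return }
    lastForwardPassId = passId

    // Evaluate dependencies first, in order of dependency.
    for input in inputs { input.forward(passId: passId) }

    // ... perform this element's computation and produce its outputs here ...
  }
}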

Parallel execution of independent branches is possible, but so far there hasn’t been a performance advantage due to thread scheduling overhead. DispatchQueue.concurrentPerform has disappointing performance.


Templates

Complex models are frequently pipelines of repeated clusters of elements, perhaps with slightly varied properties, so their representation can be greatly simplified through the use of templates. The Caffe prototxt for the VGG16 net is about 500 lines, which obscures any clear view of the model's architecture. The following Netlib VGG16 XML example shows a compact, easy-to-understand model definition in about 45 lines. It is extremely easy to rearrange and modify parameters.

<?xml version="1.0" encoding="utf-8"?>
<Function name="vgg16">
  <templates>
    <Function name="layer_12">
      <items>
	<Pooling/>
	<Convolution/>
	<Convolution/>
      </items>
    </Function>

    <Function name="layer_345">
      <items>
	<Pooling/>
	<Convolution/>
	<Convolution/>
	<Convolution/>
      </items>
    </Function>

    <Function name="layer_67">
      <items>
	<Dropout/>
	<Activation/>
	<FullyConnected outputChannels="4096"/>
      </items>
    </Function>
  </templates>

  <items>
    <Function defaultValues="Convolution.pad: 1, Convolution.activationMode: relu">
      <items>
	<Softmax labelsInput=".labelsInput"/>
	<FullyConnected outputChannels="1000"/>
	<Function template="layer_67"/>
	<Function template="layer_67"/>
	<Function template="layer_345" defaultValues=".outputChannels: 512"/>
	<Function template="layer_345" defaultValues=".outputChannels: 512"/>
	<Function template="layer_345" defaultValues=".outputChannels: 256"/>
	<Function template="layer_12"  defaultValues=".outputChannels: 128"/>
	<Function template="layer_12"  defaultValues=".outputChannels: 64"/>
      </items>
    </Function>
  </items>
</Function>

A Function element is used to contain a collection of child computational elements. It can also be used as a template to create multiple instances of a function.

Templates may be defined in a container’s templates collection, in the items collection, or externally through a URL. Templates can greatly simplify model understanding and design through this compact representation.

During the setup phase, all template names are resolved, and object instances are created and initialized by copying the properties of the source template object. Only properties that are not explicitly set on the template reference will be copied from the source. Templates may also be chained.
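To make the copy rule concrete, here is a minimal sketch under the assumption that each property object knows whether it was explicitly set (TemplateProperty and apply(template:to:) are hypothetical names, not Netlib's API):

// Sketch only: initialize an instance from a template, preserving any
// values the template reference set explicitly. Hypothetical types.
final class TemplateProperty {
  var value: String
  var isExplicitlySet = false
  init(_ value: String) { self.value = value }
}

func apply(template: [String: TemplateProperty],
           to instance: [String: TemplateProperty]) {
  for (name, source) in template {
    // Only copy properties the reference has not explicitly set itself.
    if let target = instance[name], !target.isExplicitlySet {
      target.value = source.value
    }
  }
}

Chained templates can be resolved by applying this repeatedly, from the most derived reference back to its source.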

In the following example the above function is incorporated into a larger model that contains a Solver, default properties, and Databases for training. The templateUri property on the vgg16Classifier function is used to specify the external template definition.

The Test function also demonstrates declaring a Function as a template of another function, which is an item higher up in the namespace tree.

<Function template="vgg16Classifier" labelsInput="validationDb.labels"/>

This is a template instance chaining to the template instance of the external classifier function.


Events

The Event generic can be used to accept subscribers and notify them when an event occurs. Currently the AnyProperty protocol defines the changed event. The Event generic allows the developer to easily define and use new events wherever needed.
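A minimal sketch of what such a generic might look like (the member names subscribe and raise are hypothetical; Netlib's actual Event API may differ):

// Sketch only: a generic multicast event with subscribers.
final class Event<Args> {
  private var subscribers = [(Args) -> Void]()

  // Register a handler to be called whenever the event is raised.
  func subscribe(_ handler: @escaping (Args) -> Void) {
    subscribers.append(handler)
  }

  // Notify all subscribers with the given arguments.
  func raise(_ args: Args) {
    for handler in subscribers { handler(args) }
  }
}

// Usage: a "changed" event carrying the name of the property that changed.
let changed = Event<String>()
changed.subscribe { name in print("property \(name) changed") }
changed.raise("mode")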


Properties

Models are directed acyclic graphs of computational elements. Each element optionally defines a set of properties for configuration and possibly to reflect current state.

At the time, Swift key-value coding was not part of the language, so I had to create my own dynamic property model. A Properties object implements a dictionary of typed property objects conforming to the AnyProperty protocol, which maintain a version, provide string conversion, and expose the changed event. Below is the definition of the Activation element's properties.

public final class Activation : ComputableFilterBase, ActivationProperties, InitHelper {
  //----------------------------------------------------------------------------
  // properties
  public var mode = ActivationMode.relu         { didSet{onSet("mode")} }
  public var nan = NanPropagation.propagate	{ didSet{onSet("nan")} }
  public var reluCeiling = 0.0	                { didSet{onSet("reluCeiling")} }

  //----------------------------------------------------------------------------
  // addAccessors
  public override func addAccessors() {
    super.addAccessors()
    addAccessor(name: "mode",
                get: { [unowned self] in self.mode },
                set: { [unowned self] in self.mode = $0 })
    addAccessor(name: "nan",
                get: { [unowned self] in self.nan },
                set: { [unowned self] in self.nan = $0 })
    addAccessor(name: "reluCeiling",
                get: { [unowned self] in self.reluCeiling },
                set: { [unowned self] in self.reluCeiling = $0 })
  }
}

Properties are defined as normal Swift properties. Getting the value of a property incurs zero overhead. Setting the value of a property calls the onSet function, which performs a single dictionary lookup and increments the model and property version numbers. Setting properties has very low overhead and is typically only done during model loading and configuration.
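Conceptually, onSet does something like the following sketch (PropertyRecord and the local modelVersion are simplified, hypothetical stand-ins for Netlib's property dictionary and model-wide version counter):

// Sketch only: one dictionary lookup, bump the model version,
// and stamp the property with that version.
final class PropertyRecord {
  var version = 0
}

final class ElementSketch {
  var modelVersion = 0
  var properties = ["mode": PropertyRecord(), "reluCeiling": PropertyRecord()]

  func onSet(_ name: String) {
    modelVersion += 1
    properties[name]?.version = modelVersion
  }
}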

Note: at some point onSet should not need to specify the property name as a string. The compiler #function macro can be incorporated into the function signature as a default value. It is not done now, because use of #function causes a memory leak in the Swift libraries somewhere. A bug has been filed.

Accessors

An accessor must be defined for each property to enable automated XML and JSON serialization. Adding the accessor adds an entry to the object's property dictionary. This establishes a namespace hierarchy and exposes a dynamic property model that can interface directly with UI elements. At the time, the Swift Mirror API only allowed getting but not setting properties, so it wasn't a viable option. I believe Swift 4.0 is adding KV coding back in, but I haven't yet investigated whether it is viable.

Serialization

The Model class implements the XmlConvertible and JsonConvertible protocols. XML is intended for the storage of model definitions, and JSON is intended for incremental update synchronization with remote models.

See Model and Property Versioning for serialization examples

Defaults

ModelElementContainers implement an optional Defaults collection of Default objects. A Default object defines a mapping between a "class.propertyPath" and a value or object. If the class is not specified, then it can match any class with the specified property.

When a model element is attached to a model, its namespace context is set. Properties on the element that have not been explicitly set (i.e. are still at their defaults) are looked up from that point toward the root. If a Defaults collection is found, it is searched for a matching class.propertyPath name; if a match is found, the associated value is assigned to the element property. This allows the user to easily configure a model, even from an outer scope external to a reusable component.
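A rough sketch of that lookup walk, using hypothetical container and default types (this is not Netlib's code, just the search order described above; preferring a class-qualified match over a class-agnostic one within the same container is an assumption of the sketch):

// Sketch only: resolve a default by walking from the element's container
// toward the root.
final class ContainerSketch {
  var parent: ContainerSketch?
  // Keys are "Class.propertyPath", or ".propertyPath" for class-agnostic defaults.
  var defaults = [String: String]()

  func lookupDefault(className: String, propertyPath: String) -> String? {
    var scope: ContainerSketch? = self
    while let container = scope {
      if let value = container.defaults["\(className).\(propertyPath)"]
                  ?? container.defaults[".\(propertyPath)"] {
        return value
      }
      scope = container.parent
    }
    return nil
  }
}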

The first three examples do not specify a class name, so they will match any class that has a matching property path. The last two will only match the Database and ImageCodec class properties.

<Function>
  <defaults>
    <Default property=".cacheDir" value="~/Documents/data/cache/mnist/"/>
    <Default property=".dataDir" value="~/Documents/data/mnist/"/>
    <Default property=".weights.fillMethod" value="xavier"/>
    <Default property="Database.dataType" value="real16F"/>
    <Default property="ImageCodec.format"
             object="{ImageFormat encoding='png' channelFormat='gray'/}"/>
  </defaults>
</Function>

The ModelElementContainer.defaultValues property is a dictionary that can be set as a convenience. It can only be used for default values, not objects. For example:

<Function defaultValues="Convolution.pad: 1, Convolution.activationMode: relu">
  <items>
    <Convolution/>
  </items>
</Function>

Is equivalent to

<Function>
  <defaults>
    <Default property="Convolution.pad" value="1"/>
    <Default property="Convolution.activationMode" value="relu"/>
  </defaults>
  <items>
    <Convolution/>
  </items>
</Function>

This feature is demonstrated in the VGG16 XML example above.


Remote Models

I was interested in exploring the concept of loading or creating a model on one device and transparently running it remotely while keeping the instances synchronized.

Heavy-duty training is primarily performed on Linux servers with multiple GPUs. Linux doesn't support the rich Cocoa UI frameworks, so developing sophisticated UI for AI research isn't really an option there. Today, most model visualization is done via Python popup graphs, which tend to be simplistic and not as insightful as they could be.

I wanted to explore the idea of writing an iPad or MacOS app for the data professional, which can load and run a model locally or remotely in the cloud depending on load requirements. Sophisticated “easy to use” apps can be created to enable a much wider audience for developing and deploying ML models.

Another scenario might involve the Apple Watch or IoT devices running apps that appear local, but run in the cloud or on another more powerful device.


Model and Property Versioning

Model and property versioning are required to enable multiple synchronized instances of a model, which may be distributed and casually connected.

Whenever a property is set, the onSet helper is called which increments the model’s version number and stamps the property with that version. The Model class implements the JsonConvertible protocol, which provides methods for incremental serialization and updates. When serializing, the caller specifies that they want all changes after a specified model version.

func asJson(after modelVersion: Int, include: SelectAnyOptions,
            options: JSONSerialization.WritingOptions) throws -> String

To apply incremental changes to a model, the update from string or stream functions can be used.

func update(fromJson string: String) throws
func update(fromJson stream: InputStream) throws

Multiple master example

The following is a simple example of how two model instances can be synchronized using incremental property updates. Additional logic would be needed to properly manage conflict resolution in a multi-user shared model scenario.

import Foundation
import Netlib

do {
  // create master version
  let master = Model {
    $0.name = "my_model"
    $0.items.append([
      Function { $0.items.append([Pooling()]) },
      Convolution {
        $0.bias.uri = Uri(string: "bias.dat")
        $0.bias.uriDataExtent = [1, 1, 800, 500]
      }
    ])
  }
	
  let client = Model()
  let jsonMaster = try master.asJson(after: client.version)
  try client.update(fromJson: jsonMaster)
	
  var jsonClient = try client.asJson(options: .prettyPrinted)
  print(jsonClient)
	
  client.items.append(Convolution())
  jsonClient = try client.asJson(after: master.version)
  try master.update(fromJson: jsonClient)

} catch {
  print(String(describing: error))
}

Casually Connected Client

A client application could use a web service to load and execute a long-running model in the cloud. At launch time the client and server model version numbers are identical. As the server instance runs, properties are periodically updated to reflect progress, which causes the server model version number to be incremented. At any point in the future, any number of client applications can query the server instance for property changes since the last time they were updated, which is defined by their local model version number.
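A minimal sketch of that pull-based synchronization, reusing the asJson(after:) and update(fromJson:) calls shown earlier (the synchronize function is illustrative, and in a real deployment the JSON would travel over a web service rather than between two in-process instances):

import Foundation
import Netlib

// Sketch only: the client pulls every property change newer than the
// model version it last saw, then applies it locally.
func synchronize(client: Model, with server: Model) throws {
  let changes = try server.asJson(after: client.version)
  try client.update(fromJson: changes)
}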


Databases

Training and validation databases are created using the Database element. The Database element is responsible for building a database by drawing data from a DataSource and using a DbProvider to operate on the database. A DataSource is used only once during the database build process. Subsequently all data requests from a consumer element are drawn from the database by the provider.

Database Diagram

The build process will perform any desired format and encoding transformations, along with optional verification for uniform size or number of channels.

The Database element has two modes of operation: streamed or cached output.

Streamed Output

If Database.streamOutput is set to true (default is false), then each batch of data is:

  • drawn from the database
  • decoded and type converted in parallel on the host
  • transported across the PCI bus to the GPU
  • a DataView to the selected data is returned

This uses the minimum amount of GPU memory, but requires each epoch to refetch, decode, and transport the same data over and over.

Cached Output

The default Database output mode is cached. If sufficient GPU memory is available, the entire database is decoded, type converted, and transported across the PCI bus to the GPU(s) only once. All subsequent data selections return a read only sub view of the data in memory. This is effectively returning a read only pointer with zero computational overhead. If memory is available, this mode is much faster.

Output Data Type

The Database.dataType property (default real32F) can be set to specify the output sample type. Specifying real16F will cut all of the buffer sizes in half if the extra precision is not needed.


Database Providers

Database providers implement the DbProvider protocol and are an abstraction that enables plugin database implementations. The currently implemented provider class is LmdbProvider, which implements the public protocols:

  • DbProvider
  • DbSession
  • DbData
  • DbDataCursor
  • DbTransaction
  • DbDataTransaction
  • DbDataTable

Data Sources

Data sources implement the DataSource protocol. A data source is responsible for providing a read only random access view of the data items they describe. Currently implemented data sources are:

  • FileList
  • Mnist
  • TinyImageNet

Data sources are responsible for downloading, unzipping, and parsing source archives, as is the case for Mnist and TinyImageNet.

Data Containers

The DataSource.getItem(at:) function returns a ModelLabeledContainer that either contains or references the associated item data. Data containers have functions for encoding and decoding the data if needed.

public protocol ModelDataContainer : ModelObject, BinarySerializable {
  var codec: Codec? { get set }
  var codecType: CodecType? { get set }
  var extent: [Int]? { get set }
  var dataLayout: DataLayout { get set }
  var dataType: DataType { get set }
  var colMajor: Bool { get set }
  var shape: Shape { get }
  var sourceIndex: Int { get set }
  var transform: Transform? { get set }
  var uri: Uri? { get set }
  var uriString: String? { get set }
  var value: [Double]? { get set }

  func decode(data: inout DataView) throws
  func encode(completion: EncodedHandler) throws
  func encodedShape() throws -> Shape
}

Codecs

Data containers have an associated Codec that transforms the contained data into the desired form for database storage or use during training. The codec type is set by the DataSource based on the type of data. Codec settings can be easily specified in a Defaults collection. For example:

<?xml version="1.0" encoding="utf-8"?>
<Solver>
  <defaults>
    <Default property="ImageCodec.format" 
             object="{ImageFormat encoding='png' channelFormat='gray'/}"/>
  </defaults>
  <items>
    <Function name="mnistClassifier" 
              templateUri="{Uri string='mnistClassifier.xml'/}"
              labelsInput="trainingDb.labels"/>
    <Database name="trainingDb" connection="trainingDb"
              source="{Mnist dataSet='training'/}"/>
  </items>
</Solver>

Data Selection

To retrieve data from the Database element directly, the forward function is called with a Selection. For example:

do {
  let fileName = "samples/mnist/mnistClassifierSolver"
  guard let path = Bundle.main.path(forResource: fileName, ofType: "xml") else {
    fatalError("Resource not found: \(fileName)")
  }
  let model = try Model(contentsOf: URL(fileURLWithPath: path))
  try model.setup()
  let db = model.find(type: Database.self, name: "trainingDb")!
  db.streamOutput = true
  _ = try db.forward(mode: .inference, selection: Selection(count: 5))

  var labels = DataView()
  try db.copyOutput(connectorName: "labels", to: &labels)
  print(labels.format())
} catch {
  print(String(describing: error))
}

The copyOutput function is needed to gather distributed data in the multi-GPU case.


Compute Service

Netlib defines a compute service abstraction, which encapsulates functionality to allow run time selection of the compute service API (cpu, cuda, metal, etc.) and hardware device to simplify portability.

ComputePlatform

The ComputePlatform class is used to select a compute service (cpu, cuda, metal, etc.) and hardware device, and to specify a default device. The ComputePlatform class is also used to detect and load compute service plugins implemented in separate bundles. The Foundation library on Linux doesn't support loadable bundles yet.

ComputeService

A compute service implements the ComputeService protocol and is used to enumerate available devices.

ComputeDevice

A compute device implements the ComputeDevice protocol and is used to query device attributes, and create resources such as device arrays and streams.

DeviceStream

A device stream is an abstraction for an asynchronous command queue that implements the DeviceStream protocol. It is used to schedule and synchronize computations. The protocol function implementations are service API specific and optimized.

DeviceArray

A device array implements the DeviceArray protocol and is an abstraction for a contiguous array of bytes on a device.

StreamEvent

A stream event implements the StreamEvent protocol and is used to synchronize device streams.
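The sketch below shows how these abstractions relate to one another. Only the protocol names come from Netlib; every member shown here is hypothetical and chosen just to illustrate the hierarchy:

// Sketch only: hypothetical members illustrating the abstraction hierarchy.
protocol StreamEventSketch {}

protocol DeviceStreamSketch {
  func createEvent() throws -> StreamEventSketch
  func wait(for event: StreamEventSketch) throws
}

protocol DeviceArraySketch {
  var byteCount: Int { get }
}

protocol ComputeDeviceSketch {
  var name: String { get }
  func createArray(byteCount: Int) throws -> DeviceArraySketch
  func createStream() throws -> DeviceStreamSketch
}

protocol ComputeServiceSketch {
  var name: String { get }              // e.g. "cpu", "cuda"
  var devices: [ComputeDeviceSketch] { get }
}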

Computable

A computable represents a high level operation provided by a service API, which might involve a significant number of function calls to setup and execute. They are used to encapsulate function groups like those found in cuDNN. The CudaComputeService exposes computables for most of the cuDNN operations. Currently implemented computables:

  • Activation
  • Convolution
  • Dropout
  • FullyConnected
  • LrnCrossChannel
  • Pooling
  • Softmax
  • BatchNormalize (needs testing)

Compatibility Considerations

The current design anticipated that the underlying parallel API would be either synchronous (CPU) or an asynchronous stream-oriented API like Cuda, OpenCL, or Metal. Core ML relies on a blend of Metal and other libraries. I haven't taken the time yet to consider how best to interact with Core ML. Currently the only compute service implemented is CudaComputeService.

Multi GPU Support

Netlib is designed to use multiple GPUs. There is a fair amount of code in place, but this feature is still in progress.


Data

Netlib has a flexible model for creating and accessing data arrays.

DataType

Frameworks based on C++ use templates to statically compile a pipeline of functions for specific sample data types such as Float or Double. As the number of data types increases, this can lead to template code bloat.

Netlib uses dynamic typing for all data arrays. All elements (e.g. Convolution, FullyConnected, etc.) are configured during the setup phase with the types needed to process the data type of their inputs. Once set up, the configuration of the computational graph remains constant. If properties are changed, then setup will be run again to reconfigure the graph appropriately.

The currently defined data types are

public enum DataType : Int, AnyConvertible, BinarySerializable {
  case real8U, real16U, real16I, real32I, real16F, real32F, real64F
}

Precompiled C++ templates should be faster, but the performance benchmark listed earlier makes it clear that using dynamic data types does not create significant overhead.
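To illustrate how dynamic typing can avoid per-element template instantiation, here is a sketch that selects a typed kernel once at setup time based on the DataType enum above, then reuses it for every evaluation (the closure-based dispatch is illustrative, not Netlib's actual mechanism):

// Sketch only: choose a typed implementation once, based on the input DataType.
func makeScaleKernel(for type: DataType, factor: Double)
  -> (UnsafeMutableRawPointer, Int) -> Void {
  switch type {
  case .real32F:
    return { buffer, count in
      let p = buffer.assumingMemoryBound(to: Float.self)
      for i in 0..<count { p[i] *= Float(factor) }
    }
  case .real64F:
    return { buffer, count in
      let p = buffer.assumingMemoryBound(to: Double.self)
      for i in 0..<count { p[i] *= factor }
    }
  default:
    fatalError("type not handled in this sketch")
  }
}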

Data Shape

A data Shape is a struct describing the arrangement and interpretation of a DataArray.
Its attributes are:

  • extent – the N dimensional extent of the space
  • layout – the layout of memory channels (vector, matrix, nhwc, nchw, nchw_vector_c)
  • channelFormat - any, gray, grayAlpha, rgb, rgba. This can easily be extended to cover other types such as audio. The type “any” is used for data that has no formal structure.
  • strides – the strides for each extent. If not specified, it is assumed the data is packed row major and the strides are calculated for the caller
  • colMajor – used to specify if the data is arranged in column major format to simplify interchange. The default is row major.
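For packed row-major data, the strides follow directly from the extent: each stride is the product of the extents that come after it. A small sketch (ShapeSketch is a hypothetical type, not Netlib's Shape):

// Sketch only: computing packed row-major strides from an extent.
struct ShapeSketch {
  let extent: [Int]
  let strides: [Int]

  init(extent: [Int]) {
    self.extent = extent
    var strides = [Int](repeating: 1, count: extent.count)
    for i in stride(from: extent.count - 2, through: 0, by: -1) {
      strides[i] = strides[i + 1] * extent[i + 1]
    }
    self.strides = strides
  }
}

// For an NCHW extent of [64, 1, 28, 28] the packed strides are [784, 784, 28, 1].
let shape = ShapeSketch(extent: [64, 1, 28, 28])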

DataArray

A DataArray is a class object that represents a contiguous fixed size typed linear array. No data space is actually allocated until the first access is made.

DataArray Diagram

DataView

A DataView is a struct that presents a shaped view of an associated DataArray, along with a variety of access functions. Creation of a DataView will automatically create a DataArray if one is not specified.

Sub Views and Multi-threading

DataView has methods to create sub views of an array. An important use of this is to divide a DataArray into multiple sub-regions and operate on them in parallel. For example: when a batch of 64 image items is selected from the Database, they are all decoded and type converted on independent threads making full use of all cores on the host CPU. Operating on sub-regions in parallel on the GPU works the same way. Synchronization between sub views is managed by the caller.

DataArray Diagram

Data views manage both shared and copy-on-write semantics. Multiple concurrent views can be created to support multi-threaded access. The caller manages synchronization between views.
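The following sketch shows the general idea of operating on non-overlapping sub-regions of one buffer in parallel; it uses a raw buffer and DispatchQueue.concurrentPerform purely for illustration, not Netlib's DataView sub-view API:

import Dispatch

// Sketch only: each iteration touches only its own item's sub-range, so no
// locking is needed; synchronization between regions is the caller's job.
func fillItemsConcurrently(buffer: UnsafeMutableBufferPointer<Float>,
                           itemCount: Int) {
  let itemSize = buffer.count / itemCount
  DispatchQueue.concurrentPerform(iterations: itemCount) { item in
    let start = item * itemSize
    for i in start..<(start + itemSize) {
      buffer[i] = Float(item)   // stand-in for per-item decode/type-conversion work
    }
  }
}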

Examining Arrays

The DataView object has a format function, which can be used to print the contents of a DataArray.

public func format(columnWidth: Int? = nil,
                   precision: Int? = nil,
                   maxItems: Int = Int.max,
                   maxCols: Int = 10,
                   highlightThreshold: Float) -> String

To print only the first 2 items, and then all items

var data = DataView(extent: [5, 1, 4, 3])
try cpuFillUniform(data: &data, range: 0...1)
// print 2
print(data.format(maxItems: 2))

// print all
print(data.format())

Example output for the first 2 items

DataView extent [5, 1, 4, 3]
   item [0] ======================================
channel [0] -------------------
[0]  0.449824 0.199999 0.927697
[1]  0.623079 0.795340 0.570338
[2]  0.913603 0.702925 0.454706
[3]  0.311121 0.693442 0.554390

   item [1] ======================================
channel [0] -------------------
[0]  0.461374 0.064473 0.186158
[1]  0.690513 0.996623 0.811923
[2]  0.319765 0.532537 0.697510
[3]  0.497443 0.396686 0.654059

Example output for an MNIST item, the number 3. Setting the highlightThreshold parameter to 0 will make all of the 1's show up blue in the debugger output, which makes them much easier to see. Markdown doesn't support colored text here.

DataView extent [64, 1, 28, 28]
   item [0] ======================================
channel [0] -------------------
 [0]  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [1]  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [2]  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [3]  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [4]  0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
 [5]  0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
 [6]  0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
 [7]  0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
 [8]  0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0
 [9]  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
[10]  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0
[11]  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0
[12]  0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
[13]  0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
[14]  0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
[15]  0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 1 1 1 1 0 0 0 0 0
[16]  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0
[17]  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
[18]  0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0
[19]  0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0
[20]  0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
[21]  0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
[22]  0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
[23]  0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[24]  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[25]  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[26]  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[27]  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Data Access

There are a variety of convenience methods on the DataView to obtain access to the associated data. There are explicitly typed functions and generics, such as:

public func roReal32F() throws -> UnsafeBufferPointer<Float>
public func ro<T: AnyNumber>(type: T.Type) throws -> UnsafeBufferPointer<T>

Some return a raw pointer to simplify access to C accelerator APIs such as Cuda.

public mutating func rw(using stream: DeviceStream) throws -> UnsafeMutableRawPointer

A DeviceStream is an abstraction for an asynchronous instruction queue. Optionally a device stream can be specified to make the data available on the associated device. If no stream is specified, the host is assumed.
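A short usage sketch of host access with the accessors above (assuming the default real32F data type, so the explicitly typed Float accessor applies):

import Foundation
import Netlib

do {
  var data = DataView(extent: [2, 1, 4, 3])
  try cpuFillUniform(data: &data, range: 0...1)

  // Read-only host access using the explicitly typed accessor.
  let values = try data.roReal32F()
  print("sum of \(values.count) elements = \(values.reduce(0, +))")
} catch {
  print(String(describing: error))
}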

Discrete Memory Replication

When using Unified memory access (UMA) devices, the DataArray class maintains a single shared version of a DeviceArray. However when using discrete memory devices, data must be copied between devices to maintain a coherent user view of the data.

When the caller takes a pointer, the DataArray object automatically allocates a DeviceArray if needed on the device associated with the specified DeviceStream.

The DataArray maintains a replication list of DeviceArray objects and version numbers for each. When a read-write pointer is taken, that device array becomes the new master. When any pointer is taken, if the target device array’s version does not match the master version, data is automatically copied so the target and master are synchronized.

No copying is required if multiple accesses in a row are made to the same DeviceArray on the same stream device. Since memory is not allocated until the first access, data is only allocated where it is used. So temporary GPU device arrays will not cause any memory to be shadowed on the host or other devices.

GPUs with paging hardware can be treated as UMA devices; however, performance tests so far indicate that it is still faster to manage memory through explicit asynchronous copy operations.
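The bookkeeping described above can be sketched as follows (all names are hypothetical; the real DataArray also performs the actual device-to-device copies and stream synchronization):

// Sketch only: one replica per device, each stamped with the version it last
// synchronized to. The replica holding the highest version is the master.
final class ReplicaSketch {
  var version = 0
  // ... the device buffer itself would live here ...
}

final class ReplicatedArraySketch {
  private var replicas = [Int: ReplicaSketch]()   // keyed by device id
  private var masterVersion = 0

  // Taking a read-write pointer makes this device's replica the new master.
  func readWrite(on device: Int) -> ReplicaSketch {
    let replica = synchronized(device: device)
    masterVersion += 1
    replica.version = masterVersion
    return replica
  }

  // Taking a read-only pointer just synchronizes the replica if it is stale.
  func readOnly(on device: Int) -> ReplicaSketch {
    return synchronized(device: device)
  }

  private func synchronized(device: Int) -> ReplicaSketch {
    let replica = replicas[device] ?? ReplicaSketch()
    replicas[device] = replica
    if replica.version != masterVersion {
      // ... copy the master's contents to this replica here ...
      replica.version = masterVersion
    }
    return replica
  }
}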


Logging and Diagnostics

The Log class is used to record diagnostic information during setup and model execution. The user can set the LogLevel to obtain the desired amount of information.

public enum LogLevel : Int, AnyConvertible, Comparable {
  case error, warning, status, diagnostic
}

To set the model's log level

let model = Model()
model.log.logLevel = .status

Diagnostic messages specify their LogCategories when reporting. If they match the currently selected log categories, the message is written to the output. The current diagnostic categories are:

public struct LogCategories: OptionSet, AnyConvertible {
  public let rawValue: Int
  public static let connections       = LogCategories(rawValue: 1 << 0)
  public static let dataAlloc         = LogCategories(rawValue: 1 << 1)
  public static let dataCopy          = LogCategories(rawValue: 1 << 2)
  public static let dataMutation      = LogCategories(rawValue: 1 << 3)
  public static let defaultsLookup    = LogCategories(rawValue: 1 << 4)
  public static let evaluate          = LogCategories(rawValue: 1 << 5)
  public static let setup             = LogCategories(rawValue: 1 << 7)
  public static let setupBackward     = LogCategories(rawValue: 1 << 8)
  public static let setupForward      = LogCategories(rawValue: 1 << 9)
  public static let streamAlloc       = LogCategories(rawValue: 1 << 10)
  public static let streamSync        = LogCategories(rawValue: 1 << 11)
  public static let context           = LogCategories(rawValue: 1 << 12)
  public static let tryDefaultsLookup = LogCategories(rawValue: 1 << 13)
  public static let download          = LogCategories(rawValue: 1 << 14)
}

Data Diagnostic Example

If you want to watch data copies and mutations during execution, then set the logging to

let model = Model {
  $0.log.logLevel = .diagnostic
  $0.log.categories = [.dataCopy, .dataMutation]
}

The output contains detailed information such as object tracking id, source and destination device/stream ids, and the number of bytes being copied.

status    :   default device: TITAN Xp   id: cuda.0
status    :   ******************************************************************
status    :   Begin training: mnistClassifier
diagnostic: [COPY   ] model.items.solver(0).items.trainingDb(398) host ---> d0_s0 elements: 47065088
diagnostic: [COPY   ] model.items.solver(0).items.trainingDb(399) host ---> d0_s0 elements: 60032
diagnostic: [COPY   ] model.items.solver(0).items.mnistClassifier.items.convolution(7).weights(300566) host ---> d0_s0 elements: 500
diagnostic: [COPY   ] model.items.solver(0).items.mnistClassifier.items.convolution(5).weights(300596) host ---> d0_s0 elements: 25000
diagnostic: [COPY   ] model.items.solver(0).items.mnistClassifier.items.fullyconnected(3).weights(300626) host ---> d0_s0 elements: 400000
diagnostic: [COPY   ] model.items.solver(0).items.mnistClassifier.items.fullyconnected(1).weights(300651) host ---> d0_s0 elements: 5000
diagnostic: [COPY   ] model.items.solver(0).tests.test(0).items.validationData.items.validationDb(300888) host ---> d0_s6 elements: 7840000
diagnostic: [COPY   ] model.items.solver(0).tests.test(0).items.validationData.items.validationDb(300889) host ---> d0_s6 elements: 10000
status    :   Test accuracy:     7.70%  epoch:  1.001  func: validationData
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.fullyconnected(1).bias(300653)  elements: 10
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.convolution(7).bias(300569)  elements: 20
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.fullyconnected(1).weights(300651)  elements: 5000
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.fullyconnected(3).bias(300628)  elements: 500
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.convolution(7).weights(300566)  elements: 500
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.convolution(5).weights(300596)  elements: 25000
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.convolution(5).bias(300599)  elements: 50
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.fullyconnected(3).weights(300626)  elements: 400000
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.fullyconnected(1).bias(351006)  elements: 10
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.convolution(7).bias(351011)  elements: 20
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.fullyconnected(1).weights(351016)  elements: 5000
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.fullyconnected(3).bias(351021)  elements: 500
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.convolution(7).weights(351026)  elements: 500
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.convolution(5).weights(351031)  elements: 25000
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.convolution(5).bias(351036)  elements: 50
diagnostic:   [MUTATE ] model.items.solver(0).items.mnistClassifier.items.fullyconnected(3).weights(351041)  elements: 400000
status    :   Test accuracy:    95.50%  epoch:  1.534  func: validationData:1

Connections Diagnostic Example

To examine the connections for the final flattened computational graph that will be executed

let model = Model {
  $0.log.logLevel = .diagnostic
  $0.log.categories = [.connections]
}
let modelName = "samples/mnist/mnistClassifierSolver"
guard let path = Bundle.main.path(forResource: modelName, ofType: "xml") else { exit(1) }
try model.load(contentsOf: URL(fileURLWithPath: path))
try model.setup()

This is the connections diagnostic output for the MNIST model with Solver and training databases. Note the full namespace path is shown for containers.

diagnostic:     container connections ------------------------------------------
diagnostic:     model.items.solver(0).items.mnistClassifier
diagnostic:       [.labels] --> [.labelsInput] (pass through)
diagnostic:       [.data] --> softmax(0).data
diagnostic:       • softmax(0).input --> fullyconnected(1).data
diagnostic:       • fullyconnected(1).input --> activation(2).data
diagnostic:       • activation(2).input --> fullyconnected(3).data
diagnostic:       • fullyconnected(3).input --> pooling(4).data
diagnostic:       • pooling(4).input --> convolution(5).data
diagnostic:       • convolution(5).input --> pooling(6).data
diagnostic:       • pooling(6).input --> convolution(7).data
diagnostic:       • convolution(7).input --> [.input]
diagnostic:       • softmax(0).labelsInput --> [.labelsInput]
status    :   default device: TITAN Xp   id: cuda.0
diagnostic:         
diagnostic:         container connections --------------------------------------
diagnostic:         model.items.solver(0).tests.test(0).items.validationData.items.function(1)
diagnostic:           [.labels] --> [.labelsInput] (pass through)
diagnostic:           [.data] --> softmax(0).data
diagnostic:           • softmax(0).input --> fullyconnected(1).data
diagnostic:           • fullyconnected(1).input --> activation(2).data
diagnostic:           • activation(2).input --> fullyconnected(3).data
diagnostic:           • fullyconnected(3).input --> pooling(4).data
diagnostic:           • pooling(4).input --> convolution(5).data
diagnostic:           • convolution(5).input --> pooling(6).data
diagnostic:           • pooling(6).input --> convolution(7).data
diagnostic:           • convolution(7).input --> [.input]
diagnostic:           • softmax(0).labelsInput --> [.labelsInput]
diagnostic:       
diagnostic:       container connections ----------------------------------------
diagnostic:       model.items.solver(0).tests.test(0).items.validationData
diagnostic:         [.labels] --> [.labelsInput] (pass through)
diagnostic:         [.data] --> accuracy(0).data
diagnostic:         • accuracy(0).input --> function(1).data
diagnostic:             direct connect: accuracy(0).input --> function(1).softmax(0).data
diagnostic:         • function(1).input --> validationDb.data
diagnostic:             direct connect: function(1).convolution(7).input --> validationDb.data
diagnostic:         • function(1).labelsInput --> validationDb.labels
diagnostic:             direct connect: function(1).softmax(0).labelsInput --> validationDb.labels
diagnostic:         • accuracy(0).labelsInput --> validationDb.labels
diagnostic:     
diagnostic:     container connections ------------------------------------------
diagnostic:     model.items.solver(0).tests.test(0)
diagnostic:       [.labels] --> [.labelsInput] (pass through)
diagnostic:       [.data] --> validationData.data
diagnostic:       • validationData.input --> [test(0).input]
diagnostic:       • validationData.labelsInput --> [test(0).labelsInput]
diagnostic:   
diagnostic:   container connections --------------------------------------------
diagnostic:   model.items.solver(0)
diagnostic:     [.labels] --> trainingDb.labels (redirected)
diagnostic:     [.data] --> mnistClassifier.data
diagnostic:     • mnistClassifier.input --> trainingDb.data
diagnostic:         direct connect: mnistClassifier.convolution(7).input --> trainingDb.data
diagnostic:     • mnistClassifier.labelsInput --> trainingDb.labels
diagnostic:         direct connect: mnistClassifier.softmax(0).labelsInput --> trainingDb.labels
diagnostic: 
diagnostic: container connections ----------------------------------------------
diagnostic: model
diagnostic:   [.labels] --> [.labelsInput] (pass through)
diagnostic:   [.data] --> solver(0).data
diagnostic:   • solver(0).input --> [model.input]
diagnostic:   • solver(0).labelsInput --> [model.labelsInput]

Object Tracking

The ObjectTracker singleton class is used to track init/deinit for all class objects. On init, all objects register and receive a unique tracking Id. This is used to report unreleased object instances in detail to simplify finding potential retain cycles.

The debuggerRegisterBreakId and debuggerRemoveBreakId member variables can be set to cause LLDB to break when hit. The getActiveObjectInfo function can be called to dump a list of active objects with detailed information to simplify identifying them.
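A minimal sketch of the register/remove pattern (hypothetical names; the real ObjectTracker also needs locking and records richer per-object information):

// Sketch only: objects register on init and remove themselves on deinit, so
// anything still registered at shutdown is a leak candidate.
final class ObjectTrackerSketch {
  static let shared = ObjectTrackerSketch()
  private var nextId = 0
  private var active = [Int: String]()   // trackingId -> type name

  func register(_ object: AnyObject) -> Int {
    nextId += 1
    active[nextId] = String(describing: type(of: object))
    return nextId
  }

  func remove(trackingId: Int) { active[trackingId] = nil }

  func activeObjectInfo() -> String {
    return active.map { "id \($0.key): \($0.value)" }.joined(separator: "\n")
  }
}

final class TrackedThing {
  private var trackingId = 0
  init() { trackingId = ObjectTrackerSketch.shared.register(self) }
  deinit { ObjectTrackerSketch.shared.remove(trackingId: trackingId) }
}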
