Pipelines is a language and runtime for crafting massively parallel pipelines. Unlike other languages for defining data flow, the Pipeline language requires component implementations to be defined separately, in Python. This separates the details of an implementation from the structure of the pipeline, while providing access to thousands of actively maintained libraries for machine learning, data analysis, and processing. Skip to Getting Started to install the Pipeline compiler.
As an introductory example, a simple pipeline for Fizz Buzz on even numbers could be written as follows -
from fizzbuzz import numbers
from fizzbuzz import even
from fizzbuzz import fizzbuzz
from fizzbuzz import printer
numbers
/> even
|> fizzbuzz where (number=*, fizz="Fizz", buzz="Buzz")
|> printer
Meanwhile, the implementation of the components would be written in Python -
def numbers():
    for number in range(1, 100):
        yield number

def even(number):
    return number % 2 == 0

def fizzbuzz(number, fizz, buzz):
    if number % 15 == 0: return fizz + buzz
    elif number % 3 == 0: return fizz
    elif number % 5 == 0: return buzz
    else: return number

def printer(number):
    print(number)
Running the Pipeline document would safely execute each component of the pipeline in parallel and output the expected result.
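Conceptually, the document above is equivalent to chaining plain Python generators. The following is a sequential sketch of the data flow only (the actual runtime executes the stages in parallel), reusing the component definitions from above:

```python
def numbers():
    for number in range(1, 100):
        yield number

def even(number):
    return number % 2 == 0

def fizzbuzz(number, fizz, buzz):
    if number % 15 == 0: return fizz + buzz
    elif number % 3 == 0: return fizz
    elif number % 5 == 0: return buzz
    else: return number

# numbers /> even |> fizzbuzz |> printer, written as generator chaining:
stream = (n for n in numbers() if even(n))               # /> filters the stream
results = [fizzbuzz(n, "Fizz", "Buzz") for n in stream]  # |> maps each element
for result in results:                                   # final stage: printer
    print(result)
```

The first few outputs are 2, 4, "Fizz", 8, "Buzz" — only even numbers reach the fizzbuzz stage because `/>` filters the stream rather than mapping over it.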
Components are scripted in Python and linked into a pipeline using imports. The syntax for an import has 3 parts - (1) the path to the module, (2) the name of the function, and (3) the alias for the component. Here's an example -
from parser import parse_fasta as parse
That's really all there is to imports. Once a component is imported it can be referenced anywhere in the document with the alias.
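Since the import syntax mirrors Python's, an import line could plausibly be resolved with Python's own import machinery. A hedged sketch of that idea — the `resolve_component` helper is hypothetical, and the stdlib names (`math`, `sqrt`) are used purely for illustration:

```python
import importlib

# Hypothetical resolver: (module path, function name, alias) -> registry entry.
def resolve_component(module_path, function_name, alias, registry):
    module = importlib.import_module(module_path)
    registry[alias] = getattr(module, function_name)

registry = {}
# analogous to a document line: from math import sqrt as root
resolve_component("math", "sqrt", "root", registry)
print(registry["root"](9.0))  # the component is now callable via its alias
```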
Every pipeline operates on a stream of data. The stream is created by a Python generator. The following generator produces a stream of the numbers 0 through 999.
def numbers():
    for number in range(0, 1000):
        yield number
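Because generators are lazy, nothing runs until a downstream component pulls a value; each element is produced on demand. A quick way to see this from plain Python is to slice the stream:

```python
from itertools import islice

def numbers():
    for number in range(0, 1000):
        yield number

# Only the first five elements are ever computed here.
first_five = list(islice(numbers(), 5))
print(first_five)  # [0, 1, 2, 3, 4]
```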
Here's a generator that reads entries from a file -
def customers():
    with open("customers.csv", "r") as file:
        for line in file:
            yield line
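Each yielded line is still raw CSV text, so a downstream component typically parses it into fields. A minimal sketch of such a component using the standard csv module — the function name and the sample row are made up for illustration:

```python
import csv

# Hypothetical parsing component: splits one raw CSV line into fields.
def parse_line(line):
    return next(csv.reader([line]))

sample = "Ada Lovelace,ada@example.com,1815\n"  # made-up row
fields = parse_line(sample)
print(fields)  # ['Ada Lovelace', 'ada@example.com', '1815']
```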
The first component in a pipeline is always the generator. It runs in parallel with the other components, and each element it yields is passed through the rest of the pipeline.
from utils import customers as customers # a generator function in the utils module
from utils import parse_row as parser
from utils import get_recommendations as recommender
from utils import print_recommendations as printer
customers |> parser |> recommender |> printer
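The parallel execution described above can be sketched in plain Python with one thread per stage and queues carrying elements between stages. This is an illustration of the execution model, not the actual Pipelines runtime, and the two stand-in components at the bottom are hypothetical:

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream

# Each stage consumes from its inbox and produces into its outbox.
def run_stage(func, inbox, outbox):
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate end-of-stream downstream
            break
        outbox.put(func(item))

def run_pipeline(generator, stages):
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in generator():       # the generator feeds the first queue
        queues[0].put(item)
    queues[0].put(SENTINEL)
    results = []                   # drain the final queue
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results

# hypothetical stand-ins for the components above
results = run_pipeline(lambda: iter(["a", "b"]), [str.upper, lambda s: s + "!"])
print(results)  # ['A!', 'B!']
```

Order is preserved because each stage is a single thread reading from a FIFO queue; the stages still overlap in time, since a stage can work on one element while its upstream neighbor produces the next.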