PySpark Demo with Search

This is based on Jon Haddad's Spark Training in notebook form. It's an awesome training guide and I highly recommend using it.

This demo is built to work with DataStax Enterprise 4.8+ (it relies on Spark 1.3 DataFrames).

Virtual Env

These steps use virtualenv, which is highly recommended because it keeps each app's packages separate.

Install virtualenv:

    sudo pip install virtualenv

Go to your application's directory and create the virtualenv there (env is the conventional name):

    mkdir /var/www/at_pyspark
    cd /var/www/at_pyspark
    git clone someURL .
    virtualenv env
    source env/bin/activate
    # If you're using fish shell (like I am), run this instead:
    # source env/bin/activate.fish

You are now in the virtualenv (your prompt should reflect this) and are ready to install other Python packages.
Type deactivate to exit the active virtualenv.
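
If the prompt doesn't make it obvious, a quick check from Python can confirm that the virtualenv is active. This is only a sketch: the helper name in_virtualenv is made up for illustration, and it relies on the sys.real_prefix / sys.base_prefix attributes that virtualenv and venv set.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases make sys.base_prefix differ from sys.prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("virtualenv active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

Inside the activated env, the reported prefix should point at the env directory created above.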

Setup

Start DSE in SearchAnalytics mode
- dse cassandra -s -k
- Spark Web UI: http://localhost:7080/
- Solr Web UI: http://localhost:8983/solr/
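
Before running the setup script, it can help to confirm that both web UIs listed above are answering. The following is a rough sketch, assuming Python 3 and the default ports shown above; is_reachable and SERVICES are illustrative names, not part of this repo.

```python
import urllib.error
import urllib.request

# URLs copied from the list above.
SERVICES = {
    "Spark Web UI": "http://localhost:7080/",
    "Solr Web UI": "http://localhost:8983/solr/",
}


def is_reachable(url, timeout=3.0):
    """Return True if the server at url answers an HTTP request at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so it is listening.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False


def report():
    """Print one status line per service."""
    for name, url in SERVICES.items():
        state = "up" if is_reachable(url) else "not reachable"
        print("{}: {} ({})".format(name, state, url))
```

Calling report() after starting DSE should show both services as up; DSE can take a little while to start, so a retry may be needed.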

Run ./setup.sh

This will:

- Download and unzip the MovieLens dataset
- Install the Python requirements
- Create the required keyspace and tables, and load the movie data for the exercises
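
Once the data is loaded, one way to sanity-check it from PySpark is to read the table into a DataFrame. The sketch below is not taken from this repo. It assumes the Spark 1.4+ DataFrameReader API (sql_context.read) and the Cassandra data source name used by the Spark Cassandra Connector; on Spark 1.3 the call would likely need to be sqlContext.load(...) instead. The keyspace and table names (at_pyspark, movie) are inferred from the dsetool command in this README, and load_cassandra_table is a hypothetical helper.

```python
def load_cassandra_table(sql_context, keyspace, table):
    """Return a DataFrame backed by a Cassandra table.

    Assumption: sql_context is a pyspark.sql.SQLContext (Spark 1.4+) running
    with the Spark Cassandra Connector on the classpath, as in DSE.
    """
    return (
        sql_context.read
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .load()
    )
```

In a dse pyspark shell, assuming it exposes a sqlContext, something like load_cassandra_table(sqlContext, "at_pyspark", "movie").show() should print a few movie rows.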

If you want to search the data:

dsetool create_core at_pyspark.movie generateResources=true reindex=true
Enable
- Get the current solrconfig.xml
- curl -o solrconfig.xml "http://localhost:8983/solr/at_pyspark.movie/admin/file?file=solrconfig.xml&contentType=text/xml;charset=utf-8"
- dsetool reload_core at_pyspark.movie reindex=true solrconfig=/Users/adamtourkow/ds/demos/at-pyspark/solrconfig.xml
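
With the core created, the data can be queried over Solr's HTTP API. Here is a minimal sketch, assuming Python 3 and the standard Solr /select handler with a JSON response. It uses the match-all query *:* because the field names in the generated schema aren't listed here; build_select_url and search are illustrative names.

```python
import json
import urllib.parse
import urllib.request

SOLR_BASE = "http://localhost:8983/solr"  # Solr URL from the Setup section
CORE = "at_pyspark.movie"                 # core created by dsetool create_core


def build_select_url(query="*:*", rows=5):
    """Build a Solr /select URL that asks for a JSON response."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    return "{}/{}/select?{}".format(SOLR_BASE, CORE, params)


def search(query="*:*", rows=5):
    """Run the query and return the list of matching documents."""
    with urllib.request.urlopen(build_select_url(query, rows), timeout=10) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    return payload["response"]["docs"]
```

Calling search() with no arguments should return up to five indexed movie documents as dictionaries once indexing has finished.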

Running the demo

Run the Ratings Generator Feed
- ipython 01.ratings_generator.py
--- The below is incomplete ---
Run the Spark Submit Stream
- dse spark-submit 02.spark_stream.py

TODO

- Add recommendations
- Add Faceted Search
- Add GeoSpatial

