GitHub - ericgarcia/avro_pyspark: Read Avro Files into Pyspark

ericgarcia/avro_pyspark


To build this, first check out the latest Spark source from https://github.com/apache/spark and run sbt/sbt publish-local. This puts Spark in Maven's local repository.

You will also want to build a distribution from this version of Spark with ./make-distribution.sh --tgz

Then from this repository, run mvn package to create avro-0.1.jar.

To start pyspark with this library, run SPARK_CLASSPATH=/tmp/avro-0.1.jar bin/pyspark from the distribution you created above.

To load an Avro file in pyspark, use:

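# Arguments: input path, InputFormat class, key class, value class.
# keyConverter names the converter class packaged in avro-0.1.jar.
avroRdd = sc.newAPIHadoopFile("/tmp/data.avro",
  "org.apache.avro.mapreduce.AvroKeyInputFormat",
  "org.apache.avro.mapred.AvroKey",
  "org.apache.hadoop.io.NullWritable",
  keyConverter="example.AvroGenericConverter")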
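
Because the RDD above is keyed by the Avro record and its value is only a NullWritable placeholder, a common next step is to keep just the keys. The following is a minimal sketch of that idea, not part of this repository: the helper names record_of and load_avro_records are hypothetical, and it assumes the converter hands each record to Python as an ordinary object (for example a dict).

```python
# Class names taken from the newAPIHadoopFile call in this README.
AVRO_INPUT_FORMAT = "org.apache.avro.mapreduce.AvroKeyInputFormat"
AVRO_KEY_CLASS = "org.apache.avro.mapred.AvroKey"
NULL_VALUE_CLASS = "org.apache.hadoop.io.NullWritable"


def record_of(pair):
    """Return the record from a (key, value) pair, dropping the placeholder value."""
    key, _value = pair
    return key


def load_avro_records(sc, path, key_converter="example.AvroGenericConverter"):
    """Hypothetical helper: load an Avro file and return an RDD of records only.

    `sc` is an existing SparkContext started with the jar on the classpath.
    """
    pairs = sc.newAPIHadoopFile(
        path,
        AVRO_INPUT_FORMAT,
        AVRO_KEY_CLASS,
        NULL_VALUE_CLASS,
        keyConverter=key_converter,
    )
    # Each element is (record, None); keep only the record.
    return pairs.map(record_of)
```

In the pyspark shell this would be used as records = load_avro_records(sc, "/tmp/data.avro"), followed by something like records.take(5) to inspect a few entries.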
