Tags: hwhmusic/spark


cdh6.3.3-release

[SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read

Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters

Moving univocity-parsers version to spark-parent pom dependencyManagement section

Adding new utility method to build multi-char delimiter string, which delegates to existing one

Adding tests for multiple character delimited CSV

Adds support for parsing CSV data using multiple-character delimiters.  Existing logic for converting the input delimiter string to characters was kept and invoked in a loop.  Project dependencies were updated to remove redundant declaration of `univocity-parsers` version, and also to change that version to the latest.
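
A minimal sketch of that shape, assuming a hypothetical `unescapeOne` helper standing in for the existing single-character logic (neither name is the actual Spark utility):

```scala
object DelimiterUtils {
  // Hypothetical stand-in for the pre-existing single-character unescape logic.
  private def unescapeOne(s: String): Char = s match {
    case "\\t" => '\t'
    case "\\n" => '\n'
    case other if other.length == 1 => other.charAt(0)
    case other => throw new IllegalArgumentException(s"Unsupported delimiter piece: $other")
  }

  // Builds a multi-character delimiter by invoking the single-character logic in a
  // loop, consuming two input characters at a time when an escape sequence appears.
  def toDelimiterStr(str: String): String = {
    val out = new StringBuilder
    var i = 0
    while (i < str.length) {
      if (str.charAt(i) == '\\' && i + 1 < str.length) {
        out += unescapeOne(str.substring(i, i + 2))
        i += 2
      } else {
        out += unescapeOne(str.charAt(i).toString)
        i += 1
      }
    }
    out.toString
  }
}

// Example: DelimiterUtils.toDelimiterStr("|\\t|") yields "|\t|".
```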

It is quite common for people to have delimited data where the delimiter is not a single character but rather a sequence of characters. Such data is currently difficult to handle in Spark and typically needs pre-processing.

Yes, this is a user-facing change: specifying the "delimiter" option on a DataFrame read with more than one character no longer results in an exception. Instead, the value is converted as before and passed to the underlying library (Univocity), which has accepted multiple-character delimiters since 2.8.0.
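
For instance, a read that previously failed could now look like this (a sketch; the input path and header option are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-char-delimiter").getOrCreate()

// After this change, a delimiter longer than one character (here "||") is accepted
// and forwarded to the underlying Univocity parser instead of raising an exception.
val df = spark.read
  .option("delimiter", "||")   // multiple-character delimiter
  .option("header", "true")
  .csv("/path/to/data.csv")    // hypothetical input path

df.show()
```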

The `CSVSuite` tests were confirmed passing (including new methods), and `sbt` tests for `sql` were executed.

Closes #26027 from jeff303/SPARK-24540.

Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit 95de93b)

Cloudera ID: CDPD-6528

(cherry picked from commit 1700092963ba75bef4e56e06d60d6c0cf9771c42)
(cherry picked from commit 291990725d2cf01dda81a80a6eaea6858bae5ea2)

Change-Id: I3113116e27f8be58c7835c7a15f3f4805a42a7c0

cdh6.3.2-release

CLOUDERA-BUILD. Preparing for 6.3.2 release.

Commit performed on https://master-01.jenkins.cloudera.com/

Signed-off-by: cloudera <cauldron-dev@cloudera.com>

cdh6.3.1-release

CLOUDERA-BUILD. Preparing for 6.3.1 release.

Commit performed on https://master-01.jenkins.cloudera.com/

Signed-off-by: cloudera <cauldron-dev@cloudera.com>

cdh6.2.1-release

CLOUDERA-BUILD. Preparing for 6.2.1 release.

Commit performed on https://master-01.jenkins.cloudera.com/

Signed-off-by: cloudera <cauldron-dev@cloudera.com>

cdh6.3.0-release

CDH-80997. Handle nulls appropriately in SparkRackResolver.

Scala 2.11's Java collection converters don't handle null well, so
code that works fine on Scala 2.12 throws an NPE when run on 2.11.
Avoid the conversion to Scala types unless we know that the Java
collection is not null.
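
A minimal sketch of that guard, using a hypothetical resolver call rather than the actual SparkRackResolver code:

```scala
import java.util.{List => JList}
import scala.collection.JavaConverters._

object RackResolverSketch {
  // Hypothetical Java-side call that may return null (a stand-in for the real resolver).
  def resolveRaw(hosts: Seq[String]): JList[String] = null

  // Convert to a Scala collection only once the Java list is known to be non-null,
  // so .asScala is never invoked on a null reference under Scala 2.11.
  def resolve(hosts: Seq[String]): Seq[String] = {
    val raw = resolveRaw(hosts)
    if (raw == null) Seq.empty else raw.asScala.toSeq
  }
}
```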

Change-Id: Ie64d47a55fa19af33759ab9aaac07c82c190b630
(cherry picked from commit 6ab8d542b92e605c813737c53571fe3da3d50aa6)

spark2-2.4.0-cloudera2

Spark 2.4.2 GA Release

cdh5.16.2-release

Branching for 5.16.2 on Fri Apr 12 04:16:37 PDT 2019

JOB_NAME : 'Cut-Release-Branches'
BUILD_NUMBER : '579'
CODE_BRANCH : ''
OLD_CDH_BRANCH : 'cdh5_5.16.x'

Pushed to remote origin	git://github.infra.cloudera.com/CDH/spark.git (push)

spark2-2.4.0-cloudera1

CDH-78338. Support multihoming by introducing spark.executor.rpc.bindToAll

(cherry picked from commit b072bfc)
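
Assuming the new property is set like any other Spark configuration flag (a sketch; the value shown is an assumption, not documented Cloudera behavior):

```scala
import org.apache.spark.SparkConf

// Assumed usage on a multihomed host: ask executor RPC endpoints to bind to all interfaces.
val conf = new SparkConf()
  .setAppName("multihoming-example")
  .set("spark.executor.rpc.bindToAll", "true")  // property introduced by this change
```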

cdh6.2.0-release

CLOUDERA-BUILD. Preparing for 6.2.0 release.

Commit performed on https://master-01.jenkins.cloudera.com/

Signed-off-by: cloudera <cauldron-dev@cloudera.com>

cdh6.1.1-release

[SPARK-26680][SQL] Eagerly create inputVars while conditions are appropriate

When a user passes a Stream to groupBy, `CodegenSupport.consume` ends up lazily generating `inputVars` from a Stream, since the field `output` will be a Stream. At the time `output.zipWithIndex.map` is called, conditions are correct. However, by the time the map operation actually executes, conditions are no longer appropriate. The closure used by the map operation ends up using a reference to the partially created `inputVars`. As a result, a StackOverflowError occurs.

This PR ensures that `inputVars` is eagerly created while conditions are appropriate. It seems this was also an issue with the code path for creating `inputVars` from `outputVars` (SPARK-25767). I simply extended the solution for that code path to encompass both code paths.
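
The underlying Scala behaviour can be sketched outside Spark's codegen (a toy illustration, not the actual `CodegenSupport` code): a `Stream`'s tail is mapped lazily, so a closure over mutable state can observe a later value than intended, while forcing the result eagerly pins the value down.

```scala
// Toy illustration of the lazy-Stream pitfall described above (not Spark code).
object StreamLazinessSketch extends App {
  var suffix = "ok"

  // Lazy: only the head is mapped now; the tail's closures run when the Stream is forced.
  val lazyVars = Stream("a", "b", "c").map(v => s"$v-$suffix")

  // Eager: forcing the mapping immediately captures the current value of `suffix`.
  val eagerVars = Stream("a", "b", "c").map(v => s"$v-$suffix").toIndexedSeq

  suffix = "stale"

  println(lazyVars.toList)  // List(a-ok, b-stale, c-stale) -- tail evaluated after the change
  println(eagerVars)        // Vector(a-ok, b-ok, c-ok)
}
```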

Tested with the SQL unit tests (including a new test) and the Python tests.

Closes #23617 from bersprockets/SPARK-26680_opt1.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
(cherry picked from commit d4a30fa)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
(cherry picked from commit e8e9b11)

Cloudera ID: CDH-77168

Change-Id: I521aa05ca9d6d8e7766977419af89ea113c62485