Tags: hwhmusic/spark


cdh6.3.3-release

[SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read

Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters

Moving univocity-parsers version to spark-parent pom dependencyManagement section

Adding new utility method to build multi-char delimiter string, which delegates to existing one

Adding tests for multiple character delimited CSV

Adds support for parsing CSV data using multiple-character delimiters.  Existing logic for converting the input delimiter string to characters was kept and invoked in a loop.  Project dependencies were updated to remove redundant declaration of `univocity-parsers` version, and also to change that version to the latest.
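
A minimal sketch of that shape, assuming a hypothetical `unescapeOne` helper standing in for the existing single-character logic (neither name is the actual Spark utility):

```scala
object DelimiterUtils {
  // Hypothetical stand-in for the pre-existing single-character unescape logic.
  private def unescapeOne(s: String): Char = s match {
    case "\\t" => '\t'
    case "\\n" => '\n'
    case other if other.length == 1 => other.charAt(0)
    case other => throw new IllegalArgumentException(s"Unsupported delimiter piece: $other")
  }

  // Builds a multi-character delimiter by invoking the single-character logic in a
  // loop, consuming two input characters at a time when an escape sequence appears.
  def toDelimiterStr(str: String): String = {
    val out = new StringBuilder
    var i = 0
    while (i < str.length) {
      if (str.charAt(i) == '\\' && i + 1 < str.length) {
        out += unescapeOne(str.substring(i, i + 2))
        i += 2
      } else {
        out += unescapeOne(str.charAt(i).toString)
        i += 1
      }
    }
    out.toString
  }
}

// Example: DelimiterUtils.toDelimiterStr("|\\t|") yields "|\t|".
```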

It is quite common for people to have delimited data where the delimiter is not a single character but rather a sequence of characters. Such data is currently difficult to handle in Spark and typically needs pre-processing.

Yes, this is a user-facing change: specifying the "delimiter" option on a DataFrame read with more than one character no longer results in an exception. Instead, the value is converted as before and passed to the underlying library (Univocity), which has accepted multiple-character delimiters since 2.8.0.
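
For instance, a read that previously failed could now look like this (a sketch; the input path and header option are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-char-delimiter").getOrCreate()

// After this change, a delimiter longer than one character (here "||") is accepted
// and forwarded to the underlying Univocity parser instead of raising an exception.
val df = spark.read
  .option("delimiter", "||")   // multiple-character delimiter
  .option("header", "true")
  .csv("/path/to/data.csv")    // hypothetical input path

df.show()
```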

The `CSVSuite` tests were confirmed passing (including new methods), and `sbt` tests for `sql` were executed.

Closes #26027 from jeff303/SPARK-24540.

Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit 95de93b)

Cloudera ID: CDPD-6528

(cherry picked from commit 1700092963ba75bef4e56e06d60d6c0cf9771c42)
(cherry picked from commit 291990725d2cf01dda81a80a6eaea6858bae5ea2)

Change-Id: I3113116e27f8be58c7835c7a15f3f4805a42a7c0

cdh6.3.2-release

CLOUDERA-BUILD. Preparing for 6.3.2 release.

Commit performed on https://master-01.jenkins.cloudera.com/

Signed-off-by: cloudera <cauldron-dev@cloudera.com>

cdh6.3.1-release

CLOUDERA-BUILD. Preparing for 6.3.1 release.

Commit performed on https://master-01.jenkins.cloudera.com/

Signed-off-by: cloudera <cauldron-dev@cloudera.com>

cdh6.2.1-release

CLOUDERA-BUILD. Preparing for 6.2.1 release.

Commit performed on https://master-01.jenkins.cloudera.com/

Signed-off-by: cloudera <cauldron-dev@cloudera.com>

cdh6.3.0-release

CDH-80997. Handle nulls appropriately in SparkRackResolver.

Scala 2.11's Java collection converters don't handle null well, so
code that works fine on Scala 2.12 throws an NPE when run on 2.11.
Avoid the conversion to Scala types unless we know that the Java
collection is not null.
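
A minimal sketch of that guard, using a hypothetical resolver call rather than the actual SparkRackResolver code:

```scala
import java.util.{List => JList}
import scala.collection.JavaConverters._

object RackResolverSketch {
  // Hypothetical Java-side call that may return null (a stand-in for the real resolver).
  def resolveRaw(hosts: Seq[String]): JList[String] = null

  // Convert to a Scala collection only once the Java list is known to be non-null,
  // so .asScala is never invoked on a null reference under Scala 2.11.
  def resolve(hosts: Seq[String]): Seq[String] = {
    val raw = resolveRaw(hosts)
    if (raw == null) Seq.empty else raw.asScala.toSeq
  }
}
```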

Change-Id: Ie64d47a55fa19af33759ab9aaac07c82c190b630
(cherry picked from commit 6ab8d542b92e605c813737c53571fe3da3d50aa6)

spark2-2.4.0-cloudera2

Spark 2.4.2 GA Release

cdh5.16.2-release

Branching for 5.16.2 on Fri Apr 12 04:16:37 PDT 2019

JOB_NAME : 'Cut-Release-Branches'
BUILD_NUMBER : '579'
CODE_BRANCH : ''
OLD_CDH_BRANCH : 'cdh5_5.16.x'

Pushed to remote origin	git://github.infra.cloudera.com/CDH/spark.git (push)

spark2-2.4.0-cloudera1

CDH-78338. Support multihoming by introducing spark.executor.rpc.bindToAll

(cherry picked from commit b072bfc)
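
Assuming the new property is set like any other Spark configuration flag (a sketch; the value shown is an assumption, not documented Cloudera behavior):

```scala
import org.apache.spark.SparkConf

// Assumed usage on a multihomed host: ask executor RPC endpoints to bind to all interfaces.
val conf = new SparkConf()
  .setAppName("multihoming-example")
  .set("spark.executor.rpc.bindToAll", "true")  // property introduced by this change
```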

cdh6.2.0-release

CLOUDERA-BUILD. Preparing for 6.2.0 release.

Commit performed on https://master-01.jenkins.cloudera.com/

Signed-off-by: cloudera <cauldron-dev@cloudera.com>

cdh6.1.1-release

[SPARK-26680][SQL] Eagerly create inputVars while conditions are appropriate

When a user passes a Stream to groupBy, `CodegenSupport.consume` ends up lazily generating `inputVars` from a Stream, since the field `output` will be a Stream. At the time `output.zipWithIndex.map` is called, conditions are correct. However, by the time the map operation actually executes, conditions are no longer appropriate. The closure used by the map operation ends up using a reference to the partially created `inputVars`. As a result, a StackOverflowError occurs.

This PR ensures that `inputVars` is eagerly created while conditions are appropriate. It seems this was also an issue with the code path for creating `inputVars` from `outputVars` (SPARK-25767). I simply extended the solution for that code path to encompass both code paths.
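
The underlying Scala behaviour can be sketched outside Spark's codegen (a toy illustration, not the actual `CodegenSupport` code): a `Stream`'s tail is mapped lazily, so a closure over mutable state can observe a later value than intended, while forcing the result eagerly pins the value down.

```scala
// Toy illustration of the lazy-Stream pitfall described above (not Spark code).
object StreamLazinessSketch extends App {
  var suffix = "ok"

  // Lazy: only the head is mapped now; the tail's closures run when the Stream is forced.
  val lazyVars = Stream("a", "b", "c").map(v => s"$v-$suffix")

  // Eager: forcing the mapping immediately captures the current value of `suffix`.
  val eagerVars = Stream("a", "b", "c").map(v => s"$v-$suffix").toIndexedSeq

  suffix = "stale"

  println(lazyVars.toList)  // List(a-ok, b-stale, c-stale) -- tail evaluated after the change
  println(eagerVars)        // Vector(a-ok, b-ok, c-ok)
}
```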

Tested with the SQL unit tests (including a new test) and the Python tests.

Closes #23617 from bersprockets/SPARK-26680_opt1.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
(cherry picked from commit d4a30fa)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
(cherry picked from commit e8e9b11)

Cloudera ID: CDH-77168

Change-Id: I521aa05ca9d6d8e7766977419af89ea113c62485