MAPREDUCE-7403. manifest-committer dynamic partitioning support. #4728
Conversation
Declares its compatibility with the stream capability "mapreduce.job.committer.dynamic.partitioning". Spark will need to cast to StreamCapabilities and then probe. Change-Id: Iafcacc6d2491bb1e7fc2fc033c6d17d5b63b5b4f
…through Change-Id: Icc30bf6251977cfb76211bffcfc5796b1a44989b
* spark-side requirements
* why there is a risk if you use it at scale. That risk is low because Spark currently seems to rename sequentially; if/when it does parallel file renames, throttling may be triggered, with the consequent failure events.

Change-Id: I6e442bbdcaa007a3cd2e04ddf8b41d14c51057ff
Force-pushed from f62db61 to 82372d0.
Change-Id: I423f052ca48915502f182cb4f1c67cdf04838a99
🎊 +1 overall
This message was automatically generated.
Would be good to get some reviews here from @mukund-thakur, @mehakmeet and ideally @sunchao and @dongjoon-hyun, both of whom will be able to review the matching spark-side change, which is simply one of "don't reject attempts to use a PathOutputCommitter for dynamic partition overwrite if the instance created says it is OK".
I think the new assertions in TestManifestCommitProtocol.java are just defined but not executed (see the sketch after this exchange). Otherwise LGTM.
Assertions.assertThat(committer.hasCapability(
    ManifestCommitterConstants.CAPABILITY_DYNAMIC_PARTITIONING))
    .describedAs("dynamic partitioning capability in committer %s",
        committer);
Suggested change:
committer);
committer).isTrue();
did I just get my asserts wrong? that was bad. thanks!
Assertions.assertThat(bindingCommitter.hasCapability(
    ManifestCommitterConstants.CAPABILITY_DYNAMIC_PARTITIONING))
    .describedAs("dynamic partitioning capability in committer %s",
        bindingCommitter);
Suggested change:
bindingCommitter);
bindingCommitter).isTrue();
LGTM +1,
pending Attila's comments.
Change-Id: I29e98cf4ac607913d59e15babe6180434f665714
Thanks. Fixed tests, ran locally, and ran the abfs ITest subclass. All good.
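For anyone who hits the same AssertJ gotcha: `assertThat(...).describedAs(...)` only builds and names an assertion; nothing is evaluated until a terminal call such as `isTrue()`. A minimal standalone illustration (hypothetical values, not the PR's test code):

```java
import org.assertj.core.api.Assertions;

public class AssertJPitfall {
  public static void main(String[] args) {
    // Builds an assertion and attaches a description, but never checks it:
    // this statement passes silently even though the value is false.
    Assertions.assertThat(false)
        .describedAs("capability flag");

    // Only a terminal method actually asserts; this one throws
    // AssertionError, with the description above in its message.
    Assertions.assertThat(false)
        .describedAs("capability flag")
        .isTrue();
  }
}
```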
@@ -29,7 +29,7 @@
  * <li>Nothing else got through either.</li>
  * </ol>
  */
-public class AWSStatus500Exception extends AWSServiceIOException {
+public class jAWSStatus500Exception extends AWSServiceIOException {
I think this is a typo introduced in IntelliJ.
And I think Yetus failed only because of this.
aah
💔 -1 overall
This message was automatically generated.
Change-Id: Ifbe2d1012cbdf2e7467ce84a7d8d93a78e91dcf6
🎊 +1 overall
This message was automatically generated.
+1
Declares its compatibility with Spark's dynamic output partitioning by having the stream capability "mapreduce.job.committer.dynamic.partitioning".

Requires a Spark release with SPARK-40034, which does the probing before deciding whether to accept or reject instantiation with dynamic partition overwrite set.

This feature can be declared as supported by any other PathOutputCommitter implementations whose algorithm and destination filesystem are compatible. None of the S3A committers are compatible. The classic FileOutputCommitter is, but it does not declare itself as such out of our fear of changing that code; the Spark-side code will automatically infer compatibility if the created committer is of that class or a subclass.

Contributed by Steve Loughran.
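As an illustration of the mechanism described above, a minimal sketch of a PathOutputCommitter declaring the capability; the class name and no-op commit logic here are hypothetical, not the actual manifest committer code:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.StreamCapabilities;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter;

/** Hypothetical committer declaring dynamic partitioning support. */
public class ExamplePartitioningCommitter extends PathOutputCommitter
    implements StreamCapabilities {

  public static final String CAPABILITY_DYNAMIC_PARTITIONING =
      "mapreduce.job.committer.dynamic.partitioning";

  private final Path outputPath;

  protected ExamplePartitioningCommitter(Path outputPath,
      TaskAttemptContext context) throws IOException {
    super(outputPath, context);
    this.outputPath = outputPath;
  }

  /** The probe point: callers cast to StreamCapabilities and ask. */
  @Override
  public boolean hasCapability(String capability) {
    return CAPABILITY_DYNAMIC_PARTITIONING.equals(capability);
  }

  @Override
  public Path getOutputPath() {
    return outputPath;
  }

  // No-op stubs for the abstract OutputCommitter methods; a real
  // committer does its manifest/rename work in these.
  @Override public void setupJob(JobContext jobContext) { }
  @Override public void setupTask(TaskAttemptContext taskContext) { }
  @Override public boolean needsTaskCommit(TaskAttemptContext taskContext) {
    return false;
  }
  @Override public void commitTask(TaskAttemptContext taskContext) { }
  @Override public void abortTask(TaskAttemptContext taskContext) { }
}
```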
### What changes were proposed in this pull request?

Uses the StreamCapabilities probe in MAPREDUCE-7403 to identify when a PathOutputCommitter is compatible with dynamic partition overwrite.

This patch has unit tests but not integration tests; it really needs to test the SQL commands through the manifest committer into gcs/abfs, or at least the local fs. That would be possible once hadoop 3.3.5 is out...

### Why are the changes needed?

Hadoop 3.3.5 adds a new committer in mapreduce-core which works fast and correctly on azure and gcs. (It would also work on hdfs, but it's optimised for the cloud stores.) The stores and the committer do meet the requirements of Spark SQL Dynamic Partition Overwrite, so it is OK for spark to work through it. Spark does not know this; MAPREDUCE-7403 adds a way for any PathOutputCommitter to declare that it is compatible; the IntermediateManifestCommitter will do so. (apache/hadoop#4728)

### Does this PR introduce _any_ user-facing change?

No. There is documentation on the feature in the hadoop [manifest committer](https://github.com/apache/hadoop/blob/82372d0d22e696643ad97490bc902fb6d17a6382/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/manifest_committer.md) docs.

### How was this patch tested?

1. Unit tests in hadoop-cloud which work with hadoop versions with/without the matching change.
2. New integration tests in https://github.com/hortonworks-spark/cloud-integration which require spark to be built against hadoop with the manifest committer declaring compatibility.

Those new integration tests include:

* a spark sql test derived from spark's own [CloudRelationBasicSuite.scala#L212](https://github.com/hortonworks-spark/cloud-integration/blob/master/cloud-examples/src/test/scala/org/apache/spark/sql/sources/CloudRelationBasicSuite.scala#L212)
* Dataset tests extended to verify support for/rejection of dynamic partition overwrite: [AbstractCommitDataframeSuite.scala#L151](https://github.com/hortonworks-spark/cloud-integration/blob/master/cloud-examples/src/test/scala/com/cloudera/spark/cloud/committers/AbstractCommitDataframeSuite.scala#L151)

Tested against azure cardiff with the manifest committer; s3 london (s3a committers reject dynamic partition overwrites).

Closes #37468 from steveloughran/SPARK-40034-MAPREDUCE-7403-manifest-committer-partitioning.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Description of PR
Declares its compatibility with the stream capability "mapreduce.job.committer.dynamic.partitioning". Spark will need to cast to StreamCapabilities and then probe.
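A sketch of that probe, assuming only the public Hadoop APIs named above (the helper class and method names are illustrative):

```java
import org.apache.hadoop.fs.StreamCapabilities;
import org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter;

/** Illustrative helper showing how a caller such as Spark can probe. */
public final class DynamicPartitioningProbe {

  public static final String CAPABILITY_DYNAMIC_PARTITIONING =
      "mapreduce.job.committer.dynamic.partitioning";

  /**
   * True iff the committer implements StreamCapabilities and declares
   * the capability; committers without the interface are treated as
   * incompatible with dynamic partition overwrite.
   */
  public static boolean supportsDynamicPartitioning(
      PathOutputCommitter committer) {
    return committer instanceof StreamCapabilities
        && ((StreamCapabilities) committer)
            .hasCapability(CAPABILITY_DYNAMIC_PARTITIONING);
  }

  private DynamicPartitioningProbe() {
  }
}
```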
How was this patch tested?
SPARK-40034 has a PR with matching changes in the spark code, plus unit tests to verify that it's not an error to ask for dynamic partition overwrite if the committer's hasCapability holds.
apache/spark#37468
Testing
All the abfs tests run against azure cardiff. One transient failure; one new JIRA (HADOOP-18405: abfs testReadAndWriteWithDifferentBufferSizesAndSeek failure).
Unit tests of the spark code to check for the capability and reject if missing: CommitterBindingSuite.scala#L162
new integration tests in https://github.com/hortonworks-spark/cloud-integration
Those new integration tests include:
* a spark sql test derived from spark's own CloudRelationBasicSuite.scala#L212
* Dataset tests extended to verify support for/rejection of dynamic partition overwrite (AbstractCommitDataframeSuite.scala#L151)
Tested through a spark build with the matching patch against s3 london and azure cardiff.
GCS test setup is failing with OAuth problems in a way it was not on Friday; assuming unrelated.
For code changes:
- If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?