TEZ-4631: Include an official script that installs hadoop and tez and runs a simple example DAG by abstractdog · Pull Request #414 · apache/tez · GitHub

TEZ-4631: Include an official script that installs hadoop and tez and runs a simple example DAG #414


Merged
2 commits merged into apache:master on May 28, 2025

Conversation

@abstractdog (Contributor) commented May 16, 2025:

Running a simple Tez example from the terminal can be challenging, especially since the installation guide isn’t always up to date. In the long term, providing both a clear web presence and a convenience script would be essential for maintaining the project’s health and earning users’ trust.

The introduced dev-support/bin folder structure follows Hadoop's: https://github.com/apache/hadoop/tree/trunk/dev-support/bin

@abstractdog abstractdog requested a review from ayushtkn May 16, 2025 07:49
@tez-yetus (comment marked as outdated)

@ayushtkn (Member) left a comment:

Do we need to specify mapreduce.framework.name as yarn as well?

For me, it never used to work unless I specified export HADOOP_USER_CLASSPATH_FIRST=true.

Does it work for you without that? Even BigTop had to add it:
https://github.com/apache/bigtop/pull/1246/files#diff-f68b85f9302907e466b58d438376afb074df98fdbe571d30c188cd1767ff11eeR18

#$HADOOP_HOME/sbin/stop-dfs.sh
#$HADOOP_HOME/sbin/stop-yarn.sh

hdfs namenode -format
ayushtkn (Member) replied:

Did you run it twice? Usually, if there was a previous installation and you run namenode -format, it asks with a prompt, "are you sure you want to delete", and you need to give Y:

2025-05-16 13:43:11,940 INFO snapshot.SnapshotManager: SkipList is disabled
2025-05-16 13:43:11,942 INFO util.GSet: Computing capacity for map cachedBlocks
2025-05-16 13:43:11,942 INFO util.GSet: VM type       = 64-bit
2025-05-16 13:43:11,942 INFO util.GSet: 0.25% max memory 7.1 GB = 18.2 MB
2025-05-16 13:43:11,942 INFO util.GSet: capacity      = 2^21 = 2097152 entries
2025-05-16 13:43:11,949 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2025-05-16 13:43:11,949 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2025-05-16 13:43:11,949 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2025-05-16 13:43:12,062 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
2025-05-16 13:43:12,062 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2025-05-16 13:43:12,064 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2025-05-16 13:43:12,064 INFO util.GSet: VM type       = 64-bit
2025-05-16 13:43:12,064 INFO util.GSet: 0.029999999329447746% max memory 7.1 GB = 2.2 MB
2025-05-16 13:43:12,064 INFO util.GSet: capacity      = 2^18 = 262144 entries
Re-format filesystem in Storage Directory root= /tmp/hadoop-ayushsaxena/dfs/name; location= null ? (Y or N) 

abstractdog (Contributor, Author) replied:

There is a -force option for namenode format; let me try.

abstractdog (Contributor, Author) replied:

-force worked
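In the script, the non-interactive variant would look something like this (a sketch; only the -force flag is confirmed by the thread, the surrounding lines are assumptions):

```shell
# -force reformats without the interactive "(Y or N)" prompt shown above,
# so the script can run unattended even over a previous installation's
# leftover metadata directory.
hdfs namenode -format -force
```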

Comment on lines 81 to 82
hadoop fs -mkdir /apps/
hadoop fs -mkdir /apps/tez-$TEZ_VERSION
ayushtkn (Member) suggested:

hadoop fs -mkdir -p /apps/tez-$TEZ_VERSION

abstractdog (Contributor, Author) replied:

ack, will do

hadoop fs -copyFromLocal words.txt /words.txt

# finally run the example
hadoop jar $TEZ_HOME/tez-examples-$TEZ_VERSION.jar orderedwordcount /words.txt /words_out
ayushtkn (Member) commented:

Can we do yarn jar instead?

abstractdog (Contributor, Author) replied:

I haven't used the yarn executable so far; fine with changing, but for the record: what advantages does it have?

ayushtkn (Member) replied:

AFAIK for YARN it should be yarn jar
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/bin/hadoop.cmd#L186-L190

If you have YARN opts and such defined, it would shoot a warning as well. hadoop jar was for MR jobs, though it doesn't fail for Tez jobs today.
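Concretely, the suggested change to the submission line would look like this (a sketch based on the script excerpt quoted earlier in the thread):

```shell
# Before: the generic launcher, historically intended for MapReduce jobs.
hadoop jar $TEZ_HOME/tez-examples-$TEZ_VERSION.jar orderedwordcount /words.txt /words_out

# After: the YARN-aware launcher, which picks up YARN-specific opts and
# avoids the deprecation warning mentioned above.
yarn jar $TEZ_HOME/tez-examples-$TEZ_VERSION.jar orderedwordcount /words.txt /words_out
```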


# configure this if needed, by default it will use the latest stable versions in the current directory
export TEZ_VERSION=$(curl -s "https://downloads.apache.org/tez/" | grep -oP '\K[0-9]+\.[0-9]+\.[0-9]+(?=/)' | sort -V | tail -1) # e.g. 0.10.4
export HADOOP_VERSION=$(curl -s "https://downloads.apache.org/hadoop/common/" | grep -oP 'hadoop-\K[0-9]+\.[0-9]+\.[0-9]+(?=/)' | sort -V | tail -1) # e.g. 3.4.1
ayushtkn (Member) commented:

Shouldn't the Hadoop version come from the POM? The latest version isn't always going to work with Tez.

abstractdog (Contributor, Author) replied:

Good question; it depends on what we want to achieve with this script. Here is what I can think of:

  1. get Hadoop from the Tez pom.xml, as you advised
  2. both HADOOP_VERSION and TEZ_VERSION could be taken from the env if already defined (letting the user define any version for experimentation)

ayushtkn (Member) replied:

I'd say: if the user defines it, use it; else get it from the POM. I believe that is what the Hive Docker build script does as well.

abstractdog (Contributor, Author) replied:

yeah, makes sense, let me do the same
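The agreed precedence can be sketched as plain POSIX shell. The function name, the sample version numbers, and the sed-based POM parsing are illustrative assumptions, not the script's actual implementation:

```shell
#!/bin/sh
# Sketch: a version the user exported wins; otherwise fall back to the
# value parsed from the Tez pom.xml.

resolve_version() {
  # $1: value from the environment (possibly empty), $2: value from the POM
  if [ -n "$1" ]; then
    echo "$1"
  else
    echo "$2"
  fi
}

# A POM property like <hadoop.version>3.4.1</hadoop.version> could be
# extracted with sed (here fed from printf for a self-contained demo):
pom_version=$(printf '<hadoop.version>3.4.1</hadoop.version>\n' |
  sed -n 's:.*<hadoop.version>\(.*\)</hadoop.version>.*:\1:p')

resolve_version "" "$pom_version"        # user did not export anything
resolve_version "3.3.6" "$pom_version"   # user override wins
```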

Comment on lines 14 to 17
cd $HADOOP_STACK_HOME
wget -nc https://a 8000 rchive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz
wget -nc https://archive.apache.org/dist/tez/$TEZ_VERSION/apache-tez-$TEZ_VERSION-bin.tar.gz

ayushtkn (Member) commented:

Is some caching possible? Like, if it is already present, we don't download it again?

abstractdog (Contributor, Author) replied:

-nc (--no-clobber) is exactly what takes care of this
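For readers unfamiliar with the flag: -nc makes wget skip the download entirely when the target file already exists, which is what gives the script its caching behavior. The same guard, written out explicitly without wget (function name and file name are illustrative assumptions):

```shell
#!/bin/sh
# Sketch of the caching behavior that wget's -nc (--no-clobber) provides.
fetch_once() {
  # $1: local file name, $2: URL (the actual download command is elided)
  if [ -e "$1" ]; then
    echo "cached: $1"
  else
    echo "downloading: $1"
    : # wget "$2" -O "$1" would go here
  fi
}

touch hadoop-3.4.1.tar.gz   # simulate a previous run having downloaded it
fetch_once hadoop-3.4.1.tar.gz https://example.org/hadoop-3.4.1.tar.gz
rm -f hadoop-3.4.1.tar.gz
```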

@abstractdog (Contributor, Author) commented May 18, 2025:

Do we need to specify mapreduce.framework.name as yarn as well?

For me, it never used to work unless I specified export HADOOP_USER_CLASSPATH_FIRST=true.

Does it work for you without that? Even BigTop had to add it: https://github.com/apache/bigtop/pull/1246/files#diff-f68b85f9302907e466b58d438376afb074df98fdbe571d30c188cd1767ff11eeR18

Yeah, I can see this workaround happening everywhere, but here it has just worked OOTB, maybe due to some state of already-defined env vars like HADOOP_CLASSPATH? I don't know. What about:

  1. I keep playing with it to see if I can reproduce their problems
  2. can you try the script on your side to see if it works? if it works for you too without the additional export, we might want to publish it as is, proving that no further classpath hacks are needed

Let me check mapreduce.framework.name as well; for me, the script simply ran a Tez DAG, so I haven't configured anything more... but this is really interesting, I'll investigate.

@abstractdog (Contributor, Author) commented May 20, 2025:

Do we need to specify mapreduce.framework.name as yarn as well? For me, it never used to work unless I specified export HADOOP_USER_CLASSPATH_FIRST=true. Does it work for you without that? Even BigTop had to add it: https://github.com/apache/bigtop/pull/1246/files#diff-f68b85f9302907e466b58d438376afb074df98fdbe571d30c188cd1767ff11eeR18


Wow, that's indeed needed; otherwise I get an exotic exception like:

java.lang.IllegalAccessError: tried to access field com.google.protobuf.AbstractMessage.memoizedSize from class org.apache.tez.dag.api.records.DAGProtos$ConfigurationProto
	at org.apache.tez.dag.api.records.DAGProtos$ConfigurationProto.getSerializedSize(DAGProtos.java:21080)
	at com.google.protobuf.AbstractMessageLite.writeTo(AbstractMessageLite.java:75)
	at org.apache.tez.common.TezUtils.writeConfInPB(TezUtils.java:162)
	at org.apache.tez.common.TezUtils.createByteStringFromConf(TezUtils.java:82)
	at org.apache.tez.mapreduce.hadoop.MRInputHelpers.createMRInputPayload(MRInputHelpers.java:717)
	at org.apache.tez.mapreduce.input.MRInput$MRInputHelpersInternal.createMRInputPayload(MRInput.java:712)
	at org.apache.tez.mapreduce.input.MRInput$MRInputConfigBuilder.createGeneratorDataSource(MRInput.java:336)
	at org.apache.tez.mapreduce.input.MRInput$MRInputConfigBuilder.build(MRInput.java:266)
	at org.apache.tez.examples.OrderedWordCount.createDAG(OrderedWordCount.java:130)
	at org.apache.tez.examples.OrderedWordCount.runJob(OrderedWordCount.java:200)
	at org.apache.tez.examples.TezExampleBase._execute(TezExampleBase.java:245)
	at org.apache.tez.examples.TezExampleBase.run(TezExampleBase.java:126)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
	at org.apache.tez.examples.OrderedWordCount.main(OrderedWordCount.java:208)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
	at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
	at org.apache.tez.examples.ExampleDriver.main(ExampleDriver.java:51)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:328)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:241)

adding this export right before the Tez DAG submission

Maybe it ran successfully before because I ran the steps one by one with environment exports already in place, so something was messed up.
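Based on the thread, the fix lands in the script as a single export just before the submission step. A sketch, reusing the submission line quoted earlier (the explanatory comments are my reading of the IllegalAccessError above, not text from the PR):

```shell
# Make user-supplied jars precede Hadoop's own jars on the classpath, so
# Tez's bundled protobuf is loaded instead of Hadoop's older copy; mixing
# the two produces the IllegalAccessError in DAGProtos shown above.
export HADOOP_USER_CLASSPATH_FIRST=true

# finally run the example
hadoop jar $TEZ_HOME/tez-examples-$TEZ_VERSION.jar orderedwordcount /words.txt /words_out
```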

@tez-yetus commented:

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 28m 58s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 shelldocs 0m 0s Shelldocs was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ master Compile Tests _
+0 🆗 mvndep 2m 23s Maven dependency ordering for branch
_ Patch Compile Tests _
+0 🆗 mvndep 0m 8s Maven dependency ordering for patch
+1 💚 codespell 0m 4s No new issues.
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 shellcheck 0m 0s No new issues.
_ Other Tests _
+0 🆗 asflicense 0m 0s ASF License check generated no output?
31m 54s
Subsystem Report/Notes
Docker ClientAPI=1.49 ServerAPI=1.49 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-414/2/artifact/out/Dockerfile
GITHUB PR #414
Optional Tests dupname asflicense codespell detsecrets shellcheck shelldocs
uname Linux 301ac7039060 5.15.0-136-generic #147-Ubuntu SMP Sat Mar 15 15:53:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality /home/jenkins/jenkins-home/workspace/tez-multibranch_PR-414/src/.yetus/personality.sh
git revision master / bd94d8b
Max. process+thread count 60 (vs. ulimit of 5500)
modules C: U:
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-414/2/console
versions git=2.34.1 maven=3.6.3 codespell=2.0.0 shellcheck=0.7.1
Powered by Apache Yetus 0.15.1 https://yetus.apache.org

This message was automatically generated.

@abstractdog abstractdog requested a review from ayushtkn May 21, 2025 12:22
@ayushtkn (Member) left a comment:

Tried locally, and it works for me.

LGTM


@abstractdog abstractdog merged commit e847435 into apache:master May 28, 2025
2 checks passed
abstractdog added a commit to abstractdog/tez that referenced this pull request Jun 3, 2025
… runs a simple example DAG (apache#414) - addendum ASF license
abstractdog added a commit that referenced this pull request Jun 4, 2025
… runs a simple example DAG (#414) - addendum ASF license + shellcheck fixes (#417) (Laszlo Bodor reviewed by Ayush Saxena)