MAPREDUCE-7378. Change job temporary dir name to avoid delete by other jobs #4303
Conversation
cc @sunchao
💔 -1 overall
This message was automatically generated.
Force-pushed from ecfb617 to 61a07e2.
Force-pushed from 61a07e2 to 70517f0.
Force-pushed from 70517f0 to f90a0db.
-1 to this, or to any other changes to FileOutputCommitter which aren't critical bugs in its correct functioning against HDFS for a job with exclusive access to the table. Sorry.
We aren't going to make changes to FileOutputCommitter; consider it stable and too critical to change. You will be able to make changes to the manifest committer of #4075, which should be safer. The change there would be "only delete the job attempt dir (with the job unique ID), not all of _temporary".
See the JIRA I am linking your JIRA to for past discussion and future options.
@@ -57,6 +57,8 @@ public class FileOutputCommitter extends PathOutputCommitter {
   * committed yet.
   */
  public static final String PENDING_DIR_NAME = "_temporary";
+
+  public static String JOB_PENDING_DIR_NAME = "_temporary";
This doesn't actually work if you have multiple jobs running in the same process, which is exactly what Spark drivers do. If job 2 starts while job 1 is active, job 1 will be pointed at the job 2 temp dir. As a result, it will only commit those tasks which generate work after that switch, and not even notice a problem.
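The race the reviewer describes can be sketched in a few lines of standalone Java. The class and method names below are illustrative, not Hadoop's actual code; the point is that a mutable static field is shared by every job in the JVM, so the second job's setup silently redirects the first.

```java
// Sketch (hypothetical names) of why a mutable static field breaks with
// two jobs in one process, as in a Spark driver running concurrent jobs.
public class SharedStaticDemo {
    // Analogous to the proposed JOB_PENDING_DIR_NAME static field.
    static String jobPendingDirName = "_temporary";

    static void setJobPendingDirName(String jobId) {
        jobPendingDirName = "_temporary_" + jobId;
    }

    public static void main(String[] args) {
        setJobPendingDirName("job_1");       // job 1 starts
        String job1Dir = jobPendingDirName;  // job 1 captures "_temporary_job_1"
        setJobPendingDirName("job_2");       // job 2 starts in the same JVM
        // Job 1 now reads the static field and sees job 2's directory.
        System.out.println(job1Dir.equals(jobPendingDirName)); // prints "false"
    }
}
```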
+  private static void setJobPendingDirName(JobContext context) {
+    JOB_PENDING_DIR_NAME = "_temporary_" + context.getJobID();
There's an assumption here that the job ID is always unique. This doesn't hold for all Spark versions; see SPARK-33402.
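The uniqueness assumption can be illustrated with a trivial sketch (hypothetical job IDs; SPARK-33402 concerns Spark versions that derived job IDs from a second-granularity timestamp, so two jobs launched in the same second could receive the same ID):

```java
// Sketch of the reviewer's point: if two jobs get the same job ID,
// their supposedly unique pending dirs collide, defeating the fix.
public class JobIdCollision {
    static String pendingDir(String jobId) {
        return "_temporary_" + jobId;
    }

    public static void main(String[] args) {
        // Hypothetical: both jobs launched in the same second get the same ID.
        String jobA = pendingDir("job_20220512010203_0000");
        String jobB = pendingDir("job_20220512010203_0000");
        System.out.println(jobA.equals(jobB)); // prints "true": dirs collide
    }
}
```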
Thank you for your reply. I will look for another way to handle this problem. Once I find one, I will cc you. Thanks again ^_^
Thanks for your work anyway. We do plan to make a Hadoop release with the new committer later this year (it's shipping in Cloudera cloud releases in preview mode, so I'll be fielding support calls there on any issues). Any changes you can see there to support multi-job queries are welcome. You can download and build Hadoop branch-3.3 to test it in your environment.
Description of PR
When multiple jobs write concurrently to the same path, the first job to finish deletes the shared "_temporary" dir, causing the others to fail. So we need to give each job a separate temporary dir when writing data to a common output path.
JIRA: MAPREDUCE-7378
How was this patch tested?
Passed all the Hadoop tests.
For code changes:
Added a new field for the job temporary dir name that incorporates the job ID.
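As a hedged sketch of the general idea (not the actual patch, and not resolving the reviewer's objections about job-ID uniqueness), the per-job pending dir name could be held as an instance field of the committer rather than a mutable static, so that concurrent jobs in one JVM each keep their own value:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch: derive a per-job pending dir and store it per
// instance (not statically). Class and method names are hypothetical.
public class PerJobPendingDir {
    private final String jobPendingDirName;

    public PerJobPendingDir(String jobId) {
        this.jobPendingDirName = "_temporary_" + jobId;
    }

    public Path pendingDirFor(Path outputPath) {
        return outputPath.resolve(jobPendingDirName);
    }

    public static void main(String[] args) {
        PerJobPendingDir job1 = new PerJobPendingDir("job_202205_0001");
        PerJobPendingDir job2 = new PerJobPendingDir("job_202205_0002");
        // Each job resolves its own dir; neither can redirect the other.
        System.out.println(job1.pendingDirFor(Paths.get("/out")));
        System.out.println(job2.pendingDirFor(Paths.get("/out")));
    }
}
```

Because each committer instance owns its name, starting a second job cannot retarget the first, which was the core failure mode flagged in review.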