Make lambda hook deterministic #372
Conversation
The old method would cause issues if the mtime of the files was different, since it was using the hash of the zipfile (and the zipfile includes headers for the mtime of each file).

This relies on generating a hash of each of the files to be uploaded, then hashing all those hashes. As long as the content hasn't changed, it should return the same hash each time.

This also stops relying on the ETags from S3, and instead uses the calculated hash to create the key name, and doesn't re-upload if the key already exists.
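The existence check on the upload side can be sketched roughly like this (a minimal illustration, not the hook's actual code; upload_if_missing and the key format are made up here, assuming a boto3 S3 client):

    import boto3
    from botocore.exceptions import ClientError


    def upload_if_missing(s3, bucket, key, zip_path):
        """Upload zip_path to bucket/key unless that key already exists."""
        try:
            # The key is derived from the content hash, so an existing
            # object means an identical bundle was already uploaded.
            s3.head_object(Bucket=bucket, Key=key)
            return False  # skip the re-upload
        except ClientError:
            s3.upload_file(zip_path, bucket, key)
            return True


    # Illustrative usage; the key format and bucket name are hypothetical:
    # key = "lambda-%s.zip" % _calculate_hash(files, root)
    # upload_if_missing(boto3.client("s3"), "my-bucket", key, "payload.zip")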
Codecov Report
@@            Coverage Diff             @@
##           master     #372      +/-   ##
==========================================
+ Coverage   87.62%   87.71%   +0.08%
==========================================
  Files          86       86
  Lines        4315     4346      +31
==========================================
+ Hits         3781     3812      +31
  Misses        534      534
Continue to review full report at Codecov.
@danielkza for some reason it's not letting me add you as a reviewer, maybe after mentioning you it will?
stacker/hooks/aws_lambda.py
Outdated
    for fname in files:
        f = os.path.join(root, fname)
        with open(f) as fd:
            file_hashes.append(hashlib.md5(fd.read()).hexdigest())
a) It is probably a good idea to open the file in "rb" mode to avoid newline shenanigans.
b) This might not catch file renames if the files are still produced in the same order. It might be necessary to sort files by their full path in the ZIP file, and prepend the path to the content before hashing.
    with open(os.path.join(root, ALL_FILES[0]), "w") as fd:
        fd.write("modified file data")
    hash3 = _calculate_hash(ALL_FILES, root)
It would be nice to add a test verifying that the same set of files with the same content, but different file names, generates a different hash.
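A minimal sketch of such a test, assuming the _calculate_hash(files, root) helper used above (the test name and _populate helper are made up for illustration):

    import os
    import tempfile

    from stacker.hooks.aws_lambda import _calculate_hash


    def _populate(root, names, data="same content"):
        # Write identical content under each of the given file names.
        for name in names:
            with open(os.path.join(root, name), "w") as fd:
                fd.write(data)


    def test_different_filenames_change_hash():
        root1, root2 = tempfile.mkdtemp(), tempfile.mkdtemp()
        _populate(root1, ["a.txt", "b.txt"])
        _populate(root2, ["c.txt", "d.txt"])
        # Same content, different names: the digests should differ
        # because the file name is mixed into the hash.
        assert _calculate_hash(["a.txt", "b.txt"], root1) != \
            _calculate_hash(["c.txt", "d.txt"], root2)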
Thanks @danielkza - updated to respond to your comments!
Force-pushed from 9ef7766 to a7bf254
stacker/hooks/aws_lambda.py
Outdated
    with open(f, "rb") as fd:
        file_hashes.append("%s:%s" % (fname,
                                      hashlib.md5(fd.read()).hexdigest()))
One last thing (sorry that I forgot to mention earlier): it's probably a bit more efficient to use a single hash object and incrementally append to it. That makes it easy to use "\0" to separate path and content, which is better since it is almost always disallowed in filenames, and it avoids reading the whole file into memory.
    files_hash = hashlib.md5()
    for fname in files:
        f = os.path.join(root, fname)
        with open(f, "rb") as fd:
            # NUL separates path from content; it can't appear in a file
            # name, so "a" + "bc" can't collide with "ab" + "c".
            files_hash.update(fname + "\0")
            # Stream the file in chunks instead of reading it all into memory.
            for chunk in iter(lambda: fd.read(4096), ''):
                files_hash.update(chunk)
    return files_hash.hexdigest()
LGTM, thanks for bearing with my nitpicks :)
other than sorting, lgtm
stacker/hooks/aws_lambda.py
Outdated
            str: A hash of the hashes of the given files.
        """
        file_hash = hashlib.md5()
        for fname in files:
Do you know if _find_files returns a sorted list? Should probably perform a sort on the files before iterating through them, since non-deterministic ordering will screw up the hash.
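Roughly, combining the streaming suggestion above with a sort over the file list would give something like this (a sketch of the shape, not necessarily the exact merged code; the encode call is only there to keep the snippet version-agnostic):

    import hashlib
    import os


    def _calculate_hash(files, root):
        file_hash = hashlib.md5()
        # Sort so the digest doesn't depend on whatever order _find_files
        # happens to return.
        for fname in sorted(files):
            with open(os.path.join(root, fname), "rb") as fd:
                file_hash.update(fname.encode("utf-8") + b"\0")
                # Stream in chunks instead of reading the file into memory.
                for chunk in iter(lambda: fd.read(4096), b""):
                    file_hash.update(chunk)
        return file_hash.hexdigest()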
And maybe add a test for this as well.
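Such an ordering test might look roughly like this (illustrative names, assuming the same _calculate_hash import as before):

    import os
    import tempfile

    from stacker.hooks.aws_lambda import _calculate_hash


    def test_file_order_does_not_change_hash():
        root = tempfile.mkdtemp()
        for i, name in enumerate(["a.txt", "b.txt", "c.txt"]):
            with open(os.path.join(root, name), "w") as fd:
                fd.write("content %d" % i)
        # The same files passed in a different order must yield the same hash.
        assert _calculate_hash(["a.txt", "b.txt", "c.txt"], root) == \
            _calculate_hash(["c.txt", "a.txt", "b.txt"], root)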
Force-pushed from 1ed5c09 to 16c9247
@ejholmes updated to sort the file lists before calculating the hash, and wrote a test to verify that if _calculate_hash gets a file list in a different order, the hashes still end up the same.
Looks great!
Make lambda hook deterministic
Fixes #370