Extending options for populating Taken Date from file-metadata by AndyKilmory · Pull Request #4457 · guardian/grid · GitHub

Extending options for populating Taken Date from file-metadata #4457


Open: AndyKilmory wants to merge 16 commits into base: main


AndyKilmory (Collaborator) commented Apr 23, 2025

What does this change?

Currently the code sets the Taken Date for an image from one of three fields in the file metadata (each either read directly from the embedded metadata or formed as a composite).

In some circumstances these fields are not correctly populated, and Taken Date is left unset even though the embedded metadata contains data that would allow a value to be set. This PR adds one more field to the list of fields scanned to establish a taken date. The new field is 'iptc:Date Created'. It has no time component and is only used when no time data is available: it is scanned only if the existing three fields fail to yield a value, so images whose embedded metadata is already supported are not affected.
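The fallback order described above can be sketched as follows. This is a minimal illustration in Python, not the project's code: the PR does not name the three existing fields, so the keys in `PRIMARY_FIELDS` and the function name `pick_taken_date_source` are hypothetical placeholders. Only 'iptc:Date Created' comes from the description.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Hypothetical names for the three existing fields; the PR does not list them.
PRIMARY_FIELDS: Sequence[str] = (
    "exif:DateTimeOriginal",
    "photoshop:DateCreated",
    "iptc:DateTimeCreatedComposite",
)

# The field added by this PR. It carries a date only, so it is consulted last.
FALLBACK_FIELD = "iptc:Date Created"


def pick_taken_date_source(metadata: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (field name, raw value) for the first populated field, else None."""
    for field in (*PRIMARY_FIELDS, FALLBACK_FIELD):
        value = (metadata.get(field) or "").strip()
        if value:
            return field, value
    return None
```

Because the new field sits at the end of the scan order, any image that already yields a value from one of the earlier fields produces the same result as before.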

This PR also adds two date string patterns to the format list. Both are seen in incoming image data but are currently unsupported, which leads to a failure to set the taken date.
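As a rough sketch of how a format list like this is typically applied, the snippet below tries each pattern in turn and returns the first successful parse. The patterns in `DATE_FORMATS` are illustrative assumptions (the PR does not state which two patterns it adds), and `parse_taken_date` is a hypothetical helper name. Values without zone information are treated as UTC, matching the assumption described in this PR.

```python
from datetime import datetime, timezone
from typing import Optional, Sequence

# Illustrative patterns only; the PR does not name the formats it adds.
DATE_FORMATS: Sequence[str] = (
    "%Y-%m-%dT%H:%M:%S%z",  # timestamp with an explicit UTC offset
    "%Y-%m-%dT%H:%M:%S",    # timestamp with no zone information
    "%Y:%m:%d %H:%M:%S",    # colon-separated date, as commonly seen in Exif
    "%Y-%m-%d",             # date only, no time component
)


def parse_taken_date(raw: str) -> Optional[datetime]:
    """Try each known pattern in order; return a UTC datetime or None."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # pattern did not match, try the next one
        if parsed.tzinfo is None:
            # No zone in the metadata: assume UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    return None
```

A string that matches none of the patterns yields `None`, which corresponds to the taken date being left unset.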

The main reason that the taken date doesn't get set has now been identified: the embedded taken date falls after the upload time, and the existing rule discards such values as incorrect. We have found that in a large number of cases this is too restrictive. The embedded metadata gives the time component of the taken date in the local timezone where the picture was taken (or the timezone to which the camera is set), but in many cases it carries no timezone information. Under these circumstances we assume UTC, which can be incorrect. The image then appears to have a taken date after the upload time when in fact the discrepancy comes from the incorrect timezone assumption. This is particularly prevalent for images from Asia and Australia, which are many hours ahead of UTC.

To better handle this problem we have changed the rule from 'taken date after upload time' to 'taken date is feasible given upload time'. Timezones can be up to 14 hours ahead of UTC, so we allow the taken date to be up to 'upload time + 14 hours'. A value beyond this is still deemed incorrect and rejected; a value inside this window is assumed to be potentially correct, subject to timezone anomalies.
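The feasibility rule can be expressed as a single comparison. The sketch below is an illustration in Python rather than the code in this PR; the names `MAX_UTC_OFFSET_AHEAD` and `is_feasible_taken_date` are hypothetical, and both arguments are assumed to be timezone-aware UTC datetimes.

```python
from datetime import datetime, timedelta, timezone

# The furthest a local timezone can be ahead of UTC, per the description above.
MAX_UTC_OFFSET_AHEAD = timedelta(hours=14)


def is_feasible_taken_date(taken: datetime, uploaded: datetime) -> bool:
    """Accept a taken date up to 14 hours after the upload time.

    Both values are assumed to be timezone-aware and expressed in UTC.
    """
    return taken <= uploaded + MAX_UTC_OFFSET_AHEAD


# A photo taken at 09:00 local time in a UTC+10 zone and uploaded half an hour
# later. The metadata has no zone, so 09:00 is read as 09:00 UTC.
uploaded = datetime(2025, 4, 22, 23, 30, tzinfo=timezone.utc)
taken_as_read = datetime(2025, 4, 23, 9, 0, tzinfo=timezone.utc)
print(is_feasible_taken_date(taken_as_read, uploaded))
```

In this example the taken date appears to be 9.5 hours after the upload, so the old 'after upload time' rule would discard it while the 14-hour window keeps it.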

With this change in place we have seen a significant reduction in the number of images without a taken date value. Some still persist: mostly PA images that are embargoed, where the taken date is set to the end of the embargo period and can fall outside the allowed time window, and a small number of Shutterstock images that have genuinely infeasible taken date information.

How should a reviewer test this change?

Check that all supported date format patterns are interpreted correctly.
Ensure that when the initial three fields are not present in the metadata, the fourth field is considered correctly.
Check that images with feasible taken date values (given possible time offsets from UTC) have the taken date set despite it being after the upload time.

Who should look at this?

Tested? Documented?

  • locally by committer
  • locally by Guardian reviewer
  • on the Guardian's TEST environment
  • relevant documentation added or amended (if needed)

@AndyKilmory AndyKilmory marked this pull request as ready for review May 7, 2025 16:21
@AndyKilmory AndyKilmory requested a review from a team as a code owner May 7, 2025 16:21
@AndyKilmory AndyKilmory marked this pull request as draft May 12, 2025 15:55
@AndyKilmory AndyKilmory marked this pull request as ready for review May 19, 2025 11:17
paperboyo (Contributor) commented May 19, 2025

Thanks for this! I think we should not go too far in trying to read information that is not reliable (these are all incorrect as per spec), and we should always press all suppliers to be as accurate as possible when it comes to date taken (and all metadata). Recording it correctly, with subseconds, time zone and daylight saving, is easiest when the camera settings are correct, provided that any software rewriting the file afterwards doesn’t destroy it. Some suppliers also decided (ages ago) to incorrectly use these fields for e.g. date sent or embargo, and they must stop these practices. There are proper, spec-compliant fields in which such information can be saved (e.g. Release Date+Time, Date+Time Sent, License Start Date and others).

That said, it will never be perfect. I think extending feasibility to 14 hours may make sense, although I will leave this also for others to ponder.

On the two additional fields:

  1. xmp:CreateDate was left out on purpose. Every spec says that the real Date Taken needs to be recorded in other places that we already read, and none of the reconciliation documentation treats them interchangeably. Maybe the difference is best described here for photoshop:DateCreated:
    [image]
    So photoshop:DateCreated and xmp:CreateDate are not the same thing. The former should be reconciled to (from) exif:DateTimeOriginal, while the latter to (from) exif:DateTimeDigitized. This was also clear from the old Metadata Working Guide (p. 37, Representation; watch out, some information there is out of date, e.g. exif now contains time zones).
    These were the reasons not to read xmp:CreateDate as Date Taken. Are a) the fact that we would read it last and b) the fact that some of our suppliers may be observing specs less rigorously than we are good enough reasons to start reading it? We are coming back to the question posed at the beginning of this comment…
  2. I think reading iptc:Date Created is not controversial. We are creating a composite from all IPTC Create Date information here, but we indeed do not do it if the metadata is missing Time Created. So it’s fine to read it as a last resort.

I think pt. 1 warrants some discussion as does extending feasibility to 14 hours. I am not entirely against either, but I think that if we do it, we should do it consciously. Let us know if you have any thoughts, we will add ours too.

Or, given that it’s dates and time, we may just blindly merge it without any thinking. ;-)

AndyKilmory (Collaborator, Author) commented

Have removed the use of xmp:CreateDate for setting the taken date; with the revisions to timezone management I think it would hardly have been used. Have retained the 14-hour offset, which allows Taken Date to be set for images where local-time values have been forced into UTC because the actual timezone is unknown. This seems to work well for retaining meaningful taken date information.
