Extending options for populating Taken Date from file-metadata by AndyKilmory · Pull Request #4457 · guardian/grid · GitHub

Extending options for populating Taken Date from file-metadata #4457


Merged: 17 commits, Jun 10, 2025

AndyKilmory (Collaborator) commented Apr 23, 2025

What does this change?

Currently the code sets the Taken Date for an image from one of three fields in the file metadata (each either read directly from the embedded metadata or formed as a composite).

In some circumstances these fields are not correctly populated, and the Taken Date can be left unset even though the embedded metadata contains data that would allow a value to be set. This PR adds a further field to the list of fields scanned to establish a taken date: 'iptc:Date Created'. It has no time component and will only be used when no time data is available. It is scanned only if the existing three fields fail to yield a value, so it will not affect images whose embedded metadata is already supported.
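
As an illustration of the scan order only (not the code in this PR), a minimal sketch could look like the following. The three existing field names are placeholders, since they are not named in this description; only 'iptc:Date Created' comes from the text above, and `firstDateString` is a hypothetical helper.

```scala
object TakenDateFallback {
  // Placeholder keys for the three existing fields (hypothetical names,
  // the description does not list them).
  val existingFields: List[String] =
    List("existingField1", "existingField2", "existingField3")

  // The new last-resort field named in the description. It carries a date only.
  val lastResortField: String = "iptc:Date Created"

  // Returns the first non-empty value, scanning the existing fields first so
  // that images with already-supported metadata are unaffected by the new field.
  def firstDateString(metadata: Map[String, String]): Option[String] =
    (existingFields :+ lastResortField)
      .flatMap(key => metadata.get(key))
      .map(_.trim)
      .find(_.nonEmpty)
}
```

With this ordering, a value in 'iptc:Date Created' is only returned when none of the earlier fields holds a non-empty string.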

This PR also adds two date string patterns to the format list. Both are seen in incoming image data but are currently unsupported, which leads to a failure to set the taken date.

The main reason the taken date doesn't get set has now been identified: the taken date has a value after the upload time, and a rule discounts such values as incorrect. However, we have found that in a large number of cases this rule is too restrictive. The embedded metadata gives the time component in the local timezone where the picture was taken (or the timezone to which the camera is set), but in many cases it provides no timezone information, and under these circumstances we assume UTC, which can be incorrect. Images then appear to have a taken date after the upload time when in fact the discrepancy comes from an incorrect assumption about the timezone. This is particularly prevalent for images from Asia and Australia, which are many hours ahead of UTC.

To better handle this problem we have adapted the rule 'taken date after upload time' to be 'taken date is feasible given upload time'. Timezones can be up to 14 hours ahead of UTC, so we allow the taken date to be up to 'upload time + 14 hours'. If it is beyond this value it is still deemed incorrect and rejected, but if it falls within this window it is assumed that it could be correct, subject to timezone anomalies.
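
The feasibility rule can be sketched roughly as below. This is an assumption-laden illustration rather than the PR's code: it uses `java.time.Instant` instead of the Joda-Time `DateTime` seen in the diff, and `FeasibleTakenDate`, `maxHoursAheadOfUtc` and `feasible` are hypothetical names.

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

object FeasibleTakenDate {
  // Timezones can be up to 14 hours ahead of UTC, as stated above.
  val maxHoursAheadOfUtc: Long = 14L

  // Keeps a taken date (already interpreted as UTC) only if it falls
  // strictly before uploadTime + 14 hours; otherwise it is rejected.
  def feasible(takenAssumedUtc: Instant, uploadTime: Instant): Option[Instant] = {
    val feasibleMax = uploadTime.plus(maxHoursAheadOfUtc, ChronoUnit.HOURS)
    Some(takenAssumedUtc).filter(_.isBefore(feasibleMax))
  }
}
```

For an upload at 02:00 UTC, a taken date of 11:00 UTC the same day (9 hours "in the future") is kept, while one 24 hours after the upload is still rejected.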

With this change in place we have seen a significant reduction in the number of images without a taken date value. Some still persist: mostly PA images that are embargoed, where the taken date is set to the end of the embargo period (which can fall outside the allowed time window), and a small number of Shutterstock images that have genuinely infeasible taken date information.

How should a reviewer test this change?

Check that all supported date format patterns are interpreted correctly.
Ensure that, when the initial three fields are not present in the metadata, the fourth field is considered correctly.
Check that images with feasible taken time values (given possible time offsets from UTC) have the taken date set despite it being after the upload time.

Who should look at this?

Tested? Documented?

  • locally by committer
  • locally by Guardian reviewer
  • on the Guardian's TEST environment
  • relevant documentation added or amended (if needed)

AndyKilmory marked this pull request as ready for review May 7, 2025 16:21
AndyKilmory requested a review from a team as a code owner May 7, 2025 16:21
AndyKilmory marked this pull request as draft May 12, 2025 15:55
AndyKilmory marked this pull request as ready for review May 19, 2025 11:17
paperboyo (Contributor) commented May 19, 2025

Thanks for this! I think we should not go too far in trying to read information that is not reliable (these are all incorrect as per spec), and we should always press all suppliers to be as accurate as possible when it comes to date taken (and all metadata). Recording it correctly, with subseconds, time zone and daylight savings, is easiest when the camera settings are correct and no software rewriting the file destroys it afterwards. Some suppliers also decided (ages ago) to incorrectly use these fields for eg. date sent or embargo, and they must stop these practices. There are proper, spec-compliant fields such information can be saved into (eg. Release Date+Time, Date+Time Sent, License Start Date and others).

That said, it will never be perfect. I think extending feasibility to 14 hours may make sense, although I will leave this also for others to ponder.

On the two additional fields:

  1. xmp:CreateDate was left out on purpose. Every spec says that the real Date Taken needs to be recorded in other places we already read, and none of the reconciliation documentation treats them interchangeably. Maybe the difference is best described here for photoshop:DateCreated:
    [image]
    So photoshop:DateCreated and xmp:CreateDate are not the same thing. The former should be reconciled to (from) exif:DateTimeOriginal, while the latter to (from) exif:DateTimeDigitised. This was also clear from the old Metadata Working Guide (p. 37, Representation; watch out, some information there is out of date, eg. exif now contains time zones).
    These were the reasons not to read xmp:CreateDate as Date Taken. Are a) the fact that we would read it last and b) the fact that some of our suppliers may be observing specs less rigorously than we do good enough reasons to start reading it? We are coming back to the question posed at the beginning of this comment…
  2. I think reading iptc:Date Created is not controversial. We are creating a composite from all IPTC Create Date information here, but indeed we do not do it if the metadata is missing Time Created. So it’s fine to read it as a last resort.

I think pt. 1 warrants some discussion as does extending feasibility to 14 hours. I am not entirely against either, but I think that if we do it, we should do it consciously. Let us know if you have any thoughts, we will add ours too.

Or, given that it’s dates and time, we may just blindly merge it without any thinking. ;-)

AndyKilmory (Collaborator, Author) commented

Have removed the use of xmp:CreateDate for setting the taken date. With the revisions to timezone management I think it would hardly have been used. Have retained the 14 hour offset to allow Taken Date to be set for images where local timezone info has been forced into UTC due to lack of knowledge about the actual timezone. This seems to work well in retaining meaningful taken date information.

dateTimeFormatters.foldLeft[Option[DateTime]](None){
case (successfulDate@Some(_), _) => successfulDate
// NB We refuse parse results which result in future dates, if a max date is provided.
// eg If we get a pic today (22nd January 2021) with a date string of 20211201 we can be pretty sure
// that it should be parsed as (eg) US (12th Jan 2021), not EU (1st Dec 2021).
// So we refuse the (apparently successful) EU parse result.
case (None, formatter) => safeParsing(formatter.parseDateTime(str))
.filter(d => maxDate.forall(d.isBefore) && minDate.forall(d.isAfter))
.filter(d => feasibleMaxDate.forall(d.isBefore) && minDate.forall(d.isAfter))

Member

We've been kicking around the idea of setting the time portion to 0 to help communicate that while we're still willing to believe in the date the picture was taken, we're much less certain of the time... the logic becomes a bit more knotty, but how does this look to you?

      case (None, formatter) =>
        val parsedDate = safeParsing(formatter.parseDateTime(str))
        val dateInRange = parsedDate.filter(date => maxDate.forall(date.isBefore) && minDate.forall(date.isAfter))
        // It's possible that the camera has been set to the wrong time, meaning that it outputs timestamps that make no sense,
        // ie. are taken "in the future" relative to upload time. If it's within a given extra offset, allow the incorrect timestamp
        // in good faith, but set the time portion of the stamp to 0 to help communicate the uncertainty around the time portion.
        val feasibleMaxDate = maxDate.map(_.plusHours(utcHoursOffset))
        val feasibleDateButIncorrectTime = parsedDate.filter(date => feasibleMaxDate.forall(date.isBefore) && minDate.forall(date.isAfter)).map(_.withTime(0, 0, 0, 0))
        dateInRange.orElse(feasibleDateButIncorrectTime)
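
For readers who want to try the behaviour of this suggestion in isolation, here is a rough standalone sketch. It is not the reviewer's code: it swaps Joda-Time for `java.time.LocalDateTime`, assumes `utcHoursOffset` is 14 to match the window discussed in this PR, and `ZeroedTimeFallback.resolve` is a hypothetical name.

```scala
import java.time.LocalDateTime

object ZeroedTimeFallback {
  // Assumed value, matching the 14 hour window discussed in this PR.
  val utcHoursOffset: Long = 14L

  def resolve(
      parsed: LocalDateTime,
      maxDate: Option[LocalDateTime],
      minDate: Option[LocalDateTime]
  ): Option[LocalDateTime] = {
    val afterMin = minDate.forall(min => parsed.isAfter(min))

    // Strictly within the original bounds: keep the timestamp untouched.
    val dateInRange =
      Some(parsed).filter(d => afterMin && maxDate.forall(max => d.isBefore(max)))

    // Within the extended window only: keep the date but zero the time portion
    // to signal that the time of day is uncertain.
    val feasibleMaxDate = maxDate.map(_.plusHours(utcHoursOffset))
    val feasibleDateButIncorrectTime =
      Some(parsed)
        .filter(d => afterMin && feasibleMaxDate.forall(max => d.isBefore(max)))
        .map(_.toLocalDate.atStartOfDay)

    dateInRange.orElse(feasibleDateButIncorrectTime)
  }
}
```

A timestamp before the upload time passes through unchanged, one inside the extra 14 hours comes back as midnight of the same date, and anything later yields `None`.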

Collaborator, Author

Hi Andrew, thanks for the feedback. I'm not against your suggested changes, but I don't think they necessarily fix the issue; they just mask it in a different way. The underlying problem relates to the time information we are being sent in images from agencies. In most cases it appears that the timestamp on the image is the time taken in the local timezone, but no timezone information is included, so the code just assumes UTC as a 'best guess'. When this partial time information is coupled with the upload time of the image, and the lag between the image being taken and being sent across by the agency, we get the current peculiar behaviour of some images, notably those from Australia, China and the Far East, having the Taken Date left unset. Images from Eastern Europe and the Middle East sometimes end up without a taken date and sometimes have it set, depending on the lag between image taken and image upload, whereas images from the US almost always get Taken Date set as they are behind UTC. However, the vast majority of images, whether we set Taken Date or not, are subject to the same problem: the time element is normally local time and we're interpreting it as UTC.
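
A small worked example may make the effect concrete. The numbers are invented for illustration (a picture stamped 15:00 with no timezone, uploaded an hour after it was really taken), the fixed UTC+10 offset stands in for an Australian source, and `TimezoneAssumption` is a hypothetical name using `java.time` rather than the Joda-Time types in the codebase.

```scala
import java.time.{Instant, LocalDateTime, ZoneId, ZoneOffset}

object TimezoneAssumption {
  // Interprets a zone-less embedded timestamp as UTC (the 'best guess').
  def assumedUtc(embedded: LocalDateTime): Instant =
    embedded.toInstant(ZoneOffset.UTC)

  // What the instant would have been had the true zone been known.
  def actual(embedded: LocalDateTime, zone: ZoneId): Instant =
    embedded.atZone(zone).toInstant
}

object TimezoneAssumptionDemo extends App {
  val embedded = LocalDateTime.parse("2025-04-23T15:00:00") // local time, no zone recorded
  val upload   = Instant.parse("2025-04-23T06:00:00Z")      // one hour after the real moment

  val real    = TimezoneAssumption.actual(embedded, ZoneOffset.ofHours(10))
  val guessed = TimezoneAssumption.assumedUtc(embedded)

  println(s"real instant:    $real")
  println(s"assumed instant: $guessed")
  println(s"assumed is after upload: ${guessed.isAfter(upload)}")
}
```

Here the real instant is 05:00 UTC, but the UTC assumption yields 15:00 UTC, nine hours after the upload, so the old rule would discard it. The same 15:00 stamp from a source at UTC-4 would be read as four hours earlier than the real instant and would pass the old rule, even though it is just as wrong.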

My concern with the approach of nulling the time if it appears 'incorrect' is that it will suggest we're confident about the time taken on other Taken Date values, when they are quite probably just as 'wrong' as those we are nulling. It's just that the lag between time taken and upload time makes the time appear 'correct', or that the image was taken in a timezone behind UTC. So I'm not sure we're really adding value by introducing an arbitrary rule in this way. Perhaps we need to add some guidance on interpreting the time element of taken date? Users could use the location context of the picture to take a more nuanced view of the time taken information. We would also be losing information about sequencing within a particular group of pictures, which could be of value to users.

I have no strong feelings about the final solution we opt for - I would like to see us make some changes so that we're getting taken date set on more of the images - I'd be happy to go with your suggestion - it looks good from a technical perspective - but I'd just like some business input into the decision making process. I'll raise it at standup today and see what (if any) consensus there is in the BBC. Thanks

Member

Yep I agree entirely - my only point is that this is the only case where we know for sure that we don't know what the correct time would be. The rest of the time we've no way of checking whether the time is correct, so we have to trust the supplier has set the right date.

That said, it's a bit messy either way, so I'm very happy to merge as is while we think it through?

Contributor

Thanks both, very good points all around (eg. ordering!). I also don’t have strong feelings. But would just like to record some related thoughts. In general, I don’t think we should go too far in inventing data we don’t actually have…

  1. We have deficiencies when it comes to treating/showing/editing partial datetime data. Lack of timezone+daylight savings is a special case of partial data (granted: different to others). One day, as noted above, users should be able to tell we have only partial data and be able to edit/add it. (unlike January seen next to 01 indicating a potential partial date, there is absolutely no way of knowing we didn’t get the timezone+saving)
  2. Related to above, masking the fact we didn’t have full data lessens the pressure that should be building up to notify our suppliers and demand correct date taken values. Some suppliers are knowingly misusing this field and it needs to stop. Some got better after years of lobbying…

Collaborator, Author

Hi both, I spoke with the team at standup about this issue. David's feeling was that it was somewhat arbitrary to exclude the time from images coming from the Far East because the timezone issues make them appear to be taken in the future, but to accept times from the US because they happen to be running behind UTC: both values are equally 'incorrect'. His sense (and that of the team as a whole) was that we either allow feasible times to be included (we'll still not set Taken Date when agencies, eg. PA, are misusing the field) or we exclude times from all images where the field lacks the timezone information needed to interpret it correctly (this will mean nulling the time for a lot of agency images and losing the internal sequence information for any group of images). I think in the end the preference was to include the time when it's feasible from a timezone perspective, as originally suggested. Perhaps we could talk through these issues at the next grid hour? Without the agencies providing better data we're always going to be compromised, but that is a very long term aspiration!

Collaborator, Author

Thanks Andrew for approval

@andrew-nowak andrew-nowak merged commit fa6d732 into guardian:main Jun 10, 2025
1 check passed
prout-bot commented

Seen on auth, usage, metadata-editor, thrall, leases, cropper, collections, media-api (created by @AndyKilmory and merged by @andrew-nowak 9 minutes and 40 seconds ago) Please check your changes!

prout-bot commented

Seen on image-loader (created by @AndyKilmory and merged by @andrew-nowak 9 minutes and 45 seconds ago) Please check your changes!

prout-bot commented

Seen on kahuna (created by @AndyKilmory and merged by @andrew-nowak 9 minutes and 49 seconds ago) Please check your changes!

4 participants