8000 Extending options for populating Taken Date from file-metadata by AndyKilmory · Pull Request #4457 · guardian/grid · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Extending options for populating Taken Date from file-metadata #4457

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jun 10, 2025
Merged
Changes from all commits
Commits
Show all changes
17 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4C0C
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,8 @@ object ImageMetadataConverter extends GridLogging {

// keep positive years (ie AD) with a couple of days spare room to avoid accidentally flipping over into BC
val earliestSensibleDate: DateTime = DateTime.parse("0001-01-03T00:00:00.000Z")
// offset hours representing possible range of + timezones ahead of UTC
private val utcHoursOffset: Int = 14

private def extractSubjects(fileMetadata: FileMetadata): Option[List[String]] = {
val supplementalCategories = fileMetadata.iptc
Expand Down Expand Up @@ -54,10 +56,19 @@ object ImageMetadataConverter extends GridLogging {

def fromFileMetadata(fileMetadata: FileMetadata, latestAllowedDateTime: Option[DateTime] = None, earliestAllowedDateTime: Option[DateTime] = Some(earliestSensibleDate)): ImageMetadata = {
def parseDate(dateString: String): Option[DateTime] = parseRandomDate(dateString, maxDate = latestAllowedDateTime, minDate = earliestAllowedDateTime)

// setting of taken date
val exifTaken: Option[String] = fileMetadata.exifSub.get("Date/Time Original Composite")
val iptcTaken: Option[String] = fileMetadata.iptc.get("Date Time Created Composite")
val xmpTaken: Option[String] = fileMetadata.readXmpHeadStringProp("photoshop:DateCreated")
val iptcDate: Option[String] = fileMetadata.iptc.get("Date Created")
val dateTaken: Option[DateTime] = exifTaken.flatMap(parseDate) orElse
iptcTaken.flatMap(parseDate) orElse
xmpTaken.flatMap(parseDate) orElse
iptcDate.flatMap(parseDate)

ImageMetadata(
dateTaken = (fileMetadata.exifSub.get("Date/Time Original Composite") flatMap parseDate) orElse
(fileMetadata.iptc.get("Date Time Created Composite") flatMap parseDate) orElse
(fileMetadata.readXmpHeadStringProp("photoshop:DateCreated") flatMap parseDate),
dateTaken = dateTaken,
description = fileMetadata.readXmpHeadStringProp("dc:description") orElse
fileMetadata.iptc.get("Caption/Abstract") orElse
fileMetadata.exif.get("Image Description"),
Expand Down Expand Up @@ -138,21 +149,25 @@ object ImageMetadataConverter extends GridLogging {
DateTimeFormat.forPattern("yyyyddMM"),
DateTimeFormat.forPattern("yyyy"),
DateTimeFormat.forPattern("yyyy-MM"),
DateTimeFormat.forPattern("yyyy:MM:dd"),
DateTimeFormat.forPattern("yyyy-MM-dd"),

// 2014-12-16 - Maybe it's just a date
// no timezone provided so force UTC rather than use the machine's timezone
ISODateTimeFormat.date.withZoneUTC
)

private[metadata] def parseRandomDate(str: String, maxDate: Option[DateTime] = None, minDate: Option[DateTime] = None): Option[DateTime] = {
// account for images sent with local timezone info being interpreted as UTC
val feasibleMaxDate: Option[DateTime] = maxDate.map(_.plusHours(utcHoursOffset))
dateTimeFormatters.foldLeft[Option[DateTime]](None){
case (successfulDate@Some(_), _) => successfulDate
// NB We refuse parse results which result in future dates, if a max date is provided.
// eg If we get a pic today (22nd January 2021) with a date string of 20211201 we can be pretty sure
// that it should be parsed as (eg) US (12th Jan 2021), not EU (1st Dec 2021).
// So we refuse the (apparently successful) EU parse result.
case (None, formatter) => safeParsing(formatter.parseDateTime(str))
.filter(d => maxDate.forall(d.isBefore) && minDate.forall(d.isAfter))
.filter(d => feasibleMaxDate.forall(d.isBefore) && minDate.forall(d.isAfter))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been kicking around the idea of setting the time portion to 0 to help communicate that while we're still willing to believe in the date the picture was taken, we're much less certain of the time... the logic becomes a bit more knotty, but how does this look to you?

      case (None, formatter) =>
        val parsedDate = safeParsing(formatter.parseDateTime(str))
        val dateInRange = parsedDate.filter(date => maxDate.forall(date.isBefore) && minDate.forall(date.isAfter))
        // It's possible that the camera has been set to the wrong time, meaning that it outputs timestamps that make no sense,
        // ie. are taken "in the future" relative to upload time. If it's within a given extra offset, allow the incorrect timestamp
        // in good faith, but set the time portion of the stamp to 0 to help communicate the uncertainty around the time portion.
        val feasibleMaxDate = maxDate.map(_.plusHours(utcHoursOffset))
        val feasibleDateButIncorrectTime = parsedDate.filter(date => feasibleMaxDate.forall(date.isBefore) && minDate.forall(date.isAfter)).map(_.withTime(0, 0, 0, 0))
        dateInRange.orElse(feasibleDateButIncorrectTime)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Andrew, thanks for the feedback - I'm not against your suggested changes but I don't think they necessarily fix the issue - just mask it in a different way. The underlying problem relates to the time information we are being sent in images from agencies - in most cases it appears that the timestamp on the image relates to the time taken in the local timezone, unfortunately no information about the timezone is included in this information - so the code just assumes UTC as a 'best guess'. When this partial time information is coupled to the upload time of the image, and the lag between the image being taken and being sent across by the agency, we get the current peculiar behaviour of some images - notably those from Australia, China and the Far East - having the Taken Date left unset. Images from Eastern Europe and the Middle East sometimes end up without a taken date and sometimes have taken date set - depending on the lag between image taken and image upload. Whereas images form the US almost always get Taken Date set as they are behind UTC. However, the vast majority of images - whether we set Taken Date or not - are subject to the same problem - the time element is normally local time and we're interpreting as UTC.

My concerns with the approach of nulling time if it appears 'incorrect' is that will suggest that we're confident about the time taken on other Taken Date values - but they are quite probably just as 'wrong' as those that we are nulling - its just that the lag between time taken and upload time makes the time appear 'correct' or that the image was taken in a timezone behind UTC. So I'm not sure if we're really adding that value by adding an arbitrary rule in this way? Perhaps we need to add in some guidance into interpreting the time element of taken date? Users could user the location context of the picture to take a more nuanced view of the time taken information. We would also be losing information regarding sequencing within a particular group of pictures - which could be of value to users?

I have no strong feelings about the final solution we opt for - I would like to see us make some changes so that we're getting taken date set on more of the images - I'd be happy to go with your suggestion - it looks good from a technical perspective - but I'd just like some business input into the decision making process. I'll raise it at standup today and see what (if any) consensus there is in the BBC. Thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep I agree entirely - my only point is that this is the only case where we know for sure that we don't know what the correct time would be. The rest of the time we've no way of checking whether the time is correct, so we have to trust the supplier has set the right date.

That said its a bit messy either way, so I'm very happy to merge as is while we think it through?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks both, very good points all around (eg. ordering!). I also don’t have strong feelings. But would just like to record some related thoughts. In general, I don’t think we should go too far in inventing data we don’t actually have…

  1. We have deficiencies when it comes to treating/showing/editing partial datetime data. Lack of timezone+daylight savings is a special case of partial data (granted: different to others). One day, as noted above, users should be able to tell we have only partial data and be able to edit/add it. (unlike January seen next to 01 indicating a potential partial date, there is absolutely no way of knowing we didn’t get the timezone+saving)
  2. Related to above, masking the fact we didn’t have full data lessens the pressure that should be building up to notify our suppliers and demand correct date taken values. Some suppliers are knowingly misusing this field and it needs to stop. Some got better after years of lobbying…

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi both, I spoke with the team at standup about this issue. David's feeling was that it was somewhat arbitrary to exlcude the time from images coming from the far east due to the timezone issues meaning they appear to be taken in the future, but to accept times from the US because they happen to be running behind UTC - both values are equally 'incorrect'. His sense (and that of the team as a whole) was that we either allow feasible times to be included (we'll still not set Taken Date when agencies - PA - are misusing the field) or we exclude times from all images where we don't have proper timezone information in the field to allow us to interpret it correctly (this will mean nulling time for a lot of agency images and losing the internal sequence information for any group of images). I think in the end the preference was to include time when its feasible from a timezone perspective as originally suggested. Perhaps we could have a talk through these issues at the next grid hour? Without the agencies providing better data we're always going to be compromised - but that is a very long term aspiration!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Andrew for approval

}.map(_.withZone(DateTimeZone.UTC))
}

Expand Down
Loading
0