-
Notifications
You must be signed in to change notification settings - Fork 7
#DVFD9-47 URL format of CKAN resource parser to match URL provided by data.gov.au #8
#DVFD9-47 URL format of CKAN resource parser to match URL provided by data.gov.au #8
Conversation
…ded by data.gov.au
@@ -14,7 +14,7 @@ class CkanResourceUrlParser implements CkanResourceUrlParserInterface { | |||
/** | |||
* Regex to check whether a CKAN resource path is valid. | |||
*/ | |||
const CKAN_RESOURCE_PATH_REGEX = '/^\/dataset\/.*\/resource\/[^\/]+$/'; | |||
const CKAN_RESOURCE_PATH_REGEX = '/(?:resource\/|resource_id=)([a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{12})/'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that we should validate the resource id since if it ever changes this will break
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thery-dh I disagree, this is a robust way of ensuring a valid resource id, if it changes (I think this is unlikely) then the regex should be updated to reflect the changes
previously /dataset/fooo/resource/wrong
would have indicated that wrong
is a valid url + resource id. Now it is enforcing the correct format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this is a good attempt to be thorough, I think it likely to cause issues rather than solving anything.
All this regex should do is make sure that there is a resource id and extract id. If the resource id is incorrect, then the subsequent API query will fail and should consider the resource as "non existent".
If you look at the last part of the regex for instance, it will only grab the last 12 chars of the resource id. Overtime this may very well increments to 13 or more as database grow and then you will get an incorrect id...same goes with additions of non alphanum chars, etc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe that is correct @thery-dh the resource id is a hash and always formatted the same way with the same number of characters (likely for internal validation reasons). It is not a traditional id that increments.
In a brand new instance of CKAN the first record has a resource ID in that format, as does a recent resource id in a CKAN instance that has hundreds of thousands of resources eg https://data.gov.au/data/dataset/native-title-determination-applications-schedule/resource/d00049f3-fc1a-4536-b23e-d43f53254934
I challenge you to find a resource id that doesn't work with this ;) https://regex101.com/r/84R9UI/1/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the info Jez.
You are right that the resource_id format is unlikely to change. Looking a bit deeper into it, it is a uuid generated in python: https://github.com/ckan/ckan/blob/master/ckan/model/resource.py#L35
using the uuid4() function: https://github.com/ckan/ckan/blob/5c386b8cd44ad41c8a94d7c7953c863d6e2f64bb/ckan/model/types.py#L18
making the format immutable.
However, I still believe that it should not be up to the connector to 8000 validate the id but instead simply used what is being passed, personal opinion I guess.
… get the resource_id from the url
URL format of CKAN resource parser needs to match URL provided by data.gov.au