#DVFD9-47 URL format of CKAN resource parser to match URL provided by data.gov.au #8

WilsonWu2100 · 2021-06-28T04:46:53Z

URL format of CKAN resource parser needs to match URL provided by data.gov.au

…ded by data.gov.au

src/Parser/CkanResourceUrlParser.php

… for comments

src/Parser/CkanResourceUrlParser.php

thery-dh · 2021-06-28T05:12:51Z

src/Parser/CkanResourceUrlParser.php

@@ -14,7 +14,7 @@ class CkanResourceUrlParser implements CkanResourceUrlParserInterface {
  /**
   * Regex to check whether a CKAN resource path is valid.
   */
-  const CKAN_RESOURCE_PATH_REGEX = '/^\/dataset\/.*\/resource\/[^\/]+$/';
+  const CKAN_RESOURCE_PATH_REGEX = '/(?:resource\/|resource_id=)([a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{12})/';


I don't think that we should validate the resource id since if it ever changes this will break

@thery-dh I disagree, this is a robust way of ensuring a valid resource id, if it changes (I think this is unlikely) then the regex should be updated to reflect the changes

previously /dataset/fooo/resource/wrong would have indicated that wrong is a valid url + resource id. Now it is enforcing the correct format

While this is a good attempt to be thorough, I think it likely to cause issues rather than solving anything.
All this regex should do is make sure that there is a resource id and extract id. If the resource id is incorrect, then the subsequent API query will fail and should consider the resource as "non existent".
If you look at the last part of the regex for instance, it will only grab the last 12 chars of the resource id. Overtime this may very well increments to 13 or more as database grow and then you will get an incorrect id...same goes with additions of non alphanum chars, etc...

I don't believe that is correct @thery-dh the resource id is a hash and always formatted the same way with the same number of characters (likely for internal validation reasons). It is not a traditional id that increments.

In a brand new instance of CKAN the first record has a resource ID in that format, as does a recent resource id in a CKAN instance that has hundreds of thousands of resources eg https://data.gov.au/data/dataset/native-title-determination-applications-schedule/resource/d00049f3-fc1a-4536-b23e-d43f53254934

I challenge you to find a resource id that doesn't work with this ;) https://regex101.com/r/84R9UI/1/

Thanks for the info Jez.
You are right that the resource_id format is unlikely to change. Looking a bit deeper into it, it is a uuid generated in python: https://github.com/ckan/ckan/blob/master/ckan/model/resource.py#L35
using the uuid4() function: https://github.com/ckan/ckan/blob/5c386b8cd44ad41c8a94d7c7953c863d6e2f64bb/ckan/model/types.py#L18
making the format immutable.

However, I still believe that it should not be up to the connector to 8000 validate the id but instead simply used what is being passed, personal opinion I guess.

… get the resource_id from the url

#DVFD9-47 URL format of CKAN resource parser needs to match URL provi…

09866da

…ded by data.gov.au