Description
Related to #748
We should make it possible to specify different splits for the same dataset.
This avoids the need to re-"prep" a dataset every time; a dataset will just be a set of files in a folder--without sub-directories for "train"/"val"/"test"--and the splits will be in a separate file in that directory.
Dataset/datapipe classes should accept a splits_path argument that defaults to None.
If the splits_path argument is None, the datapipe class looks in a default location for a single splits path, and raises a FileNotFoundError if it does not find one.
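A minimal sketch of that lookup, to make the proposed behavior concrete. The function name resolve_splits_path, the dataset_path parameter, and the default filename splits.json are all assumptions for illustration; this issue does not fix any of them. The sketch also raises when an explicitly given path does not exist, which is a design choice rather than something stated here.

```python
from pathlib import Path
from typing import Optional, Union

# Hypothetical default filename; the issue does not specify one.
DEFAULT_SPLITS_FILENAME = "splits.json"


def resolve_splits_path(
    dataset_path: Union[str, Path],
    splits_path: Optional[Union[str, Path]] = None,
) -> Path:
    """Return the splits file to use for the dataset at ``dataset_path``.

    If ``splits_path`` is None, fall back to the (assumed) default
    location inside the dataset directory.
    """
    dataset_path = Path(dataset_path)
    if splits_path is None:
        # No explicit splits file given: look in the default location.
        splits_path = dataset_path / DEFAULT_SPLITS_FILENAME
    else:
        splits_path = Path(splits_path)

    if not splits_path.is_file():
        raise FileNotFoundError(f"Splits file not found: {splits_path}")
    return splits_path
```

A dataset class could call this helper in its constructor, so that the same folder of files can be paired with different splits files without re-running prep.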
The splits_path will be distinct from what we now call dataset_csv_path. It will be a JSON file, basically metadata, that declares not only what we now call dataset_csv_path but also any other paths needed for a split. In the case of a frame classification dataset, these include the vectors of sample IDs and of indices within each sample.
Probably we should rename dataset_csv_path to something like inputs_targets_paths_csv for clarity.
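To illustrate what such a splits file might declare, here is a rough sketch for a frame classification dataset. Every key name, file name, and the load_split_paths helper are hypothetical; the actual schema is still to be decided. The sketch assumes all paths in the file are relative to the dataset directory that contains it.

```python
import json
from pathlib import Path

# Hypothetical contents of a splits JSON file. All keys and filenames
# below are illustrative assumptions, not a settled schema.
example_splits = {
    "inputs_targets_paths_csv": "inputs-targets-paths.csv",
    "splits": {
        "train": {
            "sample_ids_path": "train-sample-ids.npy",
            "inds_in_sample_path": "train-inds-in-sample.npy",
        },
        "val": {
            "sample_ids_path": "val-sample-ids.npy",
            "inds_in_sample_path": "val-inds-in-sample.npy",
        },
    },
}


def load_split_paths(splits_path, split):
    """Load the splits file and return absolute-ish paths for one split.

    Paths are resolved relative to the directory containing the splits file.
    """
    splits_path = Path(splits_path)
    with splits_path.open() as fp:
        metadata = json.load(fp)

    if split not in metadata["splits"]:
        raise ValueError(
            f"Split '{split}' not declared in {splits_path}; "
            f"found: {sorted(metadata['splits'])}"
        )

    dataset_dir = splits_path.parent
    paths = {
        "inputs_targets_paths_csv": dataset_dir / metadata["inputs_targets_paths_csv"]
    }
    # Add the per-split paths, e.g. sample ID and index vectors.
    for name, relative_path in metadata["splits"][split].items():
        paths[name] = dataset_dir / relative_path
    return paths
```

Keeping the paths relative would let the dataset folder be moved or shared without rewriting the splits file, though that is a choice this sketch makes, not a requirement stated above.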
So we'll need to:
- add splits_path to dataset classes
- modify how prep.frame_classification works so that it does not make split sub-directories