This is a standalone clone of TensorFlow's gfile, supporting both local paths and gs:// (Google Cloud Storage) paths.
The main function is BlobFile, a replacement for GFile. There are also a few additional functions, basename, dirname, and join, which mostly do the same thing as their os.path namesakes, only they also support gs:// paths.
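To make the idea of os.path-style helpers that also understand gs:// paths concrete, here is a minimal sketch built only on the standard library. It is not blobfile's implementation: the names sketch_join, sketch_basename, sketch_dirname, and GCS_PREFIX are hypothetical, and the handling of edge cases (bucket-only paths, absolute components) is an assumption.

```python
import os.path
import posixpath

GCS_PREFIX = "gs://"  # hypothetical constant used only in this sketch


def is_gcs_path(path: str) -> bool:
    return path.startswith(GCS_PREFIX)


def sketch_join(base: str, *parts: str) -> str:
    """Join components; gs:// paths always use "/" regardless of the OS."""
    if is_gcs_path(base):
        rest = base[len(GCS_PREFIX):]
        return GCS_PREFIX + posixpath.join(rest, *parts)
    return os.path.join(base, *parts)


def sketch_basename(path: str) -> str:
    """Final component of the path; the bucket name is not a component."""
    if is_gcs_path(path):
        _bucket, _, key = path[len(GCS_PREFIX):].partition("/")
        return posixpath.basename(key)
    return os.path.basename(path)


def sketch_dirname(path: str) -> str:
    """Everything except the final component, keeping the gs://bucket part."""
    if is_gcs_path(path):
        bucket, _, key = path[len(GCS_PREFIX):].partition("/")
        parent = posixpath.dirname(key)
        return GCS_PREFIX + bucket + ("/" + parent if parent else "")
    return os.path.dirname(path)


joined = sketch_join("gs://my-bucket-name", "cats", "tabby.txt")
```

Local paths fall through to os.path unchanged, which is the sense in which such helpers "mostly do the same thing" as their namesakes.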
Installation:
pip install blobfile
Usage:
import blobfile as bf
with bf.BlobFile("gs://my-bucket-name/cats", "wb") as w:
    w.write(b"meow!")
Here are the functions:
BlobFile - like open() but works with gs:// paths too
LocalBlobFile - like BlobFile() but operations take place on a local file. For reading, the file is downloaded in the constructor; for writing, the file is uploaded on close() or during destruction. You can pass a cache_dir parameter to cache files for reading. You are responsible for cleaning up the cache directory.
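The download-in-constructor, upload-on-close behaviour described for LocalBlobFile can be illustrated with a small stand-in. This is only a sketch of the pattern, not the library's code: SketchLocalFile and FAKE_REMOTE are hypothetical, an in-memory dict plays the role of the bucket, and caching via cache_dir is left out.

```python
import os
import tempfile

FAKE_REMOTE = {}  # hypothetical in-memory stand-in for a gs:// bucket


class SketchLocalFile:
    """Work on a local temp file; sync with the 'remote' at open and close."""

    def __init__(self, path, mode="rb"):
        self._path = path
        self._mode = mode
        self._closed = False
        fd, self._local_path = tempfile.mkstemp()
        os.close(fd)
        if "r" in mode:
            # Reading: the whole remote object is fetched up front.
            with open(self._local_path, "wb") as f:
                f.write(FAKE_REMOTE[path])
        self._file = open(self._local_path, mode)

    def __enter__(self):
        return self._file

    def __exit__(self, exc_type, exc, tb):
        self.close()

    def close(self):
        if self._closed:
            return
        self._closed = True
        self._file.close()
        if "w" in self._mode:
            # Writing: nothing reaches the remote until the file is closed.
            with open(self._local_path, "rb") as f:
                FAKE_REMOTE[self._path] = f.read()
        os.remove(self._local_path)


with SketchLocalFile("gs://my-bucket-name/cats", "wb") as w:
    w.write(b"meow!")

with SketchLocalFile("gs://my-bucket-name/cats", "rb") as r:
    data = r.read()
```

One consequence of this design is that reads and writes are ordinary local file operations, at the cost of transferring the whole file even when only part of it is needed.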
Some are inspired by existing os.path and shutil functions:
copy - copy a file from one path to another
exists - returns True if the file or directory exists
glob - return files matching a pattern. On GCS this only supports the * operator and can be slow if the * appears early in the pattern: GCS can only do prefix matches, so all additional filtering must happen locally
isdir - returns True if the path is a directory
listdir - list contents of a directory
makedirs - ensure that a directory and all parent directories exist
remove - remove a file
rmdir - remove an empty directory
stat - get the size and modification time of a file
walk - walk a directory tree, yielding (dirpath, dirnames, filenames) tuples
basename - get the final component of a path
dirname - get the path except for the final component
join - join 2 or more paths together, inserting directory separators between each component
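The note on glob says that GCS can only match by prefix, so everything after the first * has to be filtered locally. The sketch below shows why an early * is expensive. It uses a plain list in place of a bucket listing; split_glob and sketch_glob are hypothetical names, and treating * as not crossing "/" is an assumption about the matching rules.

```python
import re


def split_glob(pattern):
    """Return (prefix, regex): the prefix narrows the listing, the regex filters locally."""
    star = pattern.find("*")
    prefix = pattern if star == -1 else pattern[:star]
    # Assumption for this sketch: "*" matches any run of characters except "/".
    regex = re.compile("[^/]*".join(re.escape(part) for part in pattern.split("*")))
    return prefix, regex


def sketch_glob(all_keys, pattern):
    prefix, regex = split_glob(pattern)
    # Stand-in for the server-side step: only a prefix match is available.
    candidates = [key for key in all_keys if key.startswith(prefix)]
    # Everything else is checked on the client.
    matches = [key for key in candidates if regex.fullmatch(key)]
    return candidates, matches


keys = ["cats/tabby.txt", "cats/siamese.txt", "cats/notes.md", "dogs/pug.txt"]

# Late "*": the prefix "cats/" already excludes the dogs/ entries.
late_candidates, late_matches = sketch_glob(keys, "cats/*.txt")

# Early "*": the prefix is empty, so every key has to be listed and filtered.
early_candidates, early_matches = sketch_glob(keys, "*/pug.txt")
```

With the late pattern only three keys are listed; with the early pattern all four are, and on a large bucket that difference is the listing cost.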
There are a few bonus functions:
cache_key - returns a cache key that can be used for the path (this is not guaranteed to change when the content changes, but usually should)
get_url - returns a URL for a path along with the expiration for that URL (or None)
md5 - get the md5 hash for a path. For GCS this is fast, but for other backends it may be slow
set_log_callback - set a log callback function log(msg: string) to use instead of printing to stdout
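A likely reason md5 can be slow outside GCS is that, without a stored hash to look up, the whole file has to be read and hashed. The following sketch shows that cost for a local file using only hashlib; sketch_md5 and its chunk_size parameter are hypothetical and are not part of blobfile.

```python
import hashlib
import os
import tempfile


def sketch_md5(path, chunk_size=1024 * 1024):
    """Hash a local file by streaming it in chunks; cost grows with file size."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Write a small temporary file, hash it, then clean up.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"meow!")
digest = sketch_md5(tmp.name)
os.remove(tmp.name)
```

Reading in fixed-size chunks keeps memory use flat, but every byte still has to pass through the hash, which is why the time scales with the file size.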