Introduction
filecache is a Python module that abstracts away the location where files used or
generated by a program are stored. Files can be on the local file system (/ or
file://), in Google Cloud Storage (gs://), on Amazon Web Services S3 (s3://),
or on a webserver (http:// or https://). There is also a fake remote that can be
used to simulate a remote using the local filesystem as a file repository (fake://).
The fundamental concept is that of an isolated cache defined by a FileCache
instance.
Creating a FileCache Instance
There are two fundamental ways to use FileCache. The first is to create a
FileCache instance directly. The second is to use FileCache as a context manager.
Example of direct creation:
from filecache import FileCache
fc = FileCache()
# perform operations on fc
# fc lives forever
Example of context manager:
from filecache import FileCache
with FileCache() as fc:
    # perform operations on fc
    ...
# fc is deleted here and the cache may be cleaned up (see below)
FileCache Attributes
A FileCache instance contains a variety of attributes that specify the location
and behavior of the cache. These attributes are described below.
Cache Location
The location of the cache on the local file system is specified by a combination of a base directory and a subdirectory name.
The Base Directory
By default, the base directory is the standard temporary directory for the operating
system. This is typically C:\TEMP, C:\TMP, \TEMP, or \TMP on Windows and
/tmp, /var/tmp, or /usr/tmp on other platforms. The choice of temporary
directory can be overridden with the TMPDIR, TEMP, or TMP environment
variables, but this will affect other modules that use the system temporary directory such
as tempfile. To specify a different default base directory for all FileCache
instances without affecting the system temporary directory, set the
FILECACHE_CACHE_ROOT environment variable. Finally, to provide a different base
directory for a particular FileCache instance, pass the cache_root argument
to the FileCache constructor. The cache_root argument can be a string or
Path object.
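The precedence described above can be sketched as a small function. This is an illustration only, not the library's implementation: resolve_cache_root is a hypothetical helper, and it assumes that an explicit argument wins over the FILECACHE_CACHE_ROOT environment variable, which in turn wins over the system temporary directory.

```python
import os
import tempfile
from pathlib import Path


def resolve_cache_root(cache_root=None, environ=None):
    """Hypothetical helper: pick the base directory for a cache.

    Assumed precedence: explicit argument, then the FILECACHE_CACHE_ROOT
    environment variable, then the system temporary directory.
    """
    environ = os.environ if environ is None else environ
    if cache_root is not None:
        return Path(cache_root)  # accepts a string or a Path
    env_root = environ.get('FILECACHE_CACHE_ROOT')
    if env_root:
        return Path(env_root)
    # tempfile.gettempdir() itself honors TMPDIR, TEMP, and TMP
    return Path(tempfile.gettempdir())


print(resolve_cache_root('/data/caches'))
print(resolve_cache_root(environ={'FILECACHE_CACHE_ROOT': '/scratch'}))
print(resolve_cache_root(environ={}))
```

Passing environ explicitly is only a convenience for experimenting without modifying the real environment.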
The choice of base directory should be considered carefully. Depending on the operating system, the system temporary directory may be on its own filesystem, or be part of the root filesystem. Either way, the filesystem may not have enough free space to store the cache. In addition, on some operating systems the temporary directory is purged on reboot, making it unsuitable for long-term storage of cached files.
The Subdirectory Name
The subdirectory name is specified by the cache_name argument to the FileCache
constructor. The default value of cache_name is "global", which will result in the
subdirectory name _filecache_global. If a different name is specified, the subdirectory
name will be _filecache_<cache_name>. Finally, if cache_name is None, the cache will
be stored in a subdirectory with the prefix _filecache_ followed by a globally unique
identifier.
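The naming rule can be summarized in a few lines. The sketch below uses a hypothetical helper, cache_subdir_name, and assumes a UUID as the globally unique identifier; the format the library actually uses may differ.

```python
import uuid


def cache_subdir_name(cache_name='global'):
    """Hypothetical helper mirroring the naming rule described above."""
    if cache_name is None:
        # Assumption: a UUID stands in for the "globally unique identifier"
        return f'_filecache_{uuid.uuid4()}'
    return f'_filecache_{cache_name}'


print(cache_subdir_name())          # _filecache_global
print(cache_subdir_name('images'))  # _filecache_images
print(cache_subdir_name(None))      # _filecache_ followed by a unique identifier
```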
The choice of cache name primarily affects sharing. If more than one program, or instance
of the same program, is going to use the cached data (i.e. they download and use the same
files from the same remote source), then they can share the same cache by using a common
cache name. This increases performance and reduces disk space usage by avoiding duplicate
downloads of the same files. The default "global" cache is a convenient place to store
files that are needed by multiple programs if there is no need to otherwise segregate
them. However, if a uniquely-named cache is used, then sharing is not possible because the
cache name is not available to other programs.
Another consideration is the lifetime and maintenance of the cache. A cache can also be thought of as a grouping of files that share the same basic purpose. For some caches, the purpose is long-term storage of unchanging remote data. For others, the purpose is more ephemeral: files are downloaded for a single operation and are no longer needed once the operation is complete. See the Cache Lifetime section for details on automated cache maintenance. Manual maintenance of a cache is also possible once you know its base directory and subdirectory name. For example, you can clear the cache simply by deleting the cache directory. If caches are appropriately named and segregated, then you can clear the cache for one type of data without affecting other types.
Cache Lifetime
Caches are either permanent or ephemeral. A permanent cache will never be deleted by the
FileCache code and will persist on the local disk until it is manually deleted
(using rm or equivalent), explicitly deleted by the program (by calling the
FileCache.delete_cache() method), or deleted by the operating system on reboot, as
discussed above. In contrast, an ephemeral cache will be deleted when the
FileCache instance is deleted, which is at the exit of a context manager, when
the FileCache instance is deleted explicitly, or on program exit.
By default, named caches are permanent and unnamed caches (cache_name is None) are
ephemeral. This can be overridden by passing the delete_on_exit argument to the
FileCache constructor. If delete_on_exit is True, the cache will be
deleted whenever the FileCache instance is deleted. Be careful setting this to
True on a named cache, as it will delete that cache on program exit while other
programs may still be using it. In general, it is recommended to avoid this argument except
under very specific circumstances.
Preserving File Timestamps
Files on the local filesystem usually have multiple timestamps associated with them, such
as the modification time, access time, and creation time. However, files on remote sources
may or may not have similar timestamps. By default, for efficiency and generality,
timestamps are not preserved when downloading or uploading files. However, this can be
changed by setting the time_sensitive argument to the FileCache constructor.
If time_sensitive is True, then:
When a file is retrieved, the modification time from the source location, if available, is set on the local copy. If a local copy already exists, the times on both copies are compared and the local copy is updated if the source is newer.
When a file is uploaded, the modification timestamp on the remote copy is set to that of the local copy, if possible.
By default, when time_sensitive is True, the modification timestamps on both local
and remote copies are queried each time they are needed. Thus, if the remote copy is
changed by another program during the lifetime of the FileCache instance, when
FileCache.retrieve() is called, the local copy will be updated with the contents of
the new remote copy. However, if the remote copy is guaranteed not to change during the
lifetime of the FileCache instance, then this extra network traffic can be avoided
by setting the cache_metadata argument to True. This will cause the
FileCache instance to cache the metadata (such as modification time, size, and
is_dir) of remote files. Methods that iterate over the contents of a directory, such
as FileCache.iterdir() and FileCache.iterdir_metadata(), will also populate
the metadata cache, making future retrievals more efficient.
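The comparison performed for a time-sensitive cache can be illustrated with the standard library alone. This is a rough sketch under stated assumptions, not the library's code: local_copy_is_stale and stamp_local_copy are hypothetical helpers, and the remote modification time is simply passed in as a UNIX timestamp rather than queried from a real source.

```python
import os
import tempfile
from pathlib import Path


def local_copy_is_stale(local_path, remote_mtime):
    """Hypothetical helper: should the local copy be refreshed?"""
    local_path = Path(local_path)
    if not local_path.exists():
        return True  # nothing cached yet
    # Refresh only when the source is strictly newer than the local copy
    return remote_mtime > local_path.stat().st_mtime


def stamp_local_copy(local_path, remote_mtime):
    """Hypothetical helper: copy the source's modification time to the local file."""
    atime = Path(local_path).stat().st_atime  # leave the access time alone
    os.utime(local_path, (atime, remote_mtime))


with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / 'file.txt'
    local.write_text('cached contents')
    stamp_local_copy(local, 1_000_000_000)            # pretend source timestamp
    print(local_copy_is_stale(local, 1_000_000_000))  # same time: not stale
    print(local_copy_is_stale(local, 1_000_000_500))  # source is newer: stale
```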
Multi-Processor Safety
When more than one process is using the same cache, race conditions are possible if they
attempt to retrieve the same file at the same time. By default, a FileCache
instance will use multiprocessor-safe locking for ‘named’ (and thus possibly shared)
caches to protect the cache from concurrent writes. If a named cache is guaranteed not
to be shared between multiple processes, then this can be disabled by passing the
mp_safe argument to the FileCache constructor. If mp_safe is False,
then the FileCache instance will not use multiprocessor-safe locking, thus
improving performance. On the other hand, if mp_safe is True, then the
FileCache instance will always use multiprocessor-safe locking, even for unnamed
caches.
When multiprocessor-safe locking is used and multiple processes are attempting to retrieve
the same file at the same time, the first process to lock the file will be able to
retrieve it, while the other processes will wait for the lock to be released. The lock is
released when the file is retrieved successfully or when an error occurs. If the
retrieving process crashes or is killed and the lock is not successfully released, then it
is possible for the waiting processes to wait forever. To avoid this, a timeout can be
specified by passing the lock_timeout argument to the FileCache constructor
or to the individual method. However, caution should be exercised because if the timeout
is too short and file retrieval is slow, then the waiting processes may time out and raise
a TimeoutError prematurely.
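The wait-then-time-out behavior can be demonstrated with an in-process analogy. The sketch below uses threading.Lock as a stand-in for the cross-process lock, so it shows the semantics only and not the mechanism FileCache uses; retrieve_with_lock is a hypothetical helper, and treating lock_timeout as a number of seconds is an assumption.

```python
import threading


def retrieve_with_lock(lock, download, lock_timeout=None):
    """Hypothetical helper: run download() while holding the lock.

    lock_timeout is assumed to be in seconds; None means wait indefinitely.
    """
    timeout = -1 if lock_timeout is None else lock_timeout
    if not lock.acquire(timeout=timeout):
        raise TimeoutError('could not obtain the lock in time')
    try:
        return download()
    finally:
        # The lock is released on success and on error alike
        lock.release()


lock = threading.Lock()
print(retrieve_with_lock(lock, lambda: 'file contents', lock_timeout=1.0))

# Simulate a holder that never releases the lock (for example, a crashed process)
lock.acquire()
try:
    retrieve_with_lock(lock, lambda: 'never runs', lock_timeout=0.1)
except TimeoutError as exc:
    print(f'timed out: {exc}')
finally:
    lock.release()
```

A timeout comfortably longer than the slowest expected download avoids the premature TimeoutError described above.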
Parallelism
Many operations in FileCache are performed in parallel using multiple threads if
all URLs are provided at the same time. These include FileCache.retrieve(),
FileCache.upload(), FileCache.exists(), FileCache.is_dir(),
FileCache.modification_time(), and FileCache.unlink(). The number of threads
to use can be specified by passing the nthreads argument to the FileCache
constructor or to the individual method. The default value is 8. The useful number of
threads is limited by the number of files being processed, the number of CPUs, and the
speed of the source or network; increasing the number of threads beyond what the system
can handle may reduce performance.
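The general pattern of fanning a list of URLs out to a thread pool, while keeping results in input order, looks roughly like this. It is a sketch of the technique using concurrent.futures, not the library's implementation; run_in_parallel and fake_exists are hypothetical names, and no network access takes place.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(func, urls, nthreads=8):
    """Hypothetical helper: apply func to every URL using a thread pool.

    Results are returned in the same order as the input URLs.
    """
    if not urls:
        return []
    # There is no point in starting more threads than there are URLs
    workers = max(1, min(nthreads, len(urls)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, urls))


def fake_exists(url):
    """Stand-in for a network check; no real I/O happens here."""
    return url.endswith('.txt')


urls = ['https://data.com/data/file1.txt',
        'gs://my-bucket/data/file2.bin',
        'https://data.com/data/file3.txt']
print(run_in_parallel(fake_exists, urls, nthreads=2))
```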
URL Translation
For advanced users, two mapping systems are provided. The first translates URLs into URLs, and the second translates URLs into local paths.
URL to URL Translation
FileCache is particularly useful when the same data may be available from
multiple sources. For example, a dataset may be available for direct download from a
website as well as from a cloud storage bucket. Depending on the particular situation, it
may be desirable to choose one source over the other. Assuming the data is laid out
identically, all that is required is to change the prefix to the URL, for example from
https://data.com/data to gs://my-bucket/data, and all operations will work as
expected. This change can be done using standard Python string manipulation functions. For
example:
import os
data_source = os.environ.get('DATA_SOURCE', 'https://data.com/data')
url = f'{data_source}/file.txt'
However, if the data is laid out differently, additional logic is required to determine
the correct URL. The FileCache constructor (and each method) accepts a list of
functions that are used to translate URLs into URLs. Each function takes three arguments:
the scheme, the remote, and the path. The functions are called in order until one returns
a URL; if none does, the default applies and the URL is left unchanged. This allows the programmer
to write code for one particular data source, and then use the same code for other data
sources with different layouts.
See Example 7: URL to URL Translation for a usage example.
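As a rough illustration of what such a translator might look like, the sketch below defines one function with the three-argument signature and a small driver that applies the "first result wins" rule. Several details are assumptions: that a function signals "not handled" by returning None, that the path is passed without the scheme and remote, and the bucket name and layout themselves. The translate driver is hypothetical and only mimics the chaining described above.

```python
def web_to_bucket(scheme, remote, path):
    """Hypothetical translator: redirect data.com downloads to a bucket whose
    (assumed) layout replaces the leading 'data/' directory with 'archive/'."""
    rel = path.lstrip('/')
    if scheme in ('http', 'https') and remote == 'data.com' and rel.startswith('data/'):
        remainder = rel[len('data/'):]
        return f'gs://my-bucket/archive/{remainder}'
    return None  # not handled; let the next translator try


def translate(scheme, remote, path, translators):
    """Sketch of the chaining rule: the first non-None result wins."""
    for func in translators:
        new_url = func(scheme, remote, path)
        if new_url is not None:
            return new_url
    # Default: leave the URL unchanged
    rel = path.lstrip('/')
    return f'{scheme}://{remote}/{rel}'


print(translate('https', 'data.com', 'data/file.txt', [web_to_bucket]))
print(translate('https', 'other.org', 'docs/readme.txt', [web_to_bucket]))
```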
URL to Path Translation
Normally, the local directory structure for a cache mirrors that of the remote source. For
example, in the cache named "global", a file retrieved from
https://data.com/data/file11.txt will be stored in
<cache_root>/_filecache_global/http_data_com/data/file11.txt. The exact layout of the
cache is normally irrelevant, as FileCache methods automatically translate URLs
into local paths. In obscure cases, it may be desirable to store the data in a different
hierarchy. For example, this could be useful if the user is expected to manually inspect
the cache directory, but the layout of the remote source is confusing for some reason. In
this case, a mapping function can be used to translate the URL into a local path with a
more user-friendly format.
Each user-provided translator function takes five arguments: the scheme, the remote, the
path, the cache directory, and the cache subdirectory. It returns a string or
Path object giving the new absolute path of the cached file, or None if no
translation is desired. This translation is performed on the original URL, not the URL
generated by a URL to URL translator, if any.
The default translator function is:
from pathlib import Path

def default_url_to_path(scheme, remote, path, cache_dir, cache_subdir):
    if scheme == 'file':
        return Path(path)
    return cache_dir / cache_subdir / path
See Example 8: URL to Path Translation for a usage example.
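A custom translator with the same five-argument signature might look like the sketch below. The function name, the choice to flatten image files into one directory, and the argument values in the demonstration calls are all hypothetical; the only behavior taken from the text is that the function returns an absolute path or None.

```python
from pathlib import Path


def flatten_images(scheme, remote, path, cache_dir, cache_subdir):
    """Hypothetical translator: keep every .png from data.com in one flat
    'images' directory, regardless of where it lives on the server."""
    if remote == 'data.com' and path.endswith('.png'):
        return Path(cache_dir) / cache_subdir / 'images' / Path(path).name
    return None  # no translation desired; the default layout is used


# Illustrative argument values only
print(flatten_images('https', 'data.com', 'a/b/c/plot.png',
                     Path('/tmp/_filecache_global'), 'http_data_com'))
print(flatten_images('https', 'data.com', 'a/b/c/notes.txt',
                     Path('/tmp/_filecache_global'), 'http_data_com'))
```

Flattening a hierarchy like this risks name collisions when two remote directories contain files with the same name, so a real translator would need to account for that.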
Basic FileCache Operations
Following are the basic operations that can be performed on a FileCache instance. Other
properties and methods are available. See filecache Module for more details.
Check if a File Exists
The FileCache.exists() method is used to check if a file exists in the cache or on
the remote source. It takes a URL as an argument and returns True if the file exists,
False otherwise.
exists = fc.exists('https://data.com/data/file.txt')
Multiple URLs, including from different sources, can be checked at the same time by passing a list of URLs:
exists = fc.exists(['https://data.com/data/file1.txt',
                    'gs://my-bucket/data/file2.txt'])
In this case, the returned list of booleans will be in the same order as the input list of URLs.
If any kind of exception is encountered, the return value for that file will be
False. Thus, it is more accurate to say that the operation checks for the
accessibility of a file rather than its existence. For example, if the user does not have
access to a remote source, a file there will be reported as non-existent even if it
actually exists.
By default, FileCache.exists() first checks the local cache for the file. If the
file is found there, it is reported as existing. If the file is not found there, it is
checked on the remote source. If the local cache and remote source are out of sync (for
example the remote copy of the file was deleted outside of the FileCache
ecosystem), then the file could be reported as existing even though it is only the cached
version that exists. The bypass_cache argument can be used to override this behavior
and check the remote source directly without checking the local cache.
Retrieve a File
The FileCache.retrieve() method is used to retrieve a file from a remote source.
It takes a URL as an argument and returns a local path to the downloaded file.
path = fc.retrieve('https://data.com/data/file.txt')
Multiple URLs, including from different sources, can be retrieved at the same time by passing a list of URLs:
paths = fc.retrieve(['https://data.com/data/file1.txt',
                     'gs://my-bucket/data/file2.txt'])
In this case, the returned list of paths will be in the same order as the input list of URLs.
By default, if downloading any of the files results in a failure of some kind, the entire
operation will fail and an appropriate exception will be raised (when retrieving multiple
files, some, but not all, of the files may have been successfully downloaded before the
exception was encountered). This can be overridden by setting the exception_on_fail
argument to False. In this case, the returned list of paths will contain a mixture of
Path objects and Exception objects, where each Exception object
reports the reason for the failure of that file.
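A mixed list of this kind is usually easiest to handle by pairing each result with its URL and testing for Exception. The sketch below shows one way to do that; split_results is a hypothetical helper, and the results list is built by hand as a stand-in for a real return value, so no download takes place.

```python
from pathlib import Path


def split_results(urls, results):
    """Hypothetical helper: separate successes from failures.

    results has one entry per URL, each either a path or an Exception.
    """
    succeeded, failed = {}, {}
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            failed[url] = result
        else:
            succeeded[url] = result
    return succeeded, failed


urls = ['https://data.com/data/file1.txt',
        'gs://my-bucket/data/file2.txt']
# Hand-built stand-in for a list returned with exception_on_fail=False
results = [Path('/tmp/cache/file1.txt'), FileNotFoundError('file2.txt')]

ok, bad = split_results(urls, results)
for url, exc in bad.items():
    print(f'{url} failed: {exc!r}')
```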
Files can also be retrieved and opened in a single operation by using
FileCache.open() as a context manager.
with fc.open('https://data.com/data/file.txt') as f:
    content = f.read()
Get the Modification Time of a File
The FileCache.modification_time() method is used to get the modification time of a
file. It takes a URL as an argument and returns the modification time of the file on the
remote source as a UNIX timestamp. The local cache version of the file is ignored.
mtime = fc.modification_time('https://data.com/data/file.txt')
If the file does not exist or cannot be accessed, an exception is raised by
default. When called with exception_on_fail=False, those failures are
returned as Exception instances instead of being raised. If the
remote exists but no modification time is available, None is returned.
Multiple URLs, including from different sources, can be checked at the same time by
passing a list of URLs:
mtimes = fc.modification_time(['https://data.com/data/file1.txt',
                               'gs://my-bucket/data/file2.txt'])
In this case, the returned list of modification times will be in the same order as the input list of URLs.
If cache_metadata is True for the FileCache instance, then the
modification time is retrieved from the metadata cache if possible. This could give a
stale result if the remote file changed after its metadata was cached.
The remote source can be queried directly by passing
bypass_cache=True to the FileCache.modification_time() method.
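Because the returned value may be a UNIX timestamp, None, or (with exception_on_fail=False) an Exception, code that displays modification times typically needs to handle all three cases. The sketch below shows one way to do so with the standard library; describe_mtime is a hypothetical helper and the values passed to it are made up.

```python
from datetime import datetime, timezone


def describe_mtime(value):
    """Hypothetical helper: render a modification-time result for display."""
    if isinstance(value, Exception):
        return f'error: {value}'
    if value is None:
        return 'no modification time available'
    # Interpret the UNIX timestamp as UTC
    return datetime.fromtimestamp(value, tz=timezone.utc).isoformat()


print(describe_mtime(0))  # 1970-01-01T00:00:00+00:00
print(describe_mtime(None))
print(describe_mtime(FileNotFoundError('file.txt')))
```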
Upload a File
The FileCache.upload() method is used to upload a local copy of a file to a remote
source. It takes a URL as an argument and returns the local path of the file that was uploaded.
path = fc.upload('https://data.com/data/file.txt')
Multiple files, including for different sources, can be uploaded at the same time by passing a list of URLs:
paths = fc.upload(['https://data.com/data/file1.txt',
                   'gs://my-bucket/data/file2.txt'])
In this case, the returned list of paths will be in the same order as the input list of URLs.
By default, if uploading any of the files results in a failure of some kind, the entire
operation will fail and an appropriate exception will be raised (when uploading multiple
files, some, but not all, of the files may have been successfully uploaded before the
exception was encountered). This can be overridden by setting the exception_on_fail
argument to False. In this case, the returned list of paths will contain a mixture of
Path objects and Exception objects, where each Exception object
reports the reason for the failure of that file.
Files can also be opened, written, and uploaded in a single operation by using
FileCache.open() as a context manager.
with fc.open('https://data.com/data/file.txt', 'w') as f:
    f.write('Hello, World!')
# File is automatically uploaded when the context manager for the file handle exits
Delete a File
The FileCache.unlink() method is used to delete a file from the cache and the remote
source. It takes a URL as an argument and returns the local path of the file that was
deleted.
path = fc.unlink('https://data.com/data/file.txt')
Multiple files, including from different sources, can be deleted at the same time by passing a list of URLs:
paths = fc.unlink(['https://data.com/data/file1.txt',
                   'gs://my-bucket/data/file2.txt'])
In this case, the returned list of paths will be in the same order as the input list of URLs.
By default, if deleting any of the files results in a failure of some kind, the entire
operation will fail and an appropriate exception will be raised (when deleting multiple
files, some, but not all, of the files may have been successfully deleted before the
exception was encountered). This can be overridden by setting the exception_on_fail
argument to False. In this case, the returned list of paths will contain a mixture of
Path objects and Exception objects, where each Exception object
reports the reason for the failure of that file.
FCPath
While the FileCache class provides direct oversight of files in a cache along
with the necessary methods to manipulate them, it is often more convenient to operate on
URLs and local files using the simpler syntax provided by the Python pathlib.Path
class. The FCPath class is a reimplementation of the Path class to
support remote access using an associated FileCache. It supports all of the
common path operations provided by Path, as well as the ability to operate on
both URLs and local files.
Every FCPath instance must be associated with a FileCache instance. An
FCPath instance can be created by using the FileCache.new_path() method, in which
case the association is automatic. Alternatively, an FCPath instance can be created by
calling the FCPath constructor directly. In this case, by default a new FileCache
instance is created with default parameters (and thus is named "global" and is permanent and
not time-sensitive). A specific FileCache instance can be specified by passing it to the
FCPath constructor using the filecache keyword argument.
Other parameters are available for the FCPath constructor or the
FileCache.new_path() method (such as nthreads and url_to_url) that will
override the standard values defined in the associated FileCache instance.
When FCPath instances are combined by using the / operator, the
FileCache association is inherited from the left-hand side FCPath instance.
For example:
root = FCPath('https://data.com')  # Defaults to the "global" cache
path = root / 'data' / 'file.txt'
# The path is 'https://data.com/data/file.txt' in the "global" cache
with FileCache(None, time_sensitive=True) as fc:
    root = FCPath('https://data.com', filecache=fc)
    path = root / 'data' / 'file.txt'
    # The path is 'https://data.com/data/file.txt' in the specified cache
Passing an FCPath to the FCPath constructor will create a new FCPath
instance with the same FileCache association. This means that if you write a
function that takes a string, Path, or FCPath, you can effectively cast
any input into an FCPath by calling the constructor, without risk of losing the
FileCache association:
def my_function(path: str | Path | FCPath):
    fc_path = FCPath(path)
    # fc_path is now an FCPath instance with the same FileCache
    # association as the input, if available, otherwise "global"
All FileCache methods can be called on FCPath instances. For example:
fc = FileCache(None)
fc_path = FCPath('https://data.com/data/file.txt', filecache=fc)
exists = fc_path.exists()
# is equivalent to...
exists = fc.exists('https://data.com/data/file.txt')
All FCPath methods that are versions of FileCache methods take an
optional sub_path argument that is a path relative to the FCPath instance. This
allows a single FCPath instance to be used to reference a parent directory, and then
the sub-path can be used for specific files within that directory. This is effectively shorthand
for the / operator. For example:
root = FCPath('https://data.com/data')
exists = root.exists('file.txt')
# is equivalent to...
fc_path = FCPath('https://data.com/data/file.txt')
exists = fc_path.exists()
Other parameters are also available (such as nthreads and url_to_url) that will
override the standard values defined in the associated FileCache instance. The
override hierarchy is:
Parameters passed to an individual FCPath method such as FCPath.exists(), which override…
Parameters passed to the FCPath constructor or the FileCache.new_path() method, which override…
Parameters passed to the FileCache constructor for the associated FileCache instance, which override…
Default values for FileCache
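The override hierarchy above amounts to "the most specific value that was supplied wins". The sketch below expresses that rule with a hypothetical helper, resolve_option, and assumes that None means "not specified" at a given level; how the library represents unspecified values internally is not covered here.

```python
def resolve_option(method_value=None, path_value=None, cache_value=None,
                   default=None):
    """Hypothetical helper: return the most specific value that was given.

    Order of precedence: method call, FCPath, FileCache, built-in default.
    """
    for value in (method_value, path_value, cache_value):
        if value is not None:
            return value
    return default


# nthreads given only on the FileCache: that value is used
print(resolve_option(cache_value=4, default=8))   # 4
# nthreads given at every level: the method call wins
print(resolve_option(method_value=2, path_value=16, cache_value=4, default=8))  # 2
# nthreads given nowhere: the default applies
print(resolve_option(default=8))                  # 8
```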
Best Practices
Segment data by purpose, lifetime, and sharing. Create a separate cache for each logical grouping.
When creating a new FCPath, be sure to specify an appropriate filecache or use the FileCache.new_path() method. Otherwise it will always use the default global cache.
Do not use time_sensitive=True unless you expect the remote source to change during the program’s execution.
Use cache_metadata=True when possible, especially if you are iterating over the contents of a directory before downloading files.