Introduction

filecache is a Python module that abstracts away the location where files used or generated by a program are stored. Files can be on the local file system (/ or file://), in Google Cloud Storage (gs://), on Amazon Web Services S3 (s3://), or on a webserver (http:// or https://). There is also a fake remote that can be used to simulate a remote using the local filesystem as a file repository (fake://). The fundamental concept is that of an isolated cache defined by a FileCache instance.

Creating a FileCache Instance

There are two fundamental ways to use FileCache. The first is to create a FileCache instance directly. The second is to use FileCache as a context manager.

Example of direct creation:

from filecache import FileCache

fc = FileCache()
# perform operations on fc
# fc persists until it is explicitly deleted or the program exits

Example of context manager:

from filecache import FileCache

with FileCache() as fc:
    # perform operations on fc
# fc is deleted here and the cache may be cleaned up (see below)

FileCache Attributes

A FileCache instance contains a variety of attributes that specify the location and behavior of the cache. These attributes are described below.

Cache Location

The location of the cache on the local file system is specified by a combination of a base directory and a subdirectory name.

The Base Directory

By default, the base directory is the standard temporary directory for the operating system. This is typically C:\TEMP, C:\TMP, \TEMP, or \TMP on Windows and /tmp, /var/tmp, or /usr/tmp on other platforms. The choice of temporary directory can be overridden with the TMPDIR, TEMP, or TMP environment variables, but this will affect other modules that use the system temporary directory, such as tempfile. To specify a different default base directory for all FileCache instances without affecting the system temporary directory, set the FILECACHE_CACHE_ROOT environment variable. Finally, to provide a different base directory for a particular FileCache instance, pass the cache_root argument to the FileCache constructor. The cache_root argument can be a string or Path object.
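The precedence described above can be sketched in a few lines of standard-library Python. This is an illustration only, not the module's actual code; the helper name resolve_cache_root is hypothetical, and treating an empty FILECACHE_CACHE_ROOT as unset is an assumption.

```python
import os
import tempfile
from pathlib import Path


def resolve_cache_root(cache_root=None):
    """Hypothetical helper mirroring the documented precedence."""
    if cache_root is not None:
        # 1. An explicit cache_root argument takes priority
        return Path(cache_root)
    env_root = os.environ.get('FILECACHE_CACHE_ROOT')
    if env_root:
        # 2. Next, the FILECACHE_CACHE_ROOT environment variable
        return Path(env_root)
    # 3. Otherwise the system temporary directory, which itself
    #    honors TMPDIR, TEMP, and TMP
    return Path(tempfile.gettempdir())


print(resolve_cache_root())
print(resolve_cache_root('/data/my_cache_root'))
```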

The choice of base directory should be considered carefully. Depending on the operating system, the system temporary directory may be on its own filesystem, or be part of the root filesystem. Either way, the filesystem may not have enough free space to store the cache. In addition, on some operating systems the temporary directory is purged on reboot, making it unsuitable for long-term storage of cached files.

The Subdirectory Name

The subdirectory name is specified by the cache_name argument to the FileCache constructor. The default value of cache_name is "global", which will result in the subdirectory name _filecache_global. If a different name is specified, the subdirectory name will be _filecache_<cache_name>. Finally, if cache_name is None, the cache will be stored in a subdirectory with the prefix _filecache_ followed by a globally unique identifier.
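As a rough sketch of the naming rule, the mapping from cache_name to subdirectory name could look like the following. The helper cache_subdir_name is hypothetical, and the use of uuid4 for the unique identifier is an assumption; the text above only says the identifier is globally unique.

```python
import uuid


def cache_subdir_name(cache_name='global'):
    """Hypothetical helper: map a cache_name to its subdirectory name."""
    if cache_name is None:
        # Unnamed cache: a unique suffix, so the name cannot be guessed
        # or shared by another program (uuid4 is an assumption)
        return f'_filecache_{uuid.uuid4()}'
    return f'_filecache_{cache_name}'


print(cache_subdir_name())
print(cache_subdir_name('telescope_images'))
print(cache_subdir_name(None))
```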

The choice of cache name primarily affects sharing. If more than one program, or instance of the same program, is going to use the cached data (i.e. they download and use the same files from the same remote source), then they can share the same cache by using a common cache name. This increases performance and reduces disk space usage by avoiding duplicate downloads of the same files. The default "global" cache is a convenient place to store files that are needed by multiple programs if there is no need to otherwise segregate them. However, if a uniquely-named cache is used, then sharing is not possible because the cache name is not available to other programs.

Another consideration is the lifetime and maintenance of the cache. A cache can also be thought of as a grouping of files that share a purpose. For some caches, the purpose is long-term storage of unchanging remote data. For others, it is more ephemeral: files are downloaded for a single operation and are no longer needed once that operation is complete. See the Cache Lifetime section for details on automated cache maintenance. Manual maintenance of a cache is also possible once you know its base directory and subdirectory name. For example, you can clear the cache simply by deleting the cache directory. If caches are appropriately named and segregated, then you can clear the cache for one type of data without affecting other types.
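Manual clearing of one named cache can be sketched with the standard library alone. The helper clear_cache and the cache names "images" and "catalogs" are hypothetical; the demonstration runs against a throwaway base directory rather than a real cache, and it assumes no other process is using the cache being removed.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache(cache_root, cache_name):
    """Delete the directory of one named cache; return True if it existed."""
    cache_dir = Path(cache_root) / f'_filecache_{cache_name}'
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True


# Demonstration against a temporary base directory
demo_root = Path(tempfile.mkdtemp())
(demo_root / '_filecache_images').mkdir()
(demo_root / '_filecache_catalogs').mkdir()

cleared = clear_cache(demo_root, 'images')
remaining = sorted(p.name for p in demo_root.iterdir())
print(cleared, remaining)

shutil.rmtree(demo_root)  # remove the demonstration directory
```

Because each cache lives in its own subdirectory, removing one leaves the others untouched.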

Cache Lifetime

Caches are either permanent or ephemeral. A permanent cache will never be deleted by the FileCache code and will persist on the local disk until it is manually deleted (using rm or equivalent), explicitly deleted by the program (by calling the FileCache.delete_cache() method), or deleted by the operating system on reboot, as discussed above. In contrast, an ephemeral cache will be deleted when the FileCache instance is deleted, which is at the exit of a context manager, when the FileCache instance is deleted explicitly, or on program exit.

By default, named caches are permanent and unnamed caches (cache_name is None) are ephemeral. This can be overridden by passing the delete_on_exit argument to the FileCache constructor. If delete_on_exit is True, the cache will be deleted whenever the FileCache instance is deleted. Be careful setting this to True on a named cache, as it will delete that cache on program exit while other programs may still be using it. In general, it is recommended to avoid this argument except under very specific circumstances.
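The default rule can be written as a small decision function. This is a sketch of the documented behavior, not the module's code; the helper name is hypothetical, and using None to mean "delete_on_exit was not specified" is an assumption.

```python
def effective_delete_on_exit(cache_name, delete_on_exit=None):
    """Hypothetical helper: decide whether a cache is ephemeral."""
    if delete_on_exit is not None:
        # An explicit delete_on_exit always overrides the default
        return delete_on_exit
    # Default: unnamed caches are ephemeral, named caches are permanent
    return cache_name is None


for name, flag in [('global', None), (None, None),
                   ('global', True), (None, False)]:
    print(name, flag, effective_delete_on_exit(name, flag))
```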

Preserving File Timestamps

Files on the local filesystem usually have multiple timestamps associated with them, such as the modification time, access time, and creation time. However, files on remote sources may or may not have similar timestamps. By default, for efficiency and generality, timestamps are not preserved when downloading or uploading files. However, this can be changed by setting the time_sensitive argument to the FileCache constructor. If time_sensitive is True, then:

  • When a file is retrieved, the modification time from the source location, if available, is set on the local copy. If a local copy already exists, the times on both copies are compared and the local copy is updated if the source is newer.

  • When a file is uploaded, the modification timestamp on the remote copy is set to that of the local copy, if possible.

By default, when time_sensitive is True, the modification timestamps on both local and remote copies are queried each time they are needed. Thus, if the remote copy is changed by another program during the lifetime of the FileCache instance, when FileCache.retrieve() is called, the local copy will be updated with the contents of the new remote copy. However, if the remote copy is guaranteed not to change during the lifetime of the FileCache instance, then this extra network traffic can be avoided by setting the cache_metadata argument to True. This will cause the FileCache instance to cache the metadata (such as modification time, size, and is_dir) of remote files. Methods that iterate over the contents of a directory, such as FileCache.iterdir() and FileCache.iterdir_metadata(), will also populate the metadata cache, making future retrievals more efficient.
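The comparison made for time-sensitive retrieval can be illustrated with the standard library. The two helpers below are hypothetical and do not touch any remote source; the remote modification time is simply passed in as a UNIX timestamp. Keeping the local copy when the source reports no modification time is an assumption.

```python
import os
import tempfile
from pathlib import Path


def needs_refresh(local_path, remote_mtime):
    """Return True if the local copy is missing or older than the source."""
    local_path = Path(local_path)
    if not local_path.exists():
        return True
    if remote_mtime is None:
        # Assumption: with no source timestamp, keep the existing copy
        return False
    return remote_mtime > local_path.stat().st_mtime


def stamp_local_copy(local_path, remote_mtime):
    """Set the local access and modification times to match the source."""
    os.utime(local_path, (remote_mtime, remote_mtime))


with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / 'file.txt'
    local.write_text('cached contents')
    stamp_local_copy(local, 1_700_000_000)

    unchanged = needs_refresh(local, 1_700_000_000)     # same timestamp
    source_newer = needs_refresh(local, 1_700_000_500)  # source changed later
    missing = needs_refresh(Path(tmp) / 'other.txt', 1_700_000_000)

print(unchanged, source_newer, missing)
```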

Multi-Processor Safety

When more than one process is using the same cache, race conditions are possible if they attempt to retrieve the same file at the same time. By default, a FileCache instance will use multiprocessor-safe locking for ‘named’ (and thus possibly shared) caches to protect the cache from concurrent writes. If a named cache is guaranteed not to be shared between multiple processes, then this can be disabled by passing the mp_safe argument to the FileCache constructor. If mp_safe is False, then the FileCache instance will not use multiprocessor-safe locking, thus improving performance. On the other hand, if mp_safe is True, then the FileCache instance will always use multiprocessor-safe locking, even for unnamed caches.

When multiprocessor-safe locking is used and multiple processes are attempting to retrieve the same file at the same time, the first process to lock the file will be able to retrieve it, while the other processes will wait for the lock to be released. The lock is released when the file is retrieved successfully or when an error occurs. If the retrieving process crashes or is killed and the lock is not successfully released, then it is possible for the waiting processes to wait forever. To avoid this, a timeout can be specified by passing the lock_timeout argument to the FileCache constructor or to the individual method. However, caution should be exercised because if the timeout is too short and file retrieval is slow, then the waiting processes may time out and raise a TimeoutError prematurely.
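The wait-with-timeout behavior can be illustrated with a simple lock file. This is a generic sketch of the idea, not a description of how FileCache implements its locking; acquire_lock, release_lock, and the poll_interval parameter are hypothetical names.

```python
import os
import tempfile
import time
from pathlib import Path


def acquire_lock(lock_path, timeout=None, poll_interval=0.05):
    """Create lock_path exclusively, waiting up to timeout seconds."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the lock file already exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            if deadline is not None and time.monotonic() >= deadline:
                raise TimeoutError(f'Could not lock {lock_path}')
            time.sleep(poll_interval)


def release_lock(lock_path):
    """Remove the lock file so that a waiting process can proceed."""
    os.unlink(lock_path)


with tempfile.TemporaryDirectory() as tmp:
    lock = Path(tmp) / 'file.txt.lock'
    acquire_lock(lock)  # the first holder gets the lock immediately
    try:
        acquire_lock(lock, timeout=0.2)  # a second attempt has to wait
        timed_out = False
    except TimeoutError:
        timed_out = True
    release_lock(lock)
    acquire_lock(lock, timeout=0.2)  # succeeds once the lock is released
    release_lock(lock)

print(timed_out)
```

With timeout=None the second attempt would wait indefinitely, which is the failure mode described above when the lock holder crashes without releasing the lock.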

Parallelism

Many operations in FileCache are performed in parallel using multiple threads when multiple URLs are passed in a single call. These include FileCache.retrieve(), FileCache.upload(), FileCache.exists(), FileCache.is_dir(), FileCache.modification_time(), and FileCache.unlink(). The maximum number of threads can be specified by passing the nthreads argument to the FileCache constructor or to the individual method. The default value is 8. The useful degree of parallelism is limited by the number of files being processed, the number of CPUs, and the speed of the source or network. Increasing the number of threads beyond what the system can handle may reduce performance.
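The general pattern of running one operation per URL on a bounded thread pool, while returning results in input order, can be sketched with concurrent.futures. This is not the module's implementation; check_all and fake_exists are hypothetical stand-ins, and no network access takes place.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, check_one, nthreads=8):
    """Apply check_one to each URL using up to nthreads worker threads."""
    if not urls:
        return []
    # There is no benefit in starting more threads than there are URLs
    workers = max(1, min(nthreads, len(urls)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, whatever the completion order
        return list(pool.map(check_one, urls))


def fake_exists(url):
    """Stand-in for a real per-URL operation."""
    return url.endswith('.txt')


results = check_all(['https://data.com/data/file1.txt',
                     'gs://my-bucket/data/file2.bin',
                     's3://my-bucket/data/file3.txt'],
                    fake_exists, nthreads=2)
print(results)
```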

URL Translation

For advanced users, two mapping systems are provided. The first translates URLs into URLs, and the second translates URLs into local paths.

URL to URL Translation

FileCache is particularly useful when the same data may be available from multiple sources. For example, a dataset may be available for direct download from a website as well as from a cloud storage bucket. Depending on the particular situation, it may be desirable to choose one source over the other. Assuming the data is laid out identically, all that is required is to change the prefix to the URL, for example from https://data.com/data to gs://my-bucket/data, and all operations will work as expected. This change can be done using standard Python string manipulation functions. For example:

import os

data_source = os.environ.get('DATA_SOURCE', 'https://data.com/data')
url = f'{data_source}/file.txt'  # e.g. https://data.com/data/file.txt

However, if the data is laid out differently, additional logic is required to determine the correct URL. The FileCache constructor (and each method) accepts a list of functions that are used to translate URLs into URLs. Each function takes three arguments: the scheme, the remote, and the path. The functions are called in order until one returns a URL; if none does, the default is used, which leaves the URL unchanged. This allows the programmer to write code for one particular data source, and then use the same code for other data sources with different layouts.
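A translator chain of this kind might be sketched as follows. The function names, the bucket layout, and the convention that a translator returns None when it does not handle a URL are all assumptions made for illustration, as is the assumption that the path argument has no leading slash. The translate helper only imitates the calling order described above.

```python
def website_to_bucket(scheme, remote, path):
    """Hypothetical translator: map a website layout onto a bucket layout."""
    if scheme in ('http', 'https') and remote == 'data.com':
        prefix = 'data/'
        if path.startswith(prefix):
            return f'gs://my-bucket/archive/{path[len(prefix):]}'
    return None  # not handled here; let the next translator try


def translate(translators, scheme, remote, path):
    """Call each translator in order; fall back to the unchanged URL."""
    for func in translators:
        new_url = func(scheme, remote, path)
        if new_url is not None:
            return new_url
    return f'{scheme}://{remote}/{path}'


translated = translate([website_to_bucket],
                       'https', 'data.com', 'data/file.txt')
untouched = translate([website_to_bucket],
                      'https', 'other.org', 'data/file.txt')
print(translated)
print(untouched)
```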

See Example 7: URL to URL Translation for a usage example.

URL to Path Translation

Normally, the local directory structure for a cache mirrors that of the remote source. For example, in the cache named "global", a file retrieved from https://data.com/data/file11.txt will be stored in <cache_root>/_filecache_global/http_data_com/data/file11.txt. The exact layout of the cache is normally irrelevant, as FileCache methods automatically translate URLs into local paths. In rare cases, it may be desirable to store the data in a different hierarchy. For example, this could be useful if the user is expected to manually inspect the cache directory, but the layout of the remote source is confusing for some reason. In this case, a mapping function can be used to translate the URL into a local path with a more user-friendly format.

Each user-provided translator function takes five arguments: the scheme, the remote, the path, the cache directory, and the cache subdirectory. It returns a string or Path object giving the new absolute path of the cached file, or None if no translation is desired. This translation is performed on the original URL, not the URL generated by a URL to URL translator, if any.

The default translator function is:

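from pathlib import Path

def default_url_to_path(scheme, remote, path, cache_dir, cache_subdir):
    # Local files are returned as-is rather than mapped into the cache
    if scheme == 'file':
        return Path(path)
    # Everything else is placed under the cache, mirroring the source layout
    return cache_dir / cache_subdir / path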
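A custom translator has the same five-argument signature as the default shown above. The sketch below flattens the remote directory hierarchy into a single directory; the function name, the data_com_flat directory, and the placeholder value passed for cache_subdir are hypothetical, and the path argument is assumed to be a plain string.

```python
import tempfile
from pathlib import Path


def flatten_data_com(scheme, remote, path, cache_dir, cache_subdir):
    """Hypothetical translator: store data.com files in one flat directory."""
    if remote != 'data.com':
        return None  # no translation desired; the default layout is used
    # Turn 'data/2024/file11.txt' into 'data__2024__file11.txt'
    flat_name = path.strip('/').replace('/', '__')
    return Path(cache_dir) / 'data_com_flat' / flat_name


# Illustrative values only; a real call would be made by FileCache itself
demo_cache_dir = Path(tempfile.gettempdir()) / '_filecache_global'
new_path = flatten_data_com('https', 'data.com', 'data/2024/file11.txt',
                            demo_cache_dir, 'source_subdir')
other = flatten_data_com('gs', 'my-bucket', 'data/file2.txt',
                         demo_cache_dir, 'source_subdir')
print(new_path)
print(other)
```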

See Example 8: URL to Path Translation for a usage example.

Basic FileCache Operations

Following are the basic operations that can be performed on a FileCache instance. Other properties and methods are available. See filecache Module for more details.

Check if a File Exists

The FileCache.exists() method is used to check if a file exists in the cache or on the remote source. It takes a URL as an argument and returns True if the file exists, False otherwise.

exists = fc.exists('https://data.com/data/file.txt')

Multiple URLs, including from different sources, can be checked at the same time by passing a list of URLs:

exists = fc.exists(['https://data.com/data/file1.txt',
                    'gs://my-bucket/data/file2.txt'])

In this case, the returned list of booleans will be in the same order as the input list of URLs.

If any kind of exception is encountered, the return value for that file will be False. In this way, it is more accurate to say that the operation checks for the accessibility of a file rather than its existence. For example, if the user does not have access to a remote source, a file there will be reported as non-existent even if it actually exists.

By default, FileCache.exists() first checks the local cache for the file. If the file is found there, it is reported as existing. If the file is not found there, it is checked on the remote source. If the local cache and remote source are out of sync (for example the remote copy of the file was deleted outside of the FileCache ecosystem), then the file could be reported as existing even though it is only the cached version that exists. The bypass_cache argument can be used to override this behavior and check the remote source directly without checking the local cache.
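The lookup order and the treatment of exceptions can be modeled without any network access. In this sketch the local cache is a plain set of URLs and the remote query is a stand-in function; exists_sketch and fake_remote_check are hypothetical names and are not part of the module.

```python
def exists_sketch(url, cached_urls, remote_check, bypass_cache=False):
    """Model of the documented lookup order for a single URL."""
    if not bypass_cache and url in cached_urls:
        return True  # a cached copy counts as existing
    try:
        return bool(remote_check(url))
    except Exception:
        # Any failure (permissions, network, ...) is reported as False
        return False


def fake_remote_check(url):
    """Stand-in for a real remote query."""
    if 'private' in url:
        raise PermissionError('access denied')
    return url.endswith('present.txt')


cached = {'https://data.com/data/stale.txt'}
stale = 'https://data.com/data/stale.txt'

stale_default = exists_sketch(stale, cached, fake_remote_check)
stale_bypass = exists_sketch(stale, cached, fake_remote_check,
                             bypass_cache=True)
private = exists_sketch('gs://my-bucket/private/file.txt', cached,
                        fake_remote_check)
print(stale_default, stale_bypass, private)
```

The file that survives only in the cache is reported as existing by default and as missing when the cache is bypassed, while the inaccessible file is reported as missing either way.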

Retrieve a File

The FileCache.retrieve() method is used to retrieve a file from a remote source. It takes a URL as an argument and returns a local path to the downloaded file.

path = fc.retrieve('https://data.com/data/file.txt')

Multiple URLs, including from different sources, can be retrieved at the same time by passing a list of URLs:

paths = fc.retrieve(['https://data.com/data/file1.txt',
                     'gs://my-bucket/data/file2.txt'])

In this case, the returned list of paths will be in the same order as the input list of URLs.

By default, if downloading any of the files results in a failure of some kind, the entire operation will fail and an appropriate exception will be raised (when retrieving multiple files, some, but not all, of the files may have been successfully downloaded before the exception was encountered). This can be overridden by setting the exception_on_fail argument to False. In this case, the returned list of paths will contain a mixture of Path objects and Exception objects, where each Exception object reports the reason for the failure of that file.
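When exception_on_fail is False, the caller has to separate successes from failures. A minimal way to do that is shown below; split_results is a hypothetical helper, and the results list is a hand-built stand-in for what a multi-URL retrieval might return.

```python
from pathlib import Path


def split_results(urls, results):
    """Separate successful local paths from per-file failures."""
    succeeded = {}
    failed = {}
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            failed[url] = result
        else:
            succeeded[url] = result
    return succeeded, failed


urls = ['https://data.com/data/file1.txt',
        'gs://my-bucket/data/file2.txt']
# Stand-in for the mixed list of Path and Exception objects
results = [Path('cache') / 'file1.txt', FileNotFoundError('file2.txt')]

ok, bad = split_results(urls, results)
for url, exc in bad.items():
    print(f'{url} failed: {exc!r}')
```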

Files can also be retrieved and opened in a single operation by using FileCache.open() as a context manager.

with fc.open('https://data.com/data/file.txt') as f:
    content = f.read()

Get the Modification Time of a File

The FileCache.modification_time() method is used to get the modification time of a file. It takes a URL as an argument and returns the modification time of the file on the remote source as a UNIX timestamp. The local cache version of the file is ignored.

mtime = fc.modification_time('https://data.com/data/file.txt')

If the file does not exist or cannot be accessed, an exception is raised by default. When called with exception_on_fail=False, those failures are returned as Exception instances instead of being raised. If the remote exists but no modification time is available, None is returned. Multiple URLs, including from different sources, can be checked at the same time by passing a list of URLs:

mtimes = fc.modification_time(['https://data.com/data/file1.txt',
                               'gs://my-bucket/data/file2.txt'])

In this case, the returned list of modification times will be in the same order as the input list of URLs.
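Each entry in the returned list can therefore be a UNIX timestamp, None, or (with exception_on_fail=False) an Exception. The hypothetical helper below handles all three cases and converts timestamps to UTC datetimes; the input list is a hand-built stand-in, not real output.

```python
from datetime import datetime, timezone


def describe_mtime(mtime):
    """Turn one modification_time result into a readable string."""
    if isinstance(mtime, Exception):
        return f'unavailable ({type(mtime).__name__})'
    if mtime is None:
        return 'no modification time reported'
    # UNIX timestamps count seconds since 1970-01-01 00:00:00 UTC
    return datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()


# Stand-in for a list returned with exception_on_fail=False
mtimes = [86400, None, FileNotFoundError('file3.txt')]
descriptions = [describe_mtime(m) for m in mtimes]
for line in descriptions:
    print(line)
```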

If cache_metadata is True for the FileCache instance, then the modification time is retrieved from the metadata cache if possible. This could result in an erroneous result if the remote file changed during the execution of the program and lifetime of the cache. The remote source can be queried directly by passing bypass_cache=True to the FileCache.modification_time() method.

Upload a File

The FileCache.upload() method is used to upload a local copy of a file to a remote source. It takes a URL as an argument and returns the local path of the file that was uploaded.

path = fc.upload('https://data.com/data/file.txt')

Multiple files, including for different sources, can be uploaded at the same time by passing a list of URLs:

paths = fc.upload(['https://data.com/data/file1.txt',
                   'gs://my-bucket/data/file2.txt'])

In this case, the returned list of paths will be in the same order as the input list of URLs.

By default, if uploading any of the files results in a failure of some kind, the entire operation will fail and an appropriate exception will be raised (when uploading multiple files, some, but not all, of the files may have been successfully uploaded before the exception was encountered). This can be overridden by setting the exception_on_fail argument to False. In this case, the returned list of paths will contain a mixture of Path objects and Exception objects, where each Exception object reports the reason for the failure of that file.

Files can also be opened, written, and uploaded in a single operation by using FileCache.open() as a context manager.

with fc.open('https://data.com/data/file.txt', 'w') as f:
    f.write('Hello, World!')
# File is automatically uploaded when the context manager for the file handle exits

Delete a File

The FileCache.unlink() method is used to delete a file from the cache and the remote source. It takes a URL as an argument and returns the local path of the file that was deleted.

path = fc.unlink('https://data.com/data/file.txt')

Multiple files, including from different sources, can be deleted at the same time by passing a list of URLs:

paths = fc.unlink(['https://data.com/data/file1.txt',
                   'gs://my-bucket/data/file2.txt'])

In this case, the returned list of paths will be in the same order as the input list of URLs.

By default, if deleting any of the files results in a failure of some kind, the entire operation will fail and an appropriate exception will be raised (when deleting multiple files, some, but not all, of the files may have been successfully deleted before the exception was encountered). This can be overridden by setting the exception_on_fail argument to False. In this case, the returned list of paths will contain a mixture of Path objects and Exception objects, where each Exception object reports the reason for the failure of that file.

FCPath

While the FileCache class provides direct oversight of files in a cache along with the necessary methods to manipulate them, it is often more convenient to operate on URLs and local files using the simpler syntax provided by the Python pathlib.Path class. The FCPath class is a reimplementation of the Path class to support remote access using an associated FileCache. It supports all of the common path operations provided by Path, as well as the ability to operate on both URLs and local files.

Every FCPath instance must be associated with a FileCache instance. An FCPath instance can be created by using the FileCache.new_path() method, in which case the association is automatic. Alternatively, an FCPath instance can be created by calling the FCPath constructor directly. A specific FileCache instance can be specified by passing it to the FCPath constructor using the filecache keyword argument. If none is specified, a FileCache instance with default parameters is used, which is therefore named "global", is permanent, and is not time-sensitive.

Other parameters are available for the FCPath constructor or the FileCache.new_path() method (such as nthreads and url_to_url) that will override the standard values defined in the associated FileCache instance.

When FCPath instances are combined by using the / operator, the FileCache association is inherited from the left-hand side FCPath instance.

For example:

root = FCPath('https://data.com')  # Defaults to the "global" cache
path = root / 'data' / 'file.txt'
# The path is 'https://data.com/data/file.txt' in the "global" cache

with FileCache(None, time_sensitive=True) as fc:
    root = FCPath('https://data.com', filecache=fc)
    path = root / 'data' / 'file.txt'
    # The path is 'https://data.com/data/file.txt' in the specified cache

Passing an FCPath to the FCPath constructor will create a new FCPath instance with the same FileCache association. This means that if you write a function that takes a string, Path, or FCPath, you can effectively cast any input into an FCPath by calling the constructor, without risk of losing the FileCache association:

def my_function(path: str | Path | FCPath):
    fc_path = FCPath(path)
    # fc_path is now an FCPath instance with the same FileCache
    # association as the input, if available, otherwise "global"

All FileCache methods can be called on FCPath instances. For example:

fc = FileCache(None)
fc_path = FCPath('https://data.com/data/file.txt', filecache=fc)
exists = fc_path.exists()

# is equivalent to...
exists = fc.exists('https://data.com/data/file.txt')

All FCPath methods that are versions of FileCache methods take an optional sub_path argument that is a path relative to the FCPath instance. This allows a single FCPath instance to be used to reference a parent directory, and then the sub-path can be used for specific files within that directory. This is effectively shorthand for the / operator. For example:

root = FCPath('https://data.com/data')
exists = root.exists('file.txt')

# is equivalent to...
fc_path = FCPath('https://data.com/data/file.txt')
exists = fc_path.exists()

Other parameters are also available (such as nthreads and url_to_url) that will override the standard values defined in the associated FileCache instance. The override hierarchy is:

  1. Parameters passed to an individual FCPath method such as FCPath.exists(), which override…

  2. Parameters passed to the FCPath constructor or the FileCache.new_path() method, which override…

  3. Parameters passed to the FileCache constructor for the associated FileCache instance, which override…

  4. Default values for FileCache
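The override hierarchy amounts to taking the first value that was actually specified, from most specific to least specific. The sketch below models that rule; resolve_parameter is a hypothetical helper, and using None to mean "not specified at this level" is an assumption.

```python
def resolve_parameter(method_value=None, path_value=None,
                      cache_value=None, default=None):
    """Return the most specific value that was actually provided."""
    # Order: method call, then FCPath, then FileCache, then the default
    for value in (method_value, path_value, cache_value):
        if value is not None:
            return value
    return default


# nthreads given only to the FileCache constructor
print(resolve_parameter(cache_value=4, default=8))
# nthreads given to the FCPath and overridden again in the method call
print(resolve_parameter(method_value=2, path_value=16,
                        cache_value=4, default=8))
```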

Best Practices

  • Segment data by purpose, lifetime, and sharing. Create a separate cache for each logical grouping.

  • When creating a new FCPath, be sure to specify an appropriate filecache or use the FileCache.new_path() method. Otherwise it will always use the default global cache.

  • Do not use time_sensitive=True unless you expect the remote source to change during the program’s execution.

  • Use cache_metadata=True when possible, especially if you are iterating over the contents of a directory before downloading files.