Introduction
============

``filecache`` is a Python module that abstracts away the location where files used or
generated by a program are stored. Files can be on the local file system (``/`` or
``file://``), in Google Cloud Storage (``gs://``), on Amazon Web Services S3
(``s3://``), or on a webserver (``http://`` or ``https://``). There is also a fake
remote that can be used to simulate a remote using the local filesystem as a file
repository (``fake://``).

The fundamental concept is that of an isolated cache defined by a :class:`FileCache`
instance.

Creating a FileCache Instance
*****************************

There are two fundamental ways to use :class:`FileCache`. The first is to create a
:class:`FileCache` instance directly. The second is to use :class:`FileCache` as a
context manager.

Example of direct creation:

.. code-block:: python

    from filecache import FileCache

    fc = FileCache()
    # perform operations on fc
    # fc lives forever

Example of context manager:

.. code-block:: python

    from filecache import FileCache

    with FileCache() as fc:
        ...  # perform operations on fc
    # fc is deleted here and the cache may be cleaned up (see below)

FileCache Attributes
********************

A :class:`FileCache` instance contains a variety of attributes that specify the
location and behavior of the cache. These attributes are described below.

.. _cache_location:

Cache Location
--------------

The location of the cache on the local file system is specified by a combination of a
base directory and a subdirectory name.

The Base Directory
^^^^^^^^^^^^^^^^^^

By default, the base directory is the standard temporary directory for the operating
system. This is typically ``C:\TEMP``, ``C:\TMP``, ``\TEMP``, or ``\TMP`` on Windows
and ``/tmp``, ``/var/tmp``, or ``/usr/tmp`` on other platforms.
The choice of temporary directory can be overridden with the ``TMPDIR``, ``TEMP``, or
``TMP`` environment variables, but this will affect other modules that use the system
temporary directory, such as ``tempfile``. To specify a different default base
directory for all :class:`FileCache` instances without affecting the system temporary
directory, set the ``FILECACHE_CACHE_ROOT`` environment variable. Finally, to provide
a different base directory for a particular :class:`FileCache` instance, pass the
``cache_root`` argument to the :class:`FileCache` constructor. The ``cache_root``
argument can be a string or `Path` object.

The choice of base directory should be considered carefully. Depending on the
operating system, the system temporary directory may be on its own filesystem, or be
part of the root filesystem. Either way, the filesystem may not have enough free space
to store the cache. In addition, on some operating systems the temporary directory is
purged on reboot, making it unsuitable for long-term storage of cached files.

The Subdirectory Name
^^^^^^^^^^^^^^^^^^^^^

The subdirectory name is specified by the ``cache_name`` argument to the
:class:`FileCache` constructor. The default value of ``cache_name`` is ``"global"``,
which will result in the subdirectory name ``_filecache_global``. If a different name
is specified, the subdirectory name will be ``_filecache_`` followed by that name.
Finally, if ``cache_name`` is ``None``, the cache will be stored in a subdirectory
with the prefix ``_filecache_`` followed by a globally unique identifier.

The choice of cache name primarily affects sharing. If more than one program, or
instance of the same program, is going to use the cached data (i.e. they download and
use the same files from the same remote source), then they can share the same cache by
using a common cache name. This increases performance and reduces disk space usage by
avoiding duplicate downloads of the same files.
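
The base directory and subdirectory rules described above can be summarized in a short
sketch. The helper ``sketch_cache_dir`` is hypothetical and is not part of
``filecache``; it only mirrors the documented precedence (``cache_root`` argument, then
``FILECACHE_CACHE_ROOT``, then the system temporary directory). The format of the
unique identifier used for unnamed caches is an assumption.

.. code-block:: python

    import os
    import tempfile
    import uuid
    from pathlib import Path


    def sketch_cache_dir(cache_root=None, cache_name="global"):
        """Illustrative only: approximate where a cache directory would live."""
        if cache_root is not None:
            # An explicit constructor argument takes priority
            base = Path(cache_root)
        elif os.environ.get("FILECACHE_CACHE_ROOT"):
            # Otherwise fall back to the module-specific environment variable
            base = Path(os.environ["FILECACHE_CACHE_ROOT"])
        else:
            # Otherwise use the system temporary directory
            base = Path(tempfile.gettempdir())
        if cache_name is None:
            # Assumption: any globally unique string serves for the illustration
            suffix = uuid.uuid4().hex
        else:
            suffix = cache_name
        return base / f"_filecache_{suffix}"

Two programs that call this helper with the same root and name arrive at the same
directory, which is what makes sharing a named cache possible.
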
The default ``"global"`` cache is a convenient place to store files that are needed by
multiple programs if there is no need to otherwise segregate them. However, if a
uniquely-named cache is used, then sharing is not possible because the cache name is
not available to other programs.

Another consideration is the lifetime and maintenance of the cache. In addition to
everything else, a cache can be considered a grouping of files that all have the same
basic purpose. For some, the purpose may be long-term storage of unchanging remote
data. For others, the purpose may be more ephemeral, downloading files for a single
operation and then no longer needing them after the operation is complete; see the
section on `cache_lifetime`_ for details on automated cache maintenance.

Manual maintenance of a cache is also possible once you know its base directory and
subdirectory name. For example, you can clear the cache simply by deleting the cache
directory. If caches are appropriately named and segregated, then you can clear the
cache for one type of data without affecting other types.

.. _cache_lifetime:

Cache Lifetime
--------------

Caches are either permanent or ephemeral. A permanent cache will never be deleted by
the :class:`FileCache` code and will persist on the local disk until it is manually
deleted (using ``rm`` or equivalent), explicitly deleted by the program (by calling
the :meth:`FileCache.delete_cache` method), or deleted by the operating system on
reboot, as discussed above. In contrast, an ephemeral cache will be deleted when the
:class:`FileCache` instance is deleted, which is at the exit of a context manager,
when the :class:`FileCache` instance is deleted explicitly, or on program exit.

By default, named caches are permanent and unnamed caches (``cache_name`` is ``None``)
are ephemeral. This can be overridden by passing the ``delete_on_exit`` argument to
the :class:`FileCache` constructor.
If ``delete_on_exit`` is ``True``, the cache will be deleted whenever the
:class:`FileCache` instance is deleted. Be careful setting this to ``True`` on a named
cache, as it will delete that cache on program exit while other programs may still be
using it. In general, it is recommended to avoid this argument except under very
specific circumstances.

.. _cache_time:

Preserving File Timestamps
--------------------------

Files on the local filesystem usually have multiple timestamps associated with them,
such as the modification time, access time, and creation time. However, files on
remote sources may or may not have similar timestamps. By default, for efficiency and
generality, timestamps are not preserved when downloading or uploading files. However,
this can be changed by passing the ``time_sensitive`` argument to the
:class:`FileCache` constructor.

If ``time_sensitive`` is ``True``, then:

- When a file is retrieved, the modification time from the source location, if
  available, is set on the local copy. If a local copy already exists, the times on
  both copies are compared and the local copy is updated if the source is newer.
- When a file is uploaded, the modification timestamp on the remote copy is set to
  that of the local copy, if possible.

By default, when ``time_sensitive`` is ``True``, the modification timestamps on both
local and remote copies are queried each time they are needed. Thus, if the remote
copy is changed by another program during the lifetime of the :class:`FileCache`
instance, when :meth:`FileCache.retrieve` is called, the local copy will be updated
with the contents of the new remote copy. However, if the remote copy is guaranteed
not to change during the lifetime of the :class:`FileCache` instance, then this extra
network traffic can be avoided by setting the ``cache_metadata`` argument to ``True``.
This will cause the :class:`FileCache` instance to cache the metadata (such as
modification time, size, and ``is_dir``) of remote files.
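
The retrieval rule for ``time_sensitive=True`` described above amounts to a comparison
of modification times. The following is a minimal sketch of that decision, not the
library's implementation; the function name ``should_update_local`` is hypothetical,
and the choice to keep the local copy when the source reports no timestamp is an
assumption.

.. code-block:: python

    from typing import Optional


    def should_update_local(local_mtime: Optional[float],
                            remote_mtime: Optional[float]) -> bool:
        """Illustrative only: decide whether the local copy should be replaced."""
        if local_mtime is None:
            # No local copy exists yet, so the file must be downloaded
            return True
        if remote_mtime is None:
            # Assumption: without a source timestamp, keep the existing local copy
            return False
        # Replace the local copy only when the source is strictly newer
        return remote_mtime > local_mtime

With ``cache_metadata=True``, the remote timestamp fed into such a comparison would
come from the metadata cache rather than from a fresh query of the remote source.
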
Methods that iterate over the contents of a directory, such as
:meth:`FileCache.iterdir` and :meth:`FileCache.iterdir_metadata`, will also populate
the metadata cache, making future retrievals more efficient.

.. _cache_mp:

Multi-Processor Safety
----------------------

When more than one process is using the same cache, race conditions are possible if
they attempt to retrieve the same file at the same time. By default, a
:class:`FileCache` instance will use multiprocessor-safe locking for named (and thus
possibly shared) caches to protect the cache from concurrent writes. If a named cache
is guaranteed not to be shared between multiple processes, then this can be disabled
by passing the ``mp_safe`` argument to the :class:`FileCache` constructor. If
``mp_safe`` is ``False``, then the :class:`FileCache` instance will not use
multiprocessor-safe locking, thus improving performance. On the other hand, if
``mp_safe`` is ``True``, then the :class:`FileCache` instance will always use
multiprocessor-safe locking, even for unnamed caches.

When multiprocessor-safe locking is used and multiple processes are attempting to
retrieve the same file at the same time, the first process to lock the file will be
able to retrieve it, while the other processes will wait for the lock to be released.
The lock is released when the file is retrieved successfully or when an error occurs.
If the retrieving process crashes or is killed and the lock is not successfully
released, then it is possible for the waiting processes to wait forever. To avoid
this, a timeout can be specified by passing the ``lock_timeout`` argument to the
:class:`FileCache` constructor or to the individual method. However, caution should be
exercised because if the timeout is too short and file retrieval is slow, then the
waiting processes may time out and raise a ``TimeoutError`` prematurely.

.. _cache_parallelism:

Parallelism
-----------

Many operations in :class:`FileCache` are performed in parallel using multiple threads
if all URLs are provided at the same time. These include :meth:`FileCache.retrieve`,
:meth:`FileCache.upload`, :meth:`FileCache.exists`, :meth:`FileCache.is_dir`,
:meth:`FileCache.modification_time`, and :meth:`FileCache.unlink`. The number of
threads to use can be specified by passing the ``nthreads`` argument to the
:class:`FileCache` constructor or to the individual method. The default value is 8.
The useful number of threads is limited by the number of files being retrieved, the
number of CPUs, and the speed of the source or network. Increasing the number of
threads beyond what the system can handle may reduce performance.

.. _cache_url_translation:

URL Translation
---------------

For advanced users, two mapping systems are provided. The first translates URLs into
URLs, and the second translates URLs into local paths.

URL to URL Translation
^^^^^^^^^^^^^^^^^^^^^^

:class:`FileCache` is particularly useful when the same data may be available from
multiple sources. For example, a dataset may be available for direct download from a
website as well as from a cloud storage bucket. Depending on the particular situation,
it may be desirable to choose one source over the other. Assuming the data is laid out
identically, all that is required is to change the prefix of the URL, for example from
``https://data.com/data`` to ``gs://my-bucket/data``, and all operations will work as
expected. This change can be done using standard Python string manipulation functions.
For example:

.. code-block:: python

    import os

    # The prefix already ends in /data, so only the file name is appended
    data_source = os.environ.get('DATA_SOURCE', 'https://data.com/data')
    url = f'{data_source}/file.txt'

However, if the data is laid out differently, additional logic is required to
determine the correct URL. The :class:`FileCache` constructor (and each method)
accepts a list of functions that are used to translate URLs into URLs.
Each function takes three arguments: the scheme, the remote, and the path. The
functions are called in order until one returns a URL, or it falls through to the
default (which does nothing). This allows the programmer to write code for one
particular data source, and then use the same code for other data sources with
different layouts.

See :ref:`example_url_to_url` for a usage example.

URL to Path Translation
^^^^^^^^^^^^^^^^^^^^^^^

Normally, the local directory structure for a cache mirrors that of the remote source.
For example, in the cache named ``"global"``, a file retrieved from
``https://data.com/data/file11.txt`` will be stored in
``/_filecache_global/http_data_com/data/file11.txt``. The exact layout of the cache is
normally irrelevant, as :class:`FileCache` methods automatically translate URLs into
local paths.

In obscure cases, it may be desirable to store the data in a different hierarchy. For
example, this could be useful if the user is expected to manually inspect the cache
directory, but the layout of the remote source is confusing for some reason. In this
case, a mapping function can be used to translate the URL into a local path with a
more user-friendly format. Each user-provided translator function takes five
arguments: the scheme, the remote, the path, the cache directory, and the cache
subdirectory. It returns a string or `Path` object giving the new absolute path of the
cached file, or ``None`` if no translation is desired. This translation is performed
on the original URL, not the URL generated by a URL to URL translator, if any. The
default translator function is:

.. code-block:: python

    from pathlib import Path

    def default_url_to_path(scheme, remote, path, cache_dir, cache_subdir):
        if scheme == 'file':
            return Path(path)
        return cache_dir / cache_subdir / path

See :ref:`example_url_to_path` for a usage example.

Basic FileCache Operations
**************************

Following are the basic operations that can be performed on a :class:`FileCache`
instance.
Other properties and methods are available. See :ref:`module_file_cache` for more
details.

Check if a File Exists
----------------------

The :meth:`FileCache.exists` method is used to check if a file exists in the cache or
on the remote source. It takes a URL as an argument and returns ``True`` if the file
exists, ``False`` otherwise.

.. code-block:: python

    exists = fc.exists('https://data.com/data/file.txt')

Multiple URLs, including from different sources, can be checked at the same time by
passing a list of URLs:

.. code-block:: python

    exists = fc.exists(['https://data.com/data/file1.txt',
                        'gs://my-bucket/data/file2.txt'])

In this case, the returned list of booleans will be in the same order as the input
list of URLs. If any kind of exception is encountered, the return value for that file
will be ``False``. In this way, it is more accurate to say that the operation checks
for the accessibility of a file rather than its existence. For example, if the user
does not have access to a remote source, a file there will be reported as non-existent
even if it actually exists.

By default, :meth:`FileCache.exists` first checks the local cache for the file. If the
file is found there, it is reported as existing. If the file is not found there, it is
checked on the remote source. If the local cache and remote source are out of sync
(for example the remote copy of the file was deleted outside of the :class:`FileCache`
ecosystem), then the file could be reported as existing even though it is only the
cached version that exists. The ``bypass_cache`` argument can be used to override this
behavior and check the remote source directly without checking the local cache.

Retrieve a File
---------------

The :meth:`FileCache.retrieve` method is used to retrieve a file from a remote source.
It takes a URL as an argument and returns a local path to the downloaded file.

.. code-block:: python

    path = fc.retrieve('https://data.com/data/file.txt')

Multiple URLs, including from different sources, can be retrieved at the same time by
passing a list of URLs:

.. code-block:: python

    paths = fc.retrieve(['https://data.com/data/file1.txt',
                         'gs://my-bucket/data/file2.txt'])

In this case, the returned list of paths will be in the same order as the input list
of URLs. By default, if downloading any of the files results in a failure of some
kind, the entire operation will fail and an appropriate exception will be raised (when
retrieving multiple files, some, but not all, of the files may have been successfully
downloaded before the exception was encountered). This can be overridden by setting
the ``exception_on_fail`` argument to ``False``. In this case, the returned list of
paths will contain a mixture of `Path` objects and `Exception` objects, where each
`Exception` object reports the reason for the failure of that file.

Files can also be retrieved and opened in a single operation by using
:meth:`FileCache.open` as a context manager.

.. code-block:: python

    with fc.open('https://data.com/data/file.txt') as f:
        content = f.read()

Get the Modification Time of a File
-----------------------------------

The :meth:`FileCache.modification_time` method is used to get the modification time of
a file. It takes a URL as an argument and returns the modification time of the file
*on the remote source* as a UNIX timestamp. The local cache version of the file is
ignored.

.. code-block:: python

    mtime = fc.modification_time('https://data.com/data/file.txt')

If the file does not exist or cannot be accessed, an exception is raised by default.
When called with ``exception_on_fail=False``, those failures are returned as
:class:`Exception` instances instead of being raised. If the remote exists but no
modification time is available, ``None`` is returned.

Multiple URLs, including from different sources, can be checked at the same time by
passing a list of URLs:

.. code-block:: python

    mtimes = fc.modification_time(['https://data.com/data/file1.txt',
                                   'gs://my-bucket/data/file2.txt'])

In this case, the returned list of modification times will be in the same order as the
input list of URLs.

If ``cache_metadata`` is ``True`` for the :class:`FileCache` instance, then the
modification time is retrieved from the metadata cache if possible. This could result
in an erroneous result if the remote file changed during the execution of the program
and lifetime of the cache. The remote source can be queried directly by passing
``bypass_cache=True`` to the :meth:`FileCache.modification_time` method.

Upload a File
-------------

The :meth:`FileCache.upload` method is used to upload a local copy of a file to a
remote source. It takes a URL as an argument and returns the local path of the file
that was uploaded.

.. code-block:: python

    path = fc.upload('https://data.com/data/file.txt')

Multiple files, including for different sources, can be uploaded at the same time by
passing a list of URLs:

.. code-block:: python

    paths = fc.upload(['https://data.com/data/file1.txt',
                       'gs://my-bucket/data/file2.txt'])

In this case, the returned list of paths will be in the same order as the input list
of URLs. By default, if uploading any of the files results in a failure of some kind,
the entire operation will fail and an appropriate exception will be raised (when
uploading multiple files, some, but not all, of the files may have been successfully
uploaded before the exception was encountered). This can be overridden by setting the
``exception_on_fail`` argument to ``False``. In this case, the returned list of paths
will contain a mixture of `Path` objects and `Exception` objects, where each
`Exception` object reports the reason for the failure of that file.

Files can also be opened, written, and uploaded in a single operation by using
:meth:`FileCache.open` as a context manager.

.. code-block:: python

    with fc.open('https://data.com/data/file.txt', 'w') as f:
        f.write('Hello, World!')
    # File is automatically uploaded when the context manager for the file handle exits

Deleting a File
---------------

The :meth:`FileCache.unlink` method is used to delete a file from the cache and the
remote source. It takes a URL as an argument and returns the local path of the file
that was deleted.

.. code-block:: python

    path = fc.unlink('https://data.com/data/file.txt')

Multiple files, including from different sources, can be deleted at the same time by
passing a list of URLs:

.. code-block:: python

    paths = fc.unlink(['https://data.com/data/file1.txt',
                       'gs://my-bucket/data/file2.txt'])

In this case, the returned list of paths will be in the same order as the input list
of URLs. By default, if deleting any of the files results in a failure of some kind,
the entire operation will fail and an appropriate exception will be raised (when
deleting multiple files, some, but not all, of the files may have been successfully
deleted before the exception was encountered). This can be overridden by setting the
``exception_on_fail`` argument to ``False``. In this case, the returned list of paths
will contain a mixture of `Path` objects and `Exception` objects, where each
`Exception` object reports the reason for the failure of that file.

FCPath
******

While the :class:`FileCache` class provides direct oversight of files in a cache along
with the necessary methods to manipulate them, it is often more convenient to operate
on URLs and local files using the simpler syntax provided by the Python `pathlib.Path`
class. The :class:`FCPath` class is a reimplementation of the `Path` class to support
remote access using an associated :class:`FileCache`. It supports all of the common
path operations provided by `Path`, as well as the ability to operate on both URLs and
local files.

Every :class:`FCPath` instance must be associated with a :class:`FileCache` instance.
An :class:`FCPath` instance can be created by using the :meth:`FileCache.new_path`
method, in which case the association is automatic. Alternatively, an :class:`FCPath`
instance can be created by calling the :class:`FCPath` constructor directly. In this
case, by default a new :class:`FileCache` instance is created with default parameters
(and thus is named ``"global"`` and is permanent and not time-sensitive). A specific
:class:`FileCache` instance can be specified by passing it to the :class:`FCPath`
constructor using the ``filecache`` keyword argument. If no :class:`FileCache`
instance is specified, the default ``"global"`` cache is used. Other parameters are
available for the :class:`FCPath` constructor or the :meth:`FileCache.new_path` method
(such as ``nthreads`` and ``url_to_url``) that will override the standard values
defined in the associated :class:`FileCache` instance.

When :class:`FCPath` instances are combined by using the ``/`` operator, the
:class:`FileCache` association is inherited from the left-hand side :class:`FCPath`
instance. For example:

.. code-block:: python

    root = FCPath('https://data.com')  # Defaults to the "global" cache
    path = root / 'data' / 'file.txt'
    # The path is 'https://data.com/data/file.txt' in the "global" cache

    with FileCache(None, time_sensitive=True) as fc:
        root = FCPath('https://data.com', filecache=fc)
        path = root / 'data' / 'file.txt'
        # The path is 'https://data.com/data/file.txt' in the specified cache

Passing an :class:`FCPath` to the :class:`FCPath` constructor will create a new
:class:`FCPath` instance with the same :class:`FileCache` association. This means that
if you write a function that takes a string, `Path`, or :class:`FCPath`, you can
effectively cast any input into an :class:`FCPath` by calling the constructor, without
risk of losing the :class:`FileCache` association:

.. code-block:: python

    def my_function(path: str | Path | FCPath):
        fc_path = FCPath(path)
        # fc_path is now an FCPath instance with the same FileCache
        # association as the input, if available, otherwise "global"

All :class:`FileCache` methods can be called on :class:`FCPath` instances. For
example:

.. code-block:: python

    fc = FileCache(None)
    fc_path = FCPath('https://data.com/data/file.txt', filecache=fc)
    exists = fc_path.exists()
    # is equivalent to...
    exists = fc.exists('https://data.com/data/file.txt')

All :class:`FCPath` methods that are versions of :class:`FileCache` methods take an
optional ``sub_path`` argument that is a path relative to the :class:`FCPath`
instance. This allows a single :class:`FCPath` instance to be used to reference a
parent directory, and then the sub-path can be used for specific files within that
directory. This is effectively shorthand for the ``/`` operator. For example:

.. code-block:: python

    root = FCPath('https://data.com/data')
    exists = root.exists('file.txt')
    # is equivalent to...
    fc_path = FCPath('https://data.com/data/file.txt')
    exists = fc_path.exists()

Other parameters are also available (such as ``nthreads`` and ``url_to_url``) that
will override the standard values defined in the associated :class:`FileCache`
instance. The override hierarchy is:

1. Parameters passed to an individual :class:`FCPath` method such as
   :meth:`FCPath.exists`, which override...
2. Parameters passed to the :class:`FCPath` constructor or the
   :meth:`FileCache.new_path` method, which override...
3. Parameters passed to the :class:`FileCache` constructor for the associated
   :class:`FileCache` instance, which override...
4. Default values for :class:`FileCache`

.. _cache_best_practices:

Best Practices
**************

- Segment data by purpose, lifetime, and sharing. Create a separate cache for each
  logical grouping.
- When creating a new :class:`FCPath`, be sure to specify an appropriate
  ``filecache`` argument or use the :meth:`FileCache.new_path` method. Otherwise it
  will always use the default global cache.
- Do not use ``time_sensitive=True`` unless you expect the remote source to change
  during the program's execution.
- Use ``cache_metadata=True`` when possible, especially if you are iterating over the
  contents of a directory before downloading files.