Introduction
============

``filecache`` is a Python module that abstracts away the location where files used or
generated by a program are stored. Files can be on the local file system (``/`` or
``file://``), in Google Cloud Storage (``gs://``), on Amazon Web Services S3 (``s3://``),
or on a web server (``http://`` or ``https://``). There is also a fake remote that can be
used to simulate a remote using the local filesystem as a file repository (``fake://``).
The fundamental concept is that of an isolated cache defined by a :class:`FileCache`
instance.


Creating a FileCache Instance
*****************************

There are two fundamental ways to use :class:`FileCache`. The first is to create a
:class:`FileCache` instance directly. The second is to use :class:`FileCache` as a context manager.

Example of direct creation:

.. code-block:: python

    from filecache import FileCache

    fc = FileCache()
    # perform operations on fc
    # fc persists until it is explicitly deleted or the program exits

Example of context manager:

.. code-block:: python

    from filecache import FileCache

    with FileCache() as fc:
        ...  # perform operations on fc
    # fc is deleted here and the cache may be cleaned up (see below)


FileCache Attributes
********************

A :class:`FileCache` instance contains a variety of attributes that specify the location
and behavior of the cache. These attributes are described below.

.. _cache_location:

Cache Location
--------------

The location of the cache on the local file system is specified by a combination of a base
directory and a subdirectory name.

The Base Directory
^^^^^^^^^^^^^^^^^^

By default, the base directory is the standard temporary directory for the operating
system. This is typically ``C:\TEMP``, ``C:\TMP``, ``\TEMP``, or ``\TMP`` on Windows and
``/tmp``, ``/var/tmp``, or ``/usr/tmp`` on other platforms. The choice of temporary
directory can be overridden with the ``TMPDIR``, ``TEMP``, or ``TMP`` environment
variables, but this will affect other modules that use the system temporary directory such
as ``tempfile``. To specify a different default base directory for all :class:`FileCache`
instances without affecting the system temporary directory, set the
``FILECACHE_CACHE_ROOT`` environment variable. Finally, to provide a different base
directory for a particular :class:`FileCache` instance, pass the ``cache_root`` argument
to the :class:`FileCache` constructor. The ``cache_root`` argument can be a string or
`Path` object.
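
The lookup order described above can be sketched as follows. This is only an illustration
of the precedence, not the library's actual implementation, and the helper name
``resolve_cache_root`` is hypothetical.

.. code-block:: python

    import os
    import tempfile
    from pathlib import Path


    def resolve_cache_root(cache_root=None):
        """Hypothetical helper mirroring the documented lookup order."""
        # 1. An explicit cache_root argument takes priority
        if cache_root is not None:
            return Path(cache_root)
        # 2. Otherwise use the FILECACHE_CACHE_ROOT environment variable, if set
        env_root = os.environ.get('FILECACHE_CACHE_ROOT')
        if env_root:
            return Path(env_root)
        # 3. Otherwise use the system temporary directory, which itself honors
        #    TMPDIR, TEMP, and TMP
        return Path(tempfile.gettempdir())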

The choice of base directory should be considered carefully. Depending on the operating
system, the system temporary directory may be on its own filesystem, or be part of the
root filesystem. Either way, the filesystem may not have enough free space to store the
cache. In addition, on some operating systems the temporary directory is purged on reboot,
making it unsuitable for long-term storage of cached files.

The Subdirectory Name
^^^^^^^^^^^^^^^^^^^^^

The subdirectory name is specified by the ``cache_name`` argument to the :class:`FileCache`
constructor. The default value of ``cache_name`` is ``"global"``, which will result in the
subdirectory name ``_filecache_global``. If a different name is specified, the subdirectory
name will be ``_filecache_<cache_name>``. Finally, if ``cache_name`` is ``None``, the cache will
be stored in a subdirectory with the prefix ``_filecache_`` followed by a globally unique
identifier.
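
As a rough sketch of the naming rule above, the subdirectory name could be derived as
shown below. The helper name is hypothetical, and the use of ``uuid.uuid4()`` for the
unique identifier is an assumption made for illustration; the library may generate its
identifier differently.

.. code-block:: python

    import uuid


    def cache_subdir_name(cache_name='global'):
        """Hypothetical helper mirroring the documented naming rule."""
        if cache_name is None:
            # Unnamed cache: append a unique identifier (uuid4 is an assumption)
            return f'_filecache_{uuid.uuid4()}'
        return f'_filecache_{cache_name}'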

The choice of cache name primarily affects sharing. If more than one program, or instance
of the same program, is going to use the cached data (i.e. they download and use the same
files from the same remote source), then they can share the same cache by using a common
cache name. This increases performance and reduces disk space usage by avoiding duplicate
downloads of the same files. The default ``"global"`` cache is a convenient place to store
files that are needed by multiple programs if there is no need to otherwise segregate
them. However, if a uniquely-named cache is used, then sharing is not possible because the
cache name is not available to other programs.

Another consideration is the lifetime and maintenance of the cache. A cache can also be
viewed as a grouping of files that share the same basic purpose. For some caches, the
purpose may be long-term storage of unchanging remote data.
For others, the purpose may be more ephemeral, downloading files for a single operation
and then no longer needing them after the operation is complete; see the section on
`cache_lifetime`_ for details on automated cache maintenance. Manual maintenance of a
cache is also possible once you know its base directory and subdirectory name. For
example, you can clear the cache simply by deleting the cache directory. If caches are
appropriately named and segregated, then you can clear the cache for one type of data
without affecting other types.


.. _cache_lifetime:

Cache Lifetime
--------------

Caches are either permanent or ephemeral. A permanent cache will never be deleted by the
:class:`FileCache` code and will persist on the local disk until it is manually deleted
(using ``rm`` or equivalent), explicitly deleted by the program (by calling the
:meth:`FileCache.delete_cache` method), or deleted by the operating system on reboot, as
discussed above. In contrast, an ephemeral cache will be deleted when the
:class:`FileCache` instance is deleted, which is at the exit of a context manager, when
the :class:`FileCache` instance is deleted explicitly, or on program exit.

By default, named caches are permanent and unnamed caches (``cache_name`` is ``None``) are
ephemeral. This can be overridden by passing the ``delete_on_exit`` argument to the
:class:`FileCache` constructor. If ``delete_on_exit`` is ``True``, the cache will be
deleted whenever the :class:`FileCache` instance is deleted. Be careful setting this to
``True`` on a named cache, as it will delete that cache on program exit while other
programs may still be using it. In general, it is recommended to avoid this argument except
under very specific circumstances.


.. _cache_time:

Preserving File Timestamps
--------------------------

Files on the local filesystem usually have multiple timestamps associated with them, such
as the modification time, access time, and creation time. However, files on remote sources
may or may not have similar timestamps. By default, for efficiency and generality,
timestamps are not preserved when downloading or uploading files. However, this can be
changed by setting the ``time_sensitive`` argument to the :class:`FileCache` constructor.
If ``time_sensitive`` is ``True``, then:

- When a file is retrieved, the modification time from the source location, if available, is
  set on the local copy. If a local copy already exists, the times on both copies are
  compared and the local copy is updated if the source is newer.
- When a file is uploaded, the modification timestamp on the remote copy is set to that of
  the local copy, if possible.
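
The comparison and timestamp propagation described above can be illustrated with the
standard library alone. The helper names below are hypothetical, and treating a missing
source timestamp as "do not refresh" is an assumption; this is not the library's code.

.. code-block:: python

    import os


    def source_is_newer(local_mtime, source_mtime):
        """Return True if the local copy should be refreshed from the source."""
        if source_mtime is None:
            # The source did not report a modification time (assumed: no refresh)
            return False
        # A missing local copy (None) always needs to be fetched
        return local_mtime is None or source_mtime > local_mtime


    def apply_source_mtime(local_path, source_mtime):
        """Set the local file's modification time to that of the source."""
        atime = os.stat(local_path).st_atime  # Leave the access time unchanged
        os.utime(local_path, (atime, source_mtime))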

By default, when ``time_sensitive`` is ``True``, the modification timestamps on both local
and remote copies are queried each time they are needed. Thus, if the remote copy is
changed by another program during the lifetime of the :class:`FileCache` instance, when
:meth:`FileCache.retrieve` is called, the local copy will be updated with the contents of
the new remote copy. However, if the remote copy is guaranteed not to change during the
lifetime of the :class:`FileCache` instance, then this extra network traffic can be avoided
by setting the ``cache_metadata`` argument to ``True``. This will cause the
:class:`FileCache` instance to cache the metadata (such as modification time, size, and
``is_dir``) of remote files. Methods that iterate over the contents of a directory, such
as :meth:`FileCache.iterdir` and :meth:`FileCache.iterdir_metadata`, will also populate
the metadata cache, making future retrievals more efficient.


.. _cache_mp:

Multi-Processor Safety
----------------------

When more than one process is using the same cache, race conditions are possible if they
attempt to retrieve the same file at the same time. By default, a :class:`FileCache`
instance will use multiprocessor-safe locking for 'named' (and thus possibly shared)
caches to protect the cache from concurrent writes. If a named cache is guaranteed not
to be shared between multiple processes, then this can be disabled by passing the
``mp_safe`` argument to the :class:`FileCache` constructor. If ``mp_safe`` is ``False``,
then the :class:`FileCache` instance will not use multiprocessor-safe locking, thus
improving performance. On the other hand, if ``mp_safe`` is ``True``, then the
:class:`FileCache` instance will always use multiprocessor-safe locking, even for unnamed
caches.

When multiprocessor-safe locking is used and multiple processes are attempting to retrieve
the same file at the same time, the first process to lock the file will be able to
retrieve it, while the other processes will wait for the lock to be released. The lock is
released when the file is retrieved successfully or when an error occurs. If the
retrieving process crashes or is killed and the lock is not successfully released, then it
is possible for the waiting processes to wait forever. To avoid this, a timeout can be
specified by passing the ``lock_timeout`` argument to the :class:`FileCache` constructor
or to the individual method. However, caution should be exercised because if the timeout
is too short and file retrieval is slow, then the waiting processes may time out and raise
a ``TimeoutError`` prematurely.
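
To make the waiting behavior concrete, here is a minimal sketch of a lock with a timeout.
The locking mechanism actually used by :class:`FileCache` is not described here; this
example substitutes exclusive creation of a lock file, and the function names are
hypothetical.

.. code-block:: python

    import os
    import time


    def acquire_lock(lock_path, lock_timeout=None, poll_interval=0.05):
        """Create lock_path exclusively, waiting up to lock_timeout seconds."""
        deadline = None if lock_timeout is None else time.monotonic() + lock_timeout
        while True:
            try:
                # O_EXCL makes creation fail if the lock file already exists
                fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.close(fd)
                return
            except FileExistsError:
                if deadline is not None and time.monotonic() >= deadline:
                    raise TimeoutError(f'Could not lock {lock_path}')
                time.sleep(poll_interval)


    def release_lock(lock_path):
        """Release the lock by removing the lock file."""
        os.remove(lock_path)

If the process holding the lock dies before calling ``release_lock``, the lock file
remains in place. With ``lock_timeout=None`` a waiting process would then poll forever,
which is the situation the timeout is meant to prevent.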


.. _cache_parallelism:

Parallelism
-----------

Many operations in :class:`FileCache` are performed in parallel using multiple threads if
all URLs are provided at the same time. These include :meth:`FileCache.retrieve`,
:meth:`FileCache.upload`, :meth:`FileCache.exists`, :meth:`FileCache.is_dir`,
:meth:`FileCache.modification_time`, and :meth:`FileCache.unlink`. The number of threads
to use can be specified by passing the ``nthreads`` argument to the :class:`FileCache`
constructor or to the individual method. The default value is 8. The number of threads that
can be used effectively is limited by the number of files being processed, the number of
CPUs, and the speed of the source or network. Increasing the number of threads beyond what
the system can handle may reduce performance.
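
The general pattern of applying one operation to many URLs with a bounded thread pool can
be sketched with the standard library. This illustrates the technique only; it is not the
library's implementation, and ``check_all`` and ``check_one`` are hypothetical names.

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor


    def check_all(check_one, urls, nthreads=8):
        """Apply check_one to each URL using up to nthreads worker threads."""
        if not urls:
            return []
        # There is no benefit to having more threads than URLs
        max_workers = min(nthreads, len(urls))
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            # executor.map returns results in the same order as the inputs
            return list(executor.map(check_one, urls))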


.. _cache_url_translation:

URL Translation
---------------

For advanced users, two mapping systems are provided. The first translates URLs into URLs,
and the second translates URLs into local paths.

URL to URL Translation
^^^^^^^^^^^^^^^^^^^^^^

:class:`FileCache` is particularly useful when the same data may be available from
multiple sources. For example, a dataset may be available for direct download from a
website as well as from a cloud storage bucket. Depending on the particular situation, it
may be desirable to choose one source over the other. Assuming the data is laid out
identically, all that is required is to change the prefix to the URL, for example from
``https://data.com/data`` to ``gs://my-bucket/data``, and all operations will work as
expected. This change can be done using standard Python string manipulation functions. For
example:

.. code-block:: python

    import os

    data_source = os.environ.get('DATA_SOURCE', 'https://data.com/data')
    url = f'{data_source}/file.txt'

However, if the data is laid out differently, additional logic is required to determine
the correct URL. The :class:`FileCache` constructor (and each method) accepts a list of
functions that are used to translate URLs into URLs. Each function takes three arguments:
the scheme, the remote, and the path. The functions are called in order until one returns
a URL, or it falls through to the default (which does nothing). This allows the programmer
to write code for one particular data source, and then use the same code for other data
sources with different layouts.

See :ref:`example_url_to_url` for a usage example.
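
As a rough illustration of the calling convention described above, the following sketch
defines one translator and a loop that applies a list of translators in order. The names
are hypothetical, as are the bucket layout, the exact form of the three arguments (for
example, that ``path`` has no leading slash), and the use of ``None`` to signal that a
translator declines to translate.

.. code-block:: python

    def gcs_mirror_translator(scheme, remote, path):
        """Hypothetical translator: redirect one web server to a bucket."""
        if scheme == 'https' and remote == 'data.com':
            # Assumed layout: the bucket stores the same tree under "archive/"
            return f'gs://my-bucket/archive/{path}'
        return None  # Assumed convention for "no translation"


    def translate_url(translators, scheme, remote, path):
        """Call each translator in order; the first non-None result wins."""
        for translator in translators:
            new_url = translator(scheme, remote, path)
            if new_url is not None:
                return new_url
        # Default: leave the URL unchanged
        return f'{scheme}://{remote}/{path}'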

URL to Path Translation
^^^^^^^^^^^^^^^^^^^^^^^

Normally, the local directory structure for a cache mirrors that of the remote source. For
example, in the cache named ``"global"``, a file retrieved from
``https://data.com/data/file11.txt`` will be stored in
``<cache_root>/_filecache_global/http_data_com/data/file11.txt``. The exact layout of the
cache is normally irrelevant, as :class:`FileCache` methods automatically translate URLs
into local paths. In obscure cases, it may be desirable to store the data in a different
hierarchy. For example, this could be useful if the user is expected to manually inspect
the cache directory, but the layout of the remote source is confusing for some reason. In
this case, a mapping function can be used to translate the URL into a local path with a
more user-friendly format.

Each user-provided translator function takes five arguments: the scheme, the remote, the
path, the cache directory, and the cache subdirectory. It returns a string or
`Path` object giving the new absolute path of the cached file, or ``None`` if no
translation is desired. This translation is performed on the original URL, not the URL
generated by a URL to URL translator, if any.

The default translator function is:

.. code-block:: python

    from pathlib import Path

    def default_url_to_path(scheme, remote, path, cache_dir, cache_subdir):
        if scheme == 'file':
            return Path(path)
        return cache_dir / cache_subdir / path

See :ref:`example_url_to_path` for a usage example.
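
For comparison with the default, a custom translator might look like the sketch below,
which stores web downloads by file name only. The function name, the ``web`` directory,
and the assumed contents of ``cache_dir`` and ``cache_subdir`` are illustrative.

.. code-block:: python

    from pathlib import Path


    def flat_url_to_path(scheme, remote, path, cache_dir, cache_subdir):
        """Hypothetical translator: store web files by file name only."""
        if scheme not in ('http', 'https'):
            return None  # No translation; the default layout is used
        # Keep only the final component, e.g. "data/file11.txt" -> "file11.txt"
        return Path(cache_dir) / cache_subdir / 'web' / Path(path).name

Flattening the hierarchy in this way makes the cache easier to browse, but two remote
files with the same name would map to the same local path, so it is only appropriate when
file names are known to be unique.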


Basic FileCache Operations
**************************

The following are the basic operations that can be performed on a :class:`FileCache`
instance. Other properties and methods are available; see :ref:`module_file_cache` for
more details.

Check if a File Exists
----------------------

The :meth:`FileCache.exists` method is used to check if a file exists in the cache or on
the remote source. It takes a URL as an argument and returns ``True`` if the file exists,
``False`` otherwise.

.. code-block:: python

    exists = fc.exists('https://data.com/data/file.txt')

Multiple URLs, including from different sources, can be checked at the same time by
passing a list of URLs:

.. code-block:: python

    exists = fc.exists(['https://data.com/data/file1.txt',
                        'gs://my-bucket/data/file2.txt'])

In this case, the returned list of booleans will be in the same order as the input list of
URLs.

If any kind of exception is encountered, the return value for that file will be
``False``. Thus, it is more accurate to say that the operation checks for the
accessibility of a file rather than its existence. For example, if the user does not have
access to a remote source, a file there will be reported as non-existent even if it
actually exists.

By default, :meth:`FileCache.exists` first checks the local cache for the file. If the
file is found there, it is reported as existing. If the file is not found there, it is
checked on the remote source. If the local cache and remote source are out of sync (for
example the remote copy of the file was deleted outside of the :class:`FileCache`
ecosystem), then the file could be reported as existing even though it is only the cached
version that exists. The ``bypass_cache`` argument can be used to override this behavior
and check the remote source directly without checking the local cache.
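
Because results are returned in input order, each URL can be paired with its result. The
sketch below assumes ``fc`` is an existing :class:`FileCache` instance; ``find_missing``
is a hypothetical helper and not part of the library.

.. code-block:: python

    def find_missing(fc, urls):
        """Return the URLs that are not accessible on the remote source."""
        # bypass_cache=True skips the local cache and checks the remote directly
        results = fc.exists(urls, bypass_cache=True)
        return [url for url, found in zip(urls, results) if not found]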


Retrieve a File
---------------

The :meth:`FileCache.retrieve` method is used to retrieve a file from a remote source.
It takes a URL as an argument and returns a local path to the downloaded file.

.. code-block:: python

    path = fc.retrieve('https://data.com/data/file.txt')

Multiple URLs, including from different sources, can be retrieved at the same time by
passing a list of URLs:

.. code-block:: python

    paths = fc.retrieve(['https://data.com/data/file1.txt',
                         'gs://my-bucket/data/file2.txt'])

In this case, the returned list of paths will be in the same order as the input list of URLs.

By default, if downloading any of the files results in a failure of some kind, the entire
operation will fail and an appropriate exception will be raised (when retrieving multiple
files, some, but not all, of the files may have been successfully downloaded before the
exception was encountered). This can be overridden by setting the ``exception_on_fail``
argument to ``False``. In this case, the returned list of paths will contain a mixture of
`Path` objects and `Exception` objects, where each `Exception` object
reports the reason for the failure of that file.
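
One way to handle such a mixed result list is to separate the successes from the
failures. The sketch below assumes ``fc`` is an existing :class:`FileCache` instance;
``retrieve_available`` is a hypothetical helper and not part of the library.

.. code-block:: python

    def retrieve_available(fc, urls):
        """Retrieve whatever can be retrieved and report the rest."""
        results = fc.retrieve(urls, exception_on_fail=False)
        paths = {}
        failures = {}
        for url, result in zip(urls, results):
            if isinstance(result, Exception):
                failures[url] = result  # The reason this URL failed
            else:
                paths[url] = result  # The local path of the downloaded file
        return paths, failures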

Files can also be retrieved and opened in a single operation by using
:meth:`FileCache.open` as a context manager.

.. code-block:: python

    with fc.open('https://data.com/data/file.txt') as f:
        content = f.read()


Get the Modification Time of a File
-----------------------------------

The :meth:`FileCache.modification_time` method is used to get the modification time of a
file. It takes a URL as an argument and returns the modification time of the file *on the
remote source* as a UNIX timestamp. The local cache version of the file is ignored.

.. code-block:: python

    mtime = fc.modification_time('https://data.com/data/file.txt')

If the file does not exist or cannot be accessed, an exception is raised by
default. When called with ``exception_on_fail=False``, those failures are
returned as :class:`Exception` instances instead of being raised. If the
remote file exists but no modification time is available, ``None`` is returned.

Multiple URLs, including from different sources, can be checked at the same time by
passing a list of URLs:

.. code-block:: python

    mtimes = fc.modification_time(['https://data.com/data/file1.txt',
                                   'gs://my-bucket/data/file2.txt'])

In this case, the returned list of modification times will be in the same order as the
input list of URLs.

If ``cache_metadata`` is ``True`` for the :class:`FileCache` instance, then the
modification time is retrieved from the metadata cache if possible. This could result in
an erroneous result if the remote file changed during the execution of the program and
lifetime of the cache. The remote source can be queried directly by passing
``bypass_cache=True`` to the :meth:`FileCache.modification_time` method.
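
Since the return value is a UNIX timestamp or ``None``, it can be converted to a readable
form with the standard library. ``format_mtime`` below is a hypothetical helper shown
only for illustration.

.. code-block:: python

    from datetime import datetime, timezone


    def format_mtime(mtime):
        """Render a UNIX timestamp as an ISO 8601 UTC string."""
        if mtime is None:
            # The source exists but did not report a modification time
            return 'unknown'
        return datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()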


Upload a File
-------------

The :meth:`FileCache.upload` method is used to upload a local copy of a file to a remote
source. It takes a URL as an argument and returns the local path of the file that was uploaded.

.. code-block:: python

    path = fc.upload('https://data.com/data/file.txt')

Multiple files, including for different sources, can be uploaded at the same time by
passing a list of URLs:

.. code-block:: python

    paths = fc.upload(['https://data.com/data/file1.txt',
                       'gs://my-bucket/data/file2.txt'])

In this case, the returned list of paths will be in the same order as the input list of
URLs.

By default, if uploading any of the files results in a failure of some kind, the entire
operation will fail and an appropriate exception will be raised (when uploading multiple
files, some, but not all, of the files may have been successfully uploaded before the
exception was encountered). This can be overridden by setting the ``exception_on_fail``
argument to ``False``. In this case, the returned list of paths will contain a mixture of
`Path` objects and `Exception` objects, where each `Exception` object
reports the reason for the failure of that file.

Files can also be opened, written, and uploaded in a single operation by using
:meth:`FileCache.open` as a context manager.

.. code-block:: python

    with fc.open('https://data.com/data/file.txt', 'w') as f:
        f.write('Hello, World!')
    # File is automatically uploaded when the context manager for the file handle exits


Deleting a File
---------------

The :meth:`FileCache.unlink` method is used to delete a file from the cache and the remote
source. It takes a URL as an argument and returns the local path of the file that was
deleted.

.. code-block:: python

    path = fc.unlink('https://data.com/data/file.txt')

Multiple files, including from different sources, can be deleted at the same time by
passing a list of URLs:

.. code-block:: python

    paths = fc.unlink(['https://data.com/data/file1.txt',
                       'gs://my-bucket/data/file2.txt'])

In this case, the returned list of paths will be in the same order as the input list of URLs.

By default, if deleting any of the files results in a failure of some kind, the entire
operation will fail and an appropriate exception will be raised (when deleting multiple
files, some, but not all, of the files may have been successfully deleted before the
exception was encountered). This can be overridden by setting the ``exception_on_fail``
argument to ``False``. In this case, the returned list of paths will contain a mixture of
`Path` objects and `Exception` objects, where each `Exception` object
reports the reason for the failure of that file.


FCPath
******

While the :class:`FileCache` class provides direct oversight of files in a cache along
with the necessary methods to manipulate them, it is often more convenient to operate on
URLs and local files using the simpler syntax provided by the Python `pathlib.Path`
class. The :class:`FCPath` class is a reimplementation of the `Path` class to
support remote access using an associated :class:`FileCache`. It supports all of the
common path operations provided by `Path`, as well as the ability to operate on
both URLs and local files.

Every :class:`FCPath` instance must be associated with a :class:`FileCache` instance. An
:class:`FCPath` instance can be created by using the :meth:`FileCache.new_path` method, in which
case the association is automatic. Alternatively, an :class:`FCPath` instance can be created by
calling the :class:`FCPath` constructor directly. In this case, by default a new :class:`FileCache`
instance is created with default parameters (and thus is named ``"global"`` and is permanent and
not time-sensitive). A specific :class:`FileCache` instance can be specified instead by passing it
to the :class:`FCPath` constructor using the ``filecache`` keyword argument.

Other parameters are available for the :class:`FCPath` constructor or the
:meth:`FileCache.new_path` method (such as ``nthreads`` and ``url_to_url``) that will
override the standard values defined in the associated :class:`FileCache` instance.

When :class:`FCPath` instances are combined by using the ``/`` operator, the
:class:`FileCache` association is inherited from the left-hand side :class:`FCPath` instance.

For example:

.. code-block:: python

    root = FCPath('https://data.com')  # Defaults to the "global" cache
    path = root / 'data' / 'file.txt'
    # The path is 'https://data.com/data/file.txt' in the "global" cache

    with FileCache(None, time_sensitive=True) as fc:
        root = FCPath('https://data.com', filecache=fc)
        path = root / 'data' / 'file.txt'
        # The path is 'https://data.com/data/file.txt' in the specified cache

Passing an :class:`FCPath` to the :class:`FCPath` constructor will create a new
:class:`FCPath` instance with the same :class:`FileCache` association. This means that if
you write a function that takes a string, `Path`, or :class:`FCPath`, you can effectively
cast any input into an :class:`FCPath` by calling the constructor, without risk of losing
the :class:`FileCache` association:

.. code-block:: python

    def my_function(path: str | Path | FCPath):
        fc_path = FCPath(path)
        # fc_path is now an FCPath instance with the same FileCache association
        # as the input, if available, otherwise the "global" cache

All :class:`FileCache` methods can be called on :class:`FCPath` instances. For example:

.. code-block:: python

    fc = FileCache(None)
    fc_path = FCPath('https://data.com/data/file.txt', filecache=fc)
    exists = fc_path.exists()

    # is equivalent to...
    exists = fc.exists('https://data.com/data/file.txt')

All :class:`FCPath` methods that are versions of :class:`FileCache` methods take an
optional ``sub_path`` argument that is a path relative to the :class:`FCPath` instance. This
allows a single :class:`FCPath` instance to be used to reference a parent directory, and then
the sub-path can be used for specific files within that directory. This is effectively shorthand
for the ``/`` operator. For example:

.. code-block:: python

    root = FCPath('https://data.com/data')
    exists = root.exists('file.txt')

    # is equivalent to...
    fc_path = FCPath('https://data.com/data/file.txt')
    exists = fc_path.exists()

Individual :class:`FCPath` methods also accept other parameters (such as ``nthreads`` and
``url_to_url``) that override the standard values defined in the associated
:class:`FileCache` instance. The override hierarchy is:

1. Parameters passed to an individual :class:`FCPath` method such as :meth:`FCPath.exists`,
   which override...
2. Parameters passed to the :class:`FCPath` constructor or the :meth:`FileCache.new_path`
   method, which override...
3. Parameters passed to the :class:`FileCache` constructor for the associated
   :class:`FileCache` instance, which override...
4. Default values for :class:`FileCache`
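
The hierarchy above amounts to taking the first value that was explicitly specified. A
minimal sketch follows, assuming that ``None`` means "not specified" at each level; the
function name is hypothetical and this is not the library's code.

.. code-block:: python

    def resolve_parameter(method_value, path_value, cache_value, default):
        """Return the first explicitly specified value, else the default."""
        for value in (method_value, path_value, cache_value):
            if value is not None:
                return value
        return default

For example, ``resolve_parameter(None, 4, 16, 8)`` returns ``4``: the value given to the
:class:`FCPath` constructor overrides both the :class:`FileCache` value and the default.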


.. _cache_best_practices:

Best Practices
**************

- Segment data by purpose, lifetime, and sharing. Create a separate cache for each logical
  grouping.
- When creating a new :class:`FCPath`, be sure to specify an appropriate ``filecache`` or use the
  :meth:`FileCache.new_path` method. Otherwise it will always use the default global cache.
- Do not use ``time_sensitive=True`` unless you expect the remote source to change during the
  program's execution.
- Use ``cache_metadata=True`` when possible, especially if you are iterating over the contents
  of a directory before downloading files.