Opened 2 years ago

Closed 2 years ago

#2574 closed defect (fixed)

configurable size of gdal datasets cache in wcst_import

Reported by: Dimitar Misev
Owned by: Bang Pham Huu
Priority: major
Milestone: 10.0
Component: wcst_import
Version: 9.8
Keywords:
Cc:
Complexity: Medium

Description

We need to allow configuring the number of open GDAL files that the cache in gdal_util.py holds, because keeping too many files open can hit various limits on Linux (e.g. the usual maximum of 1024 open files, or the maximum number of threads, as each gdal.Open appears to create 7 threads).

  • wcst_import.sh needs a new option:
     -c, --gdal-cache-size <size>
       the number of open gdal datasets to keep in cache in order
       to avoid reopening the same files, which can be costly. The
       specified value can be one of: -1 (no limit, cache all files),
       0 (fully disable caching), N (clear the cache whenever it has
       more than N datasets, N should be greater than 0). The
       default value is -1 if this option is not specified.
    

Change History (3)

comment:1 by Dimitar Misev, 2 years ago

At the end of https://doc.rasdaman.org/05_geo-services-guide.html#data-import a new section can be added:

Importing many files
--------------------

When an ingredient contains many paths to be imported, usually more than 1000,
the import may hit some system limits.

In particular, when data is imported with the GDAL driver, wcst_import keeps a
cache of open GDAL datasets to avoid reopening files, which is costly. With
too many open GDAL datasets, the limit on the maximum number of open files,
often 1024 (see ``ulimit -n``), can be reached. wcst_import handles this case
by clearing its cache; however, this may degrade import performance, so
increasing the limit on open files should be considered.
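The current open-files limit can also be checked from Python directly with the standard-library ``resource`` module (available on Unix):

```python
# Query the soft/hard limits on open file descriptors for this process;
# the soft limit is what `ulimit -n` reports on Linux.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-files limit: soft=%s, hard=%s" % (soft, hard))
```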

Furthermore, limits on the maximum number of threads may be reached as well,
as each open GDAL dataset creates several threads. This leads to errors
such as ``fork: retry: Resource temporarily unavailable``.
The maximum allowed number of threads can be observed with
``cat /sys/fs/cgroup/pids/user.slice/user-<id>.slice/pids.max``, where
``<id>`` can be found with ``id -u <user>`` for the user executing
wcst_import. Increasing it to a larger value, e.g. 4194304, should solve
this issue.
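The number of OS-level threads a process currently uses can be read from ``/proc`` on Linux; a small sketch:

```python
# Read the number of OS-level threads of the current process on Linux
# from /proc/self/status. This counts native threads (e.g. those created
# by GDAL), unlike threading.active_count(), which only sees Python threads.
def os_thread_count():
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    return None
```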

Finally, wcst_import.sh allows controlling the GDAL cache size with the
``-c, --gdal-cache-size <size>`` option. The
specified value can be one of: -1 (no limit, cache all files),
0 (fully disable caching), N (clear the cache whenever it has
more than N datasets, N should be greater than 0). The
default value is -1 if this option is not specified.

comment:2 by Bang Pham Huu, 2 years ago

code to clear the cache

    def _clear_gdal_dataset_cache(self):
        global _gdal_dataset_cache
        # Replacing the dict drops the last references to the cached
        # datasets, which lets GDAL close them; a loop that merely
        # rebinds a local variable (ds = None) would not release them.
        _gdal_dataset_cache = {}
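One subtlety worth noting: in CPython, rebinding a loop variable (e.g. ``ds = None``) does not drop the dictionary's own references to the datasets; only removing the entries, or replacing the dict, releases them. A minimal demonstration with a stand-in object:

```python
import weakref

class FakeDataset:          # stand-in for an open GDAL dataset
    pass

cache = {"a": FakeDataset()}
ref = weakref.ref(cache["a"])

for _, ds in cache.items():
    ds = None               # rebinds the local name only
assert ref() is not None    # still alive: the dict still holds it

cache.clear()               # actually drops the reference
assert ref() is None        # freed immediately (CPython refcounting)
```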

comment:3 by Bang Pham Huu, 2 years ago

Resolution: fixed
Status: assigned → closed