Your code reads some data, processes it, and uses too much memory. To reduce memory usage, you need to figure out where peak memory usage occurs, and which code is responsible for allocating memory at that point.

That's exactly what Fil will help you find. Fil is an open source memory profiler designed for data processing applications written in Python, and it includes native support for Jupyter.
At the moment it only runs on Linux and macOS, and while it supports threading, it does not yet support multiprocessing or multiple processes in general.
"Within minutes of using your tool, I was able to identify a major memory bottleneck that I never would have thought existed. The ability to track memory allocated via the Python interface and also C allocation is awesome, especially for my NumPy / Pandas programs."
For more information, including an example of the output, see https://pythonspeed.com/products/filmemoryprofiler/
There are two distinct patterns of Python usage, each with its own source of memory problems.
In a long-running server, memory usage can grow indefinitely due to memory leaks. That is, some memory is not being freed.
tracemalloc and Pympler can tell you which objects are leaking and what is preventing them from being freed.
Fil, however, is not aimed at memory leaks, but at the other use case: data processing applications. These applications load in data, process it somehow, and then finish running.
The problem with these applications is that they can, on purpose or by mistake, allocate huge amounts of memory. It might get freed soon after, but if you allocate 16GB RAM and only have 8GB in your computer, the lack of leaks doesn't help you.
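The difference between the two problems can be demonstrated with Python's built-in tracemalloc; the function and sizes below are purely illustrative:

```python
import tracemalloc

def process():
    # Allocate a large temporary list; it is freed when the function
    # returns, so nothing leaks -- but peak memory is still high.
    data = [0] * 1_000_000
    return sum(data)

tracemalloc.start()
process()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# `current` is back near zero (no leak), yet `peak` records the
# temporary list -- the high-water mark Fil is designed to explain.
```

A leak detector would report nothing wrong here, yet the process still needed all that memory at its peak.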
Fil will therefore tell you, in an easy-to-understand way, where peak memory usage occurred and which code was responsible for allocating that memory, including allocations made by native code that tracemalloc misses (tracemalloc only covers Python's memory APIs). This allows you to optimize that code in a variety of ways.
Assuming you're on macOS or Linux, and are using Python 3.6 or later, you can install Fil using either Conda or pip (or any tool that is pip-compatible and can install binary wheels).
To install on Conda:
$ conda install -c conda-forge filprofiler
To install the latest version of Fil you'll need Pip 19 or newer. You can check like this:
$ pip --version
pip 19.3.0
If you're using something older than v19, you can upgrade by doing:
$ pip install --upgrade pip
If that doesn't work, try running your code in a virtualenv:
$ python3 -m venv venv/
$ . venv/bin/activate
(venv) $ pip install --upgrade pip
Assuming you have a new enough version of pip:
$ pip install filprofiler
To measure peak memory usage of some code in Jupyter you need to do three things: use the "Python 3 with Fil" Jupyter kernel, load the extension with %load_ext filprofiler, and add the %%filprofile magic to the top of the cell with the code you wish to profile.
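A profiled cell might look like the following (the NumPy usage is just an illustration; any code in the cell gets profiled):

```
%%filprofile
import numpy as np
arr = np.ones((1024, 1024, 50))  # the report will attribute this allocation
```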
Instead of doing:
$ python yourscript.py --input-file=yourfile
you run:

$ fil-profile run yourscript.py --input-file=yourfile
Fil will generate a report and automatically try to open it for you in a browser.
Reports will be stored in the fil-result/ directory in your current working directory.
If your program is usually run as python -m yourapp.yourmodule --args, you can do that with Fil too:
$ fil-profile run -m yourapp.yourmodule --args
As of version 0.11, you can use python -m to run Fil:
$ python -m filprofiler run yourscript.py --input-file=yourfile
As of version 2021.04.2, you can disable opening reports in a browser by using the --no-browser option (see fil-profile --help for details).
If you want to serve the report files as static files from a web server, you can use python -m http.server.
You can also measure memory usage in just part of your program; this requires version 0.15 or later, and takes two steps.
Let's say you have some code that does the following:

def main():
    config = load_config()
    result = run_processing(config)
    generate_report(result)
You only want memory profiling for the run_processing() call. You can do so in the code like so:

from filprofiler.api import profile

def main():
    config = load_config()
    result = profile(lambda: run_processing(config), "/tmp/fil-result")
    generate_report(result)
You could also make it conditional, e.g. based on an environment variable:
import os
from filprofiler.api import profile

def main():
    config = load_config()
    if os.environ.get("FIL_PROFILE"):
        result = profile(lambda: run_processing(config), "/tmp/fil-result")
    else:
        result = run_processing(config)
    generate_report(result)
You still need to run your program in a special way. If previously you did:
$ python yourscript.py --config=myconfig
Now you would do:
$ fil-profile python yourscript.py --config=myconfig
Notice that you're doing fil-profile python, rather than fil-profile run as you would if you were profiling the full script.
Only functions explicitly called via filprofiler.api.profile() will have memory profiling enabled; the rest of the code will run at (close to) normal speed and with its normal configuration.
Each call to profile() will generate a separate report.
The memory profiling report will be written to the directory specified as the output destination when calling profile(); in our example above that was /tmp/fil-result. Unlike full-program profiling, if you call profile() more than once it is your responsibility to ensure each call writes to a unique directory.
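One way to satisfy the unique-directory requirement is a small helper that appends a timestamp and counter to a base path (the helper name and directory layout here are our own invention, not part of Fil's API):

```python
import itertools
import os
import time

_counter = itertools.count()

def unique_report_dir(base="/tmp/fil-result"):
    # A timestamp plus a per-process counter guarantees each profile()
    # call gets its own output directory, even within the same second.
    return os.path.join(base, f"{time.strftime('%Y-%m-%dT%H-%M-%S')}-{next(_counter)}")
```

You would then pass unique_report_dir() as the second argument to each profile() call.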
New in v0.14 and later: just run your program under Fil, and it will generate an SVG at the point in time when memory runs out, and then exit with exit code 53:

$ fil-profile run oom.py
...
=fil-profile= Wrote memory usage flamegraph to fil-result/2020-06-15T12:37:13.033/out-of-memory.svg
Fil uses three heuristics to determine if the process is close to running out of memory. Sometimes this detection will kick in too soon, shutting down the program even though in practice it could finish running; heavy use of mmap() can trigger it, for example, if you expect to be using disk as a backfill for memory. You can disable the heuristic by doing fil-profile --disable-oom-detection run yourprogram.py.
You've found where memory usage is coming from—now what?
If you're using data processing or scientific computing libraries, I have written a relevant guide to reducing memory usage.
Fil uses the LD_PRELOAD (Linux) / DYLD_INSERT_LIBRARIES (macOS) mechanism to preload a shared library at process startup.
This shared library captures all memory allocations and deallocations and keeps track of them.
At the same time, the Python tracing infrastructure (used e.g. by coverage.py) is used to figure out which Python callstack/backtrace is responsible for each allocation.
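The callstack-capture side can be sketched with the same tracing hook coverage.py relies on, sys.settrace. This is an illustration of the mechanism only; Fil's actual implementation does this in compiled code:

```python
import sys

call_stacks = []

def tracer(frame, event, arg):
    if event == "call":
        # Walk the frame chain to reconstruct the current Python callstack,
        # the way a profiler attributes an allocation to its caller.
        stack = []
        f = frame
        while f is not None:
            stack.append(f.f_code.co_name)
            f = f.f_back
        call_stacks.append(tuple(reversed(stack)))
    return tracer

def inner():
    return [0] * 10  # pretend this is a large allocation

def outer():
    return inner()

sys.settrace(tracer)
outer()
sys.settrace(None)
# call_stacks now includes a stack ending in ("outer", "inner")
```

Fil pairs this kind of stack information with the allocation sizes captured by the preloaded shared library.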
For performance reasons, only the largest allocations are reported, with a minimum of 99% of allocated memory reported. The remaining <1% is highly unlikely to be relevant when trying to reduce usage; it's effectively noise.
In general, Fil will track allocations in threads correctly.
First, if you start a thread via Python, running Python code, that thread will get its own callstack for tracking who is responsible for a memory allocation.
Second, if you start a C thread, the calling Python code is considered responsible for any memory allocations in that thread. This works fine... except for thread pools. If you start a pool of threads that are not Python threads, the Python code that created those threads will be responsible for all allocations created during the thread pool's lifetime.
Therefore, in order to ensure correct memory tracking, Fil disables thread pools in BLAS (used by NumPy), BLOSC (used e.g. by Zarr), and OpenMP. They are all set to use a single thread, so calls run in the calling Python thread and everything is tracked correctly.
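The effect is similar to setting these libraries' standard thread-count environment variables before they are imported. This sketch uses the libraries' own documented knobs; Fil's actual mechanism is internal:

```python
import os

# Force common native thread pools down to a single thread. These must be
# set before numpy, numexpr, etc. are imported in order to take effect.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "MKL_NUM_THREADS", "BLOSC_NTHREADS"):
    os.environ[var] = "1"
```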
This has some costs: with the thread pools limited to a single thread, code that would otherwise run in parallel will run more slowly. Fil does this for the whole program when using fil-profile run. When using the Jupyter kernel, anything run with the %%filprofile magic will have thread pools disabled, but other code should run normally.
Fil will track memory allocated by the standard C memory APIs (malloc() and friends) as well as anonymous mmap().

Still not supported, but planned:
- File-backed mmap(). The semantics are somewhat different than normal allocations or anonymous mmap(), since the OS can swap it in or out from disk transparently, so supporting this will involve a different kind of resource usage and reporting.
- mmap()s created via /dev/zero (not common, since it's not cross-platform; e.g. macOS doesn't support this).
- memfd_create(), a Linux-only mechanism for creating in-memory files.
- Obscure allocation APIs like reallocarray(). These are all rarely used, as far as I can tell.
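The anonymous vs. file-backed distinction can be seen with Python's mmap module (a minimal sketch; the sizes are arbitrary):

```python
import mmap
import tempfile

# Anonymous mmap (fd of -1, no backing file): this is the kind Fil tracks.
anon = mmap.mmap(-1, 4096)
anon[:5] = b"hello"
kind_a = anon[:5]
anon.close()

# File-backed mmap: the OS can transparently page it in and out from disk,
# which is why it needs a different kind of reporting and isn't tracked yet.
with tempfile.TemporaryFile() as f:
    f.write(b"\0" * 4096)
    f.flush()
    backed = mmap.mmap(f.fileno(), 4096)
    backed[:4] = b"data"
    kind_b = backed[:4]
    backed.close()
```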
Copyright 2020 Hyphenated Enterprises LLC
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.