X-Git-Url: https://wannabe.guru.org/gitweb/?a=blobdiff_plain;f=src%2Fpyutils%2Fparallelize%2Fparallelize.py;h=515d431dea4e4fa8c7c4101a0573d4308990c0e5;hb=278d163705facc2276cd464414fb490ef6af50ab;hp=41d9093735d0c2e5b8d69195e19fa2c31c154b49;hpb=c256f84c53368730ee07c26dc29d3a66456501c0;p=pyutils.git

diff --git a/src/pyutils/parallelize/parallelize.py b/src/pyutils/parallelize/parallelize.py
index 41d9093..515d431 100644
--- a/src/pyutils/parallelize/parallelize.py
+++ b/src/pyutils/parallelize/parallelize.py
@@ -2,44 +2,57 @@

 # © Copyright 2021-2022, Scott Gasch

-"""A decorator to help with dead simple parallelization. When decorated
+"""A decorator to help with simple parallelization. When decorated
 functions are invoked they execute on a background thread, process
 or remote machine depending on the style of decoration::

-    @parallelize # defaults to thread-mode
+    from pyutils.parallelize import parallelize as p
+
+    @p.parallelize # defaults to thread-mode
     def my_function(a, b, c) -> int:
         ...do some slow / expensive work, e.g., an http request

-    @parallelize(method=Method.PROCESS)
+    @p.parallelize(method=Method.PROCESS)
     def my_other_function(d, e, f) -> str:
         ...do more really expensive work, e.g., a network read

-    @parallelize(method=Method.REMOTE)
+    @p.parallelize(method=Method.REMOTE)
     def my_other_other_function(g, h) -> int:
         ...this work will be distributed to a remote machine pool

 This will just work out of the box with `Method.THREAD` (the default)
-and `Method.PROCESS` but in otder to use `Method.REMOTE` you need to
+and `Method.PROCESS` but in order to use `Method.REMOTE` you need to
 do some setup work:

-    1. To use this stuff you need to hook into :mod:`pyutils.config`
-       so that this code can see commandline arguments.
+    1. To use `@parallelize(method=Method.REMOTE)` with your code you
+       need to hook your code into :mod:`pyutils.config` to enable
+       commandline flags from `pyutils` files.
You can do this by
+       either wrapping your main entry point with the
+       :meth:`pyutils.bootstrap.initialize` decorator or just calling
+       `config.parse()` early in your program. See instructions in
+       :mod:`pyutils.bootstrap` and :mod:`pyutils.config` for
+       more information.

     2. You need to create and configure a pool of worker machines.
        All of these machines should run the same version of Python,
        ideally in a virtual environment (venv) with the same
-       Python dependencies installed. Different versions of code
-       or of the interpreter itself can cause issues with running
-       cloudpicked code.
+       Python dependencies installed. See: https://docs.python.org/3/library/venv.html
+
+       .. warning::

-    3. You need an account that can ssh into any / all of these
-       machines non-interactively and run Python in the aforementioned
-       virtual environment. This likely means setting up ssh with
-       key-based authentication.
+           Different versions of code, libraries, or of the interpreter
+           itself can cause issues with running cloudpickled code.
+
+    3. You need an account that can ssh into any / all of these pool
+       machines non-interactively to perform tasks such as copying
+       code to the worker machine and running Python in the
+       aforementioned virtual environment. This likely means setting
+       up `ssh` / `scp` with key-based authentication.
+       See: https://www.digitalocean.com/community/tutorials/how-to-set-up-ssh-keys-2

     4. You need to tell this parallelization framework about the pool
-       of machines where it can dispatch work by creating a JSON based
-       configuration file. The location of this file defaults to
+       of machines you created by editing a JSON-based configuration
+       file. The location of this file defaults to
        :file:`.remote_worker_records` in your home directory but can
        be overridden via the `--remote_worker_records_file`
        commandline argument. An example JSON configuration `can be
@@ -47,10 +60,17 @@ do some setup work:
        `_.

     5.
Finally, you will also need to tell the
-       :class:`executors.RemoteExecutor` how to
-       invoke the :file:`remote_worker.py` on remote machines by passing its path
-       on remote worker machines in your setup via the
-       `--remote_worker_helper_path` commandline flag.
+       :class:`pyutils.parallelize.executors.RemoteExecutor` how to
+       invoke the :file:`remote_worker.py` on remote machines by
+       passing its path on remote worker machines in your setup via
+       the `--remote_worker_helper_path` commandline flag (or,
+       honestly, if you made it this far, just update the default in
+       this code -- find `executors.py` under `site-packages` in your
+       virtual environment and update the default value of the
+       `--remote_worker_helper_path` flag).
+
+    If you're trying to set this up and struggling, email me at
+    scott.gasch@gmail.com. I'm happy to help.

 """

@@ -76,44 +96,49 @@ def parallelize(
     Sample usage::

-        @parallelize # defaults to thread-mode
+        from pyutils.parallelize import parallelize as p
+
+        @p.parallelize # defaults to thread-mode
         def my_function(a, b, c) -> int:
             ...do some slow / expensive work, e.g., an http request

-        @parallelize(method=Method.PROCESS)
+        @p.parallelize(method=Method.PROCESS)
         def my_other_function(d, e, f) -> str:
             ...do more really expensive work, e.g., a network read

-        @parallelize(method=Method.REMOTE)
+        @p.parallelize(method=Method.REMOTE)
         def my_other_other_function(g, h) -> int:
             ...this work will be distributed to a remote machine pool

-    This decorator will invoke the wrapped function on::
+    This decorator will invoke the wrapped function on:

-        Method.THREAD (default): a background thread
-        Method.PROCESS: a background process
-        Method.REMOTE: a process on a remote host
+    - `Method.THREAD` (default): a background thread
+    - `Method.PROCESS`: a background process
+    - `Method.REMOTE`: a process on a remote host

     The wrapped function returns immediately with a value that is
-    wrapped in a :class:`SmartFuture`.
This value will block if it is
-    either read directly (via a call to :meth:`_resolve`) or
-    indirectly (by using the result in an expression, printing it,
-    hashing it, passing it a function argument, etc...). See comments
-    on :class:`SmartFuture` for details. The value can be safely
-    stored (without hashing) or passed as an argument without causing
-    it to block waiting on a result. There are some convenience
-    methods for dealing with collections of :class:`SmartFuture`
-    objects defined in :file:`smart_future.py`, namely
-    :meth:`smart_future.wait_any` and :meth:`smart_future.wait_all`.
+    wrapped in a
+    :class:`pyutils.parallelize.smart_future.SmartFuture`. This value
+    will block if it is either read directly (via a call to
+    :meth:`_resolve`) or indirectly (by using the result in an
+    expression, printing it, hashing it, passing it as a function
+    argument, etc...). See comments on :class:`SmartFuture` for
+    details. The value can be safely stored (without hashing) or
+    passed as an argument without causing it to block waiting on a
+    result. There are some convenience methods for dealing with
+    collections of :class:`SmartFuture` objects defined in
+    :file:`smart_future.py`, namely
+    :meth:`pyutils.parallelize.smart_future.wait_any` and
+    :meth:`pyutils.parallelize.smart_future.wait_all`.

     .. warning::
         You may stack @parallelized methods and it will "work". That
         said, having multiple layers of :code:`Method.PROCESS` or
-        :code:`Method.REMOTE` will prove to be problematic because each process in
-        the stack will use its own independent pool which may overload
-        your machine with processes or your network with remote processes
-        beyond the control mechanisms built into one instance of the pool.
-        Be careful.
+ :code:`Method.REMOTE` will prove to be problematic because each + process in the stack will use its own independent pool which will + likely overload your machine with processes or your network with + remote processes beyond the control mechanisms built into one + instance of the pool. Be careful. .. warning:: There is non-trivial overhead of pickling code and
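The thread-mode behavior the docstring describes (the decorated call returns immediately; only using the result blocks) can be sketched with the standard library alone. This is a hand-rolled stand-in built on `concurrent.futures`, not the pyutils implementation: a plain `Future` lacks `SmartFuture`'s transparent resolution, so results are read explicitly with `.result()`, and `wait()` stands in for `smart_future.wait_all`.

```python
# Minimal stand-in for thread-mode @parallelize using the standard
# library. Illustrative only; the real decorator returns a SmartFuture
# that resolves transparently when its value is used.
import functools
from concurrent.futures import Future, ThreadPoolExecutor, wait

_POOL = ThreadPoolExecutor(max_workers=4)

def parallelize_sketch(func):
    """Run the wrapped function on a background thread; return a Future."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> Future:
        return _POOL.submit(func, *args, **kwargs)
    return wrapper

@parallelize_sketch
def square(x: int) -> int:
    return x * x

futures = [square(n) for n in range(5)]  # returns immediately, no blocking
wait(futures)                            # analogous to smart_future.wait_all
results = sorted(f.result() for f in futures)
print(results)  # [0, 1, 4, 9, 16]
```

Stacking such decorators compounds pools the same way the warning above describes: each layer submits into its own executor with no shared throttling.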