Package namespacing for Python library collection


A practical guide on how to manage a collection of code snippets as a single, easy to maintain library collection. I will utilise Python package namespacing, managed in a Git mono-repository.

Are you, your team or organisation suffering from constant copying of fragments of code among projects? Do they diverge quickly? Are you already finding an increasing need to reuse snippets of code in multiple places as your project/team/business grows? Then this article is for you. I will primarily focus on a mono-repo library collection but I will present other technologies as well. Hopefully, it will make choosing the most appropriate one for your situation easier.

A library collection (hereafter "library") is a set of code pieces wrapped in packages (hereafter "sub-packages") that are individually distributable and can be reused in multiple projects.

TL;DR: If you’re here just for an example, have a look at an example Git repository.

Typical candidates for sub-packaging:

Constants, schemas, company policies
Utility functions extending libraries
Development tools

Motivation

The ultimate drive behind “librarification” is lowering the maintenance cost, which is affected by several closely related properties.

Breaking changes

A breaking change occurs when you make a backward-incompatible change, such as removing or renaming a function, a function parameter or a package, or changing a function's behaviour. Python packages frequently follow Semantic Versioning, and it is advisable to do the same for your own library.

To keep the maintenance cost as low as possible, you want to reduce the number of breaking changes.

Dependencies

The more code your library accumulates, the more likely it is for breaking changes to occur. But not every breaking change affects the whole library. Let’s say your library looks like this:

company_utils
├── __init__.py
├── constants.py     # only constants without any dependencies
├── logging_utils.py # depends only on the Python logging library
└── flask_utils.py   # several utility functions used with the Flask
                     # web framework and depends on importing it

constants.py contains only constants without any dependencies.
logging_utils.py depends only on the Python logging library.
flask_utils.py contains several utility functions used with the Flask web framework and depends on importing it.

If you need to remove an obsolete constant from company_utils.constants, you need to bump the major number of your library version, such as 1.2.4 -> 2.0.0. This will notify the user of the library “hey, you need to check what has changed and modify your code”. However, there is no need to modify the code if you don't use company_utils.constants. Maybe you just use company_utils.logging_utils and the change is not breaking for you.

In the example above I tried to illustrate how unnecessary breaking changes increase the maintenance cost.
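The same point can be put into a few lines of Python. This is only an illustrative sketch: the consumer names and their imports are made-up data, and the two helper functions are hypothetical.

```python
# Hypothetical illustration: who is affected by a breaking change?

# Modules that each consuming project imports from the library (made-up data).
CONSUMERS = {
    "billing-service": {"company_utils.constants", "company_utils.logging_utils"},
    "web-frontend": {"company_utils.flask_utils", "company_utils.logging_utils"},
    "batch-jobs": {"company_utils.logging_utils"},
}


def notified_consumers(consumers):
    """With one version for the whole library, a major bump reaches everyone."""
    return sorted(consumers)


def affected_consumers(changed_module, consumers):
    """Only projects importing the changed module actually need to act."""
    return sorted(
        name for name, imports in consumers.items() if changed_module in imports
    )


changed = "company_utils.constants"
print("must review the release:", notified_consumers(CONSUMERS))
print("actually affected:", affected_consumers(changed, CONSUMERS))
```

Every project that is notified but not affected pays a review cost for nothing.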

Domain separation

Multiple repositories

Building on top of the previous section, it may seem trivial to just split the different domains into separate packages hosted in individual repositories. However, this increases the maintenance cost again.

You now need to maintain all development tooling in multiple repositories. This may include CI configuration, linter settings, documentation, build scripts, etc. The actual code can be as small as a single file, so the cost of keeping multiple repositories up to date will likely outweigh any benefit gained from the split.

Extras

If multiple packages in individual repositories are not the answer, what about everything in a single repository? Popular packaging and distribution tools, such as setuptools, Pipenv and Poetry, allow declaring “extras”: optional features with their own dependencies. You could treat your library as a package and the sub-packages as extras.

You would install such a library as:

pip install "company_utils[logging_utils,constants]==2.0.0"

The library no longer brings in a number of unused dependencies. However, this approach still has several drawbacks:

The entire library uses a single version
Code is distributed even when not used
import company_utils.flask_utils will not show any errors in your IDE but will fail on execution because the flask dependency is not installed
Nothing prevents cross sub-package dependencies
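For reference, here is a minimal sketch of how such extras might be declared with setuptools. The dependency lists are assumptions made for illustration, and dependencies_for is a hypothetical helper that only mimics which requirements pip would add for the requested extras.

```python
# Sketch of a setup.py fragment that declares sub-packages as extras.
# The dependencies listed under each extra are assumptions for illustration.

EXTRAS = {
    "constants": [],           # no third-party dependencies
    "logging_utils": [],       # standard library only
    "flask_utils": ["flask"],  # Flask is pulled in only when requested
}

SETUP_KWARGS = {
    "name": "company_utils",
    "version": "2.0.0",              # a single version for the whole library
    "packages": ["company_utils"],   # all code ships, whatever extras are chosen
    "extras_require": EXTRAS,
}
# In a real setup.py these would be passed on: setuptools.setup(**SETUP_KWARGS)


def dependencies_for(requested_extras):
    """Hypothetical helper: requirements added for company_utils[<extras>]."""
    dependencies = []
    for extra in requested_extras:
        dependencies.extend(EXTRAS[extra])
    return dependencies


print(dependencies_for(["logging_utils", "constants"]))  # Flask is not included
print(dependencies_for(["flask_utils"]))
```

Note how the sketch makes the drawbacks visible: there is one version field, and the packages entry always covers the whole library.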

Package namespacing

Another option is to use package namespacing, namely native (implicit) namespace packages. Both setuptools and Poetry support package namespacing.

The documentation is vague on how namespacing helps and how to use it for multiple sub-packages. As it turns out, package namespacing is not designed to work in a single repository out of the box; attempting to do so results in a mono-repo. The key missing information is that each namespaced package needs its own build script, which must live outside of the package. This is tricky in a mono-repo because you cannot easily have multiple setup.py or pyproject.toml files in the same folder.
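To see what namespacing gives you at import time, here is a small self-contained sketch. It fakes two separately installed distributions with two temporary directories; the directory layout, the TIMEOUT_SECONDS constant and the get_name function are made up for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_distribution(root, sub_package, body):
    """Create <root>/company_utils/<sub_package>/__init__.py.

    There is deliberately no company_utils/__init__.py: its absence is what
    makes company_utils an implicit namespace package.
    """
    package_dir = Path(root) / "company_utils" / sub_package
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text(body)


# Two independent locations, standing in for two separately installed wheels.
dist_a = tempfile.mkdtemp()
dist_b = tempfile.mkdtemp()
make_distribution(dist_a, "constants", "TIMEOUT_SECONDS = 30\n")
make_distribution(dist_b, "logging_utils", "def get_name():\n    return 'company'\n")

sys.path[:0] = [dist_a, dist_b]
importlib.invalidate_caches()

import company_utils
from company_utils import constants, logging_utils

print(constants.TIMEOUT_SECONDS)
print(logging_utils.get_name())
print(len(list(company_utils.__path__)))  # the namespace spans both locations
```

Both sub-packages are importable under the shared company_utils prefix even though they live in different places and know nothing about each other.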

File structure examples

setuptools variant, alternative 1:

setup-constants.py       # Each setup-*.py must explicitly
setup-flask_utils.py     # include one sub-package
setup-logging_utils.py
company_utils/           # No __init__.py here.
├── constants/           # Sub-packages have __init__.py.
│   ├── __init__.py
│   └── constants.py
├── flask_utils/
│   ├── __init__.py
│   └── flask_utils.py
└── logging_utils/
    ├── __init__.py
    └── logging_utils.py

setuptools variant, alternative 2:

company_utils.constants/
├── setup.py # All setup.py differ only in the package name
└── src/
    └── company_utils/
        └── constants/
            ├── __init__.py
            └── constants.py
company_utils.flask_utils/
├── setup.py
└── src/
    └── company_utils/
        └── flask_utils/
            ├── __init__.py
            └── flask_utils.py
company_utils.logging_utils/
├── setup.py
└── src/
    └── company_utils/
        └── logging_utils/
            ├── __init__.py
            └── logging_utils.py

poetry variant:

company_utils.constants/
├── pyproject.toml
└── src/
    └── constants/
        ├── __init__.py
        └── constants.py
company_utils.flask_utils/
├── pyproject.toml
└── src/
    └── flask_utils/
        ├── __init__.py
        └── flask_utils.py
company_utils.logging_utils/
├── pyproject.toml
└── src/
    └── logging_utils/
        ├── __init__.py
        └── logging_utils.py

You can see that having multiple setup or pyproject files is ugly and increases maintenance cost by introducing duplication. A better solution is suggested in the next chapter.

Low maintenance namespacing solution

It’s time to tie together information from the previous chapters. We are aiming for a solution with constant maintenance cost, independent of the number of sub-packages. The resulting solution allows versioning and distribution of its sub-packages independently. Package namespacing provides an easy way to find them and import them.

To summarize the approach, we will replace duplication with iteration.

Build tools

For building the sub-packages, we will use setuptools as it offers more flexibility: setup.py is just a Python script. We will parametrize it to get rid of the need for multiple files. Additionally, we will capture each sub-package's requirements in a requirements.txt file and keep the version of each sub-package in __version__ of its src/<NAMESPACE>/<SUB-PACKAGE>/__init__.py file.

File structure:

setup.py
src/
└── company_utils/              # No __init__.py here.
    ├── constants/              # Sub-packages have __init__.py.
    │   ├── __init__.py
    │   ├── constants.py
    │   └── requirements.txt
    ├── flask_utils/
    │   ├── __init__.py
    │   ├── flask_utils.py
    │   └── requirements.txt
    └── logging_utils/
        ├── __init__.py
        ├── logging_utils.py
        └── requirements.txt

See example setup.py and requirements.txt files.
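A parametrized setup.py could look roughly like the sketch below. It is not the script from the example repository: the SUB_PACKAGE environment variable and the helper names read_version, read_requirements and setup_kwargs are assumptions chosen for illustration.

```python
# Sketch of a parametrized setup.py. The SUB_PACKAGE environment variable and
# all helper names are assumptions, not an established convention.
import os
import re
from pathlib import Path

NAMESPACE = "company_utils"
SRC_DIR = Path("src")
VERSION_PATTERN = re.compile(r"^__version__\s*=\s*['\"]([^'\"]+)['\"]", re.MULTILINE)


def read_version(init_file):
    """Extract the value of __version__ from a sub-package's __init__.py."""
    match = VERSION_PATTERN.search(Path(init_file).read_text())
    if match is None:
        raise RuntimeError(f"No __version__ found in {init_file}")
    return match.group(1)


def read_requirements(requirements_file):
    """Return requirement lines, skipping blank lines and comments."""
    lines = Path(requirements_file).read_text().splitlines()
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line and not line.startswith("#")]


def setup_kwargs(sub_package, src_dir=SRC_DIR, namespace=NAMESPACE):
    """Build the keyword arguments for setuptools.setup() for one sub-package."""
    package_dir = Path(src_dir) / namespace / sub_package
    return {
        "name": f"{namespace}.{sub_package}",
        "version": read_version(package_dir / "__init__.py"),
        "package_dir": {"": str(src_dir)},
        "packages": [f"{namespace}.{sub_package}"],
        "install_requires": read_requirements(package_dir / "requirements.txt"),
    }


if "SUB_PACKAGE" in os.environ:
    # For example: SUB_PACKAGE=constants python setup.py bdist_wheel
    from setuptools import setup

    setup(**setup_kwargs(os.environ["SUB_PACKAGE"]))
```

The explicit packages list covers only the top level of a sub-package. If sub-packages contain nested packages, setuptools.find_namespace_packages with a matching include filter would probably be a better fit.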

We will also need something to build all packages as setup.py builds only one at a time. Personally, I like to automate tasks with PyInvoke. But any Python/Bash/other script will do as well. See an example build task.
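Such a build script can be very small. The sketch below uses only the standard library instead of PyInvoke, and it assumes that setup.py selects the sub-package from a SUB_PACKAGE environment variable; both that convention and the function names are hypothetical.

```python
# Sketch of a "build everything" script. It assumes a setup.py that reads the
# sub-package name from a SUB_PACKAGE environment variable (hypothetical).
import os
import shutil
import subprocess
import sys
from pathlib import Path


def discover_sub_packages(namespace_dir):
    """Each directory with an __init__.py directly under the namespace counts."""
    return sorted(
        path.name
        for path in Path(namespace_dir).iterdir()
        if path.is_dir() and (path / "__init__.py").is_file()
    )


def build_all(namespace_dir="src/company_utils", dry_run=False):
    """Run one wheel build per sub-package and return the names processed."""
    built = []
    for name in discover_sub_packages(namespace_dir):
        command = [sys.executable, "setup.py", "bdist_wheel"]
        if dry_run:
            print(f"SUB_PACKAGE={name}", *command)
        else:
            # Clear leftovers so files from the previous build cannot leak
            # into the next wheel.
            shutil.rmtree("build", ignore_errors=True)
            env = dict(os.environ, SUB_PACKAGE=name)
            subprocess.run(command, env=env, check=True)
        built.append(name)
    return built
```

Calling build_all() from the repository root would then produce one wheel per sub-package, and build_all(dry_run=True) only prints the commands. This is the "replace duplication with iteration" idea in practice: adding a sub-package requires no change to the build script.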

With this setup, we can build all packages at once:

company_utils.constants-1.0.0-py3-none-any.whl
company_utils.flask_utils-1.0.0-py3-none-any.whl
company_utils.logging_utils-1.0.0-py3-none-any.whl

and push them to a package registry (PyPI, PackageCloud, etc.).

Local development

You may have noticed that having many requirements.txt files doesn't make local development developer-friendly. How are you going to install all those requirements to avoid import errors? And how will you keep them up to date?

Let us add another automation task to install all the src/<NAMESPACE>/<SUB-PACKAGE>/requirements.txt files. This task will install either all available dependencies or the dependencies of a selected sub-package. You can also see that pipenv is being called. I recommend using Pipenv or Poetry to manage your development dependencies and Python virtual environment.

What would this look like in practice? You would use pipenv sync -d or poetry install for dev dependencies, and pipenv run inv install_subpackage_dependencies or poetry run inv install_subpackage_dependencies for sub-package dependencies.
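The core of such a task might look like the sketch below. It is a simplified stand-in, not the task from the example repository: it calls pip of the currently active environment rather than pipenv, it omits the Invoke task wrapper to stay within the standard library, and the helper names are hypothetical.

```python
# Sketch of the logic behind an install_subpackage_dependencies task.
# Helper names are hypothetical; pip of the active environment is used.
import subprocess
import sys
from pathlib import Path


def requirement_files(namespace_dir, name=None):
    """Return requirements.txt paths for one sub-package, or all if name is None."""
    namespace_dir = Path(namespace_dir)
    if name is not None:
        candidates = [namespace_dir / name / "requirements.txt"]
    else:
        candidates = sorted(namespace_dir.glob("*/requirements.txt"))
    return [path for path in candidates if path.is_file()]


def pip_install_command(files):
    """Build one pip invocation that installs every given requirements file."""
    command = [sys.executable, "-m", "pip", "install"]
    for path in files:
        command += ["-r", str(path)]
    return command


def install_subpackage_dependencies(namespace_dir="src/company_utils", name=None):
    """Install the dependencies of one sub-package, or of all of them."""
    files = requirement_files(namespace_dir, name)
    if files:
        subprocess.run(pip_install_command(files), check=True)
    return files
```

Passing all files to a single pip invocation lets pip resolve them together, which should surface conflicting pins between sub-packages early.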

Continuous integration

Another problem you may have noticed is that installing all sub-package dependencies at once prevents tests from detecting imports of dependencies that belong to other sub-packages. For example, if you import flask in company_utils.constants, it will work locally but fail once the sub-package is installed on its own. Continuous integration (CI) comes to the rescue! The “cross-import” scenario should be rare. Therefore, you can leave it to fail in a CI pipeline instead and keep a lot of complexity out of the local development environment. CI will be the quality gate.

Hopefully, the CI solution of your choice allows parametrization of jobs (such as CircleCI matrix jobs). Each parameter in this case will be the name of a sub-package. Since you want to target specific sub-packages, it is also a good idea to split your tests into folders named after the sub-packages. Then the pipeline could look like:

# install pipenv
pipenv clean  # run if you cache dependencies
pipenv install --dev --deploy  # makes sure Pipfile.lock is up to date
pipenv run inv install_subpackage_dependencies --name ${sub_package}
# Only a single sub-package's dependencies are now present
# Run any tests you like

Note that there is a slight overhead in the matrix job on checking out the source code and figuring out if a pipeline for each library and Python version combination needs to run. If you have a large number of libraries, you could benefit from running all sub-packages in a loop of a single pipeline and just swapping dependencies with the install_subpackage_dependencies task. You will lose the isolation but gain speed in having only one setup step.

Versioning, semantic releases

As mentioned before, the version number is kept in __version__ of each src/<NAMESPACE>/<SUB-PACKAGE>/__init__.py file. If you are wondering how to keep a changelog or automate versioning with semantic releases, I will describe it in a future blog post. For now, you can have a look at these resources for inspiration:

Lerna: A JavaScript tool for managing projects with multiple packages
This example change log of a mono-repo project using Lerna

When not to choose this approach?

This article suggests a middle ground between individual repositories, which are difficult to maintain for tiny libraries, and a single large library that contains a lot of unnecessary dependencies. If your library is large or in most cases requires all of its dependencies, I would suggest the traditional approach of a single library in a single repository. Adding new libraries is easy, so it may be tempting to treat a namespaced library collection as a golden hammer.

Summary

Maintaining a collection of libraries can save a lot of development time. However, due to the lack of direct support in commonly used build tools, it also has a small upfront cost of developing your own tasks around it. Hopefully, this article has helped you to see whether the investment is worth the potential gains, or even to implement a similar solution on your own.

All the examples above have a working example in this Git repository.
