
Writing and unit testing a Python application to query the RPM database

source link: https://www.redhat.com/sysadmin/query-rpm-database-python

Write a Python program that prints a list of software installed on your system, then test whether the application behaves correctly.

Posted: December 1, 2021 | by Jose Vicente Nunez (Sudoer)

When you install software on a Linux system, your package manager keeps track of what's installed, what it depends on, what it provides, and much more.

The usual way to look at that metadata is through your package manager. On Fedora or Red Hat Enterprise Linux, that metadata lives in the RPM database.

The RPM database can be queried from the command line with the rpm command, which supports some very nice formatting options. For example, to get a list of all packages sorted by size, I can use a little bit of Bash glue to do the following:

$ rpm -qa --queryformat "%{NAME}-%{VERSION} %{SIZE}\n"|sort -r -k 2 -n
...
linux-firmware-20200421 256914779
conda-4.8.4 228202733
glibc-all-langpacks-2.29 217752640
docker-ce-cli-19.03.12 168609825
clang-libs-8.0.0 117688777
wireshark-cli-3.2.3 108810552
llvm-libs-8.0.0 108310728
docker-ce-19.03.12 106517368
ansible-2.9.9 102477070

The rpm command has many options, but what if I want to format the numbers in the output? Or be able to display the results with scrolling? Or integrate RPM output into other applications?
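
As a rough illustration of the first question, here is a minimal Python sketch that takes a few lines of the rpm output shown above (pasted in as a string, not read from a live system) and prints the sizes with thousands separators. The names SAMPLE_OUTPUT and parse_line are mine and exist only for this example.

```python
from typing import List, Tuple

# Sample lines copied from the rpm output above; a real script would read
# them from the rpm command or from the RPM bindings instead.
SAMPLE_OUTPUT = """\
linux-firmware-20200421 256914779
conda-4.8.4 228202733
glibc-all-langpacks-2.29 217752640
"""


def parse_line(line: str) -> Tuple[str, int]:
    """Split 'name-version size' into a (name-version, size) pair."""
    name_version, size = line.rsplit(maxsplit=1)
    return name_version, int(size)


packages: List[Tuple[str, int]] = [
    parse_line(line) for line in SAMPLE_OUTPUT.splitlines() if line.strip()
]
# Largest first, mirroring `sort -r -k 2 -n`
packages.sort(key=lambda pair: pair[1], reverse=True)

# The ',' in the format spec inserts thousands separators
formatted = [f"{name}: {size:,}" for name, size in packages]
for row in formatted:
    print(row)
```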

Once you've written a script to manage the query, what about unit testing the code? Bash is not exactly good at that.

This is where other languages like Python shine. In this tutorial, I demonstrate how to:

  • Interact with the RPM database using Python RPM bindings
  • Write a Python class to create a wrapper around the provided RPM bindings
  • Make the script easier to use and customize by calling argparse methods to parse the command-line interface
  • Unit test the new code with unittest

I'll cover a lot here, so basic knowledge of Python, including some object-oriented (OO) programming, helps. Even if you don't know much about OO programming, you should be able to follow how the Python class is built. (I'll leave out advanced features such as virtual classes and data classes.)

Also, you should know what the RPM database is. Even if you don't know much, the code is simple to follow, and the boilerplate code is small.

There are several chapters dedicated to interacting with the RPM database in the Fedora documentation. However, in this article, I'll write a simple program that prints a list of RPM packages sorted by size.

The RPM database and Python

Python is a great default language for system administrators, and the RPM database comes with bindings that make it easy to query with Python.

Install the python3-rpm package

For this tutorial to work, you need to install the python3-rpm package.

RPM has deep ties with the system. That's one of the reasons it's not offered through the pip command and the PyPI module repository.

Install the python3-rpm package with the dnf command instead:

$ sudo dnf install -y python3-rpm

Finally, clone the code for this tutorial:

$ git clone git@github.com:josevnz/tutorials.git
$ cd rpm_query

Create a Python virtual environment

Python has a feature called virtual environments. A virtual environment provides a sandbox location to:

  • Install different versions of libraries than the rest of the system.
  • Get isolation from bad or unstable libraries so that an issue with a library doesn't compromise your whole Python installation.
  • Isolate software for testing more easily before merging the feature branch into the main branch. (A continuous integration pipeline takes care of deploying new code in a test virtual environment.)

For this tutorial, use a virtual environment with the following features:

  • The environment is called rpm_query.
  • System packages are visible inside the environment because the required python3-rpm package is a system package that isn't available through pip. (You provide access to site packages using the --system-site-packages option.)

Create and activate it like this:

$ python3 -m venv --system-site-packages ~/virtualenv/rpm_query
$ . ~/virtualenv/rpm_query/bin/activate
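
If you want to confirm from Python that the environment is active, one common check compares sys.prefix with sys.base_prefix. This is only a small sketch; the function name in_virtualenv is mine and is not part of the tutorial's code.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print(f"Inside a virtual environment: {in_virtualenv()}")
    print(f"sys.prefix = {sys.prefix}")
```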

Code the Query RPM class

For this example application, I'll wrap the RPM instance in a context manager. This makes it easier to use without worrying about manually closing the RPM database. Also, I'll take a few shortcuts to return the result of the database query. Specifically, I'm limiting the number of results and managing sorting.

Putting all this functionality together in a class (a collection of data and methods) is what makes object orientation so useful. In this example, the RPM functionality is in a class called QueryHelper, and its purpose is:

  • To define filtering parameters like the number of items and sorting results at creation time (use a class constructor for that)
  • To provide a way to iterate through every package of the system

First, use the QueryHelper class to get a list of a maximum of five packages, sorted by size:

from reporter.rpm_query import QueryHelper
with QueryHelper(limit=5, sorted_val=True) as rpm_query:
    for package in rpm_query:
        print(f"{package['name']}-{package['version']}: {package['size']:,.0f}")

This uses a Python feature called keyword (named) arguments, which makes the class much easier to use.
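
The bare * in a constructor signature makes every parameter after it keyword-only, which is what forces callers to spell out names such as limit=5. The sketch below uses a hypothetical stand-in class (Settings) rather than QueryHelper so that it runs without the RPM bindings installed.

```python
class Settings:
    """Hypothetical stand-in used only to illustrate keyword-only arguments."""

    def __init__(self, *, limit: int = 10, sorted_val: bool = True):
        # Everything after the bare * must be passed by name
        self.limit = limit
        self.sorted_val = sorted_val


by_name = Settings(limit=5, sorted_val=False)  # explicit and readable
defaults = Settings()                          # default values apply

try:
    Settings(5, False)  # positional arguments are rejected with a TypeError
    positional_ok = True
except TypeError:
    positional_ok = False

print(by_name.limit, defaults.limit, positional_ok)
```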

What if you are happy with the default arguments? Not a problem:

from reporter.rpm_query import QueryHelper
with QueryHelper() as rpm_query:
    for package in rpm_query:
        print(f"{package['name']}-{package['version']}: {package['size']:,.0f}")

Here is how to implement it:

  • The code that imports the rpm module is in a try/except block because the module may not be installed. If the import fails, it does so gracefully, explaining that python3-rpm needs to be installed.
  • The __get__ function takes care of returning the results sorted or as-is. The code that queries the database calls it with the query results.
  • The customization magic happens in the constructor of the QueryHelper class, with named parameters.
  • The QueryHelper.__enter__ method runs the query against the database and yields the matching packages.

"""
Wrapper around RPM database
"""
import sys
from typing import Any

try:
    import rpm
except ModuleNotFoundError:
    print((
        "You must install the following package:\n"
        "sudo dnf install -y python3-rpm\n"
        "'rpm' doesn't come as a pip but as a system dependency.\n"
    ), file=sys.stderr)
    raise

def __get__(is_sorted: bool, dbMatch: Any) -> Any:
    """
    If is_sorted is true then sort the results by item size in bytes, otherwise
    return 'as-is'
    :param is_sorted:
    :param dbMatch:
    :return:
    """
    if is_sorted:
        return sorted(
            dbMatch,
            key=lambda item: item['size'], reverse=True)
    return dbMatch

class QueryHelper:
    MAX_NUMBER_OF_RESULTS = 10_000

    def __init__(self, *, limit: int = MAX_NUMBER_OF_RESULTS, name: str = None, sorted_val: bool = True):
        """
        :param limit: How many results to return
        :param name: Filter by package name, if any
        :param sorted_val: Sort results
        """
        self.ts = rpm.TransactionSet()
        self.name = name
        self.limit = limit
        self.sorted = sorted_val

    def __enter__(self):
        """
        Returns list of items on the RPM database
        :return:
        """
        if self.name:
            db = self.db = self.ts.dbMatch("name", self.name)
        else:
            db = self.db = self.ts.dbMatch()
        count = 0
        for package in __get__(self.sorted, db):
            if count >= self.limit:
                break
            yield package
            count += 1

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.ts.closeDB()

A few things to note:

  • You can reuse this search logic not just in the command-line interface (CLI) application but also in future GUI or REST services because functionality lives on a well-defined unit (data + actions).
  • Did you notice the type hints on the constructor arguments (for example, limit: int says that limit is an integer)? The interpreter doesn't enforce them, but they help an integrated development environment (IDE) like PyCharm provide auto-completion for you and your users. Python doesn't require type hints, but using them is a good practice.
  • It may be good to provide default values to parameters. That can make a class easier to use if the developer decides not to override the defaults (in other words, makes the argument "optional").
  • It is also a good practice to document methods, even briefly. The IDE shows this detail when using the methods.
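
One more design point about the class above: because __enter__ contains yield, the with statement binds a generator to rpm_query, and the query only runs once you start iterating. A variation, sketched here under my own assumptions, returns self from __enter__ and moves the iteration into __iter__. To stay runnable without python3-rpm, it iterates over a hypothetical in-memory list of dictionaries (FAKE_PACKAGES, with made-up sizes) instead of a real rpm.TransactionSet.

```python
from itertools import islice
from typing import Any, Dict, Iterator, List

# Hypothetical stand-in data; the real class reads headers from the RPM database
FAKE_PACKAGES: List[Dict[str, Any]] = [
    {"name": "bash", "version": "5.1", "size": 7_000_000},
    {"name": "glibc", "version": "2.32", "size": 18_000_000},
    {"name": "zlib", "version": "1.2.11", "size": 200_000},
]


class ListQueryHelper:
    """Sketch of the same idea: sort, cap the results, clean up on exit."""

    def __init__(self, *, limit: int = 10_000, sorted_val: bool = True):
        self.limit = limit
        self.sorted = sorted_val
        self.closed = False

    def __enter__(self) -> "ListQueryHelper":
        return self

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        rows = FAKE_PACKAGES
        if self.sorted:
            rows = sorted(rows, key=lambda item: item["size"], reverse=True)
        # islice stops after `limit` items without a manual counter
        return islice(iter(rows), self.limit)

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        # The real class would call closeDB() on its transaction set here
        self.closed = True


with ListQueryHelper(limit=2) as query:
    names = [package["name"] for package in query]
print(names)
```

Whether this shape is preferable depends on taste; the generator version in the tutorial is shorter, while this one keeps the context manager and the iteration as separate responsibilities.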

Conduct unit testing with unittest

Here's an important question: How good is this code without testing?

It's a good idea to automate testing code. Unit testing differs from other types of testing in that it checks small pieces of code in isolation. Overall, it makes an application more robust because it ensures even minor components behave correctly (and it works best when it's run after every change).

In this case, a Python unittest test case is nothing more than a class that automates checking whether a function behaves correctly.

A good unit test:

  • Is small: If a method is testing more than one feature, it should probably be split into more tests.
  • Is self-contained: Ideally, it should not have external dependencies like databases or application servers. Sometimes, you can replace those dependencies with mock objects that can mimic a part of the system and even allow simulated failures.
  • Is repeatable: The result of running the test should be the same. For that, it may be necessary to clean before and after running a test (setUp(), tearDown()).
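
To make the repeatable point concrete, here is a small sketch of setUp() and tearDown() at work. It is unrelated to the RPM code: each test gets a fresh temporary file, so no test depends on what another one left behind. The class name and file contents are illustrative only.

```python
import os
import tempfile
import unittest


class ScratchFileTestCase(unittest.TestCase):
    """Each test gets a fresh temporary file, so runs do not affect each other."""

    def setUp(self):
        # Runs before every test method
        handle, self.path = tempfile.mkstemp(suffix=".txt")
        os.close(handle)

    def tearDown(self):
        # Runs after every test method, whether the test passed or failed
        os.remove(self.path)

    def test_starts_empty(self):
        self.assertEqual(0, os.path.getsize(self.path))

    def test_write_then_read(self):
        with open(self.path, "w", encoding="utf-8") as scratch:
            scratch.write("bash 7000000\n")
        with open(self.path, encoding="utf-8") as scratch:
            self.assertEqual("bash 7000000\n", scratch.read())


# Run the tests programmatically so the outcome can be inspected
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ScratchFileTestCase)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```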

I wrote a unit test for the reporter.rpm_query.QueryHelper class:

  • The code has several assertions that must be true to pass the test case.
  • I do not use mock objects because the RPM framework is so ingrained into Fedora and Red Hat that the real database is guaranteed to be there. Also, the tests are non-destructive.

"""
Unit tests for the QueryHelper class
How to write unit tests: https://docs.python.org/3/library/unittest.html
"""
import os
import unittest
from reporter.rpm_query import QueryHelper

DEBUG = bool(os.getenv("DEBUG_RPM_QUERY"))

class QueryHelperTestCase(unittest.TestCase):

    def test_default(self):
        with QueryHelper() as rpm_query:
            for package in rpm_query:
                self.assertIn('name', package, "Could not get 'name' in package?")

    def test_get_unsorted_counted_packages(self):
        """
        Test retrieval of unsorted counted packages
        :return:
        """
        LIMIT = 10
        with QueryHelper(limit=LIMIT, sorted_val=False) as rpm_query:
            count = 0
            for package in rpm_query:
                count += 1
                self.assertIn('name', package, "Could not get 'name' in package?")
            self.assertEqual(LIMIT, count, f"Limit ({count}) did not work!")

    def test_get_all_packages(self):
        """
        Default query is all packages, sorted by size
        :return:
        """
        with QueryHelper() as rpm_query:
            previous_size = 0
            previous_package = None
            for package in rpm_query:
                size = package['size']
                if DEBUG:
                    print(f"name={package['name']} ({size}) bytes")
                self.assertIn('name', package, "Could not get 'name' in package?")
                if previous_size > 0:
                    self.assertGreaterEqual(
                        previous_size,
                        size,
                        f"Returned entries not sorted by size in bytes ({previous_package}, {package['name']})!")
                # Remember this package so the next iteration can compare against it
                previous_size = size
                previous_package = package['name']

    def test_get_named_package(self):
        """
        Test named queries
        :return:
        """
        package_name = "glibc-common"
        with QueryHelper(name=package_name, limit=1) as rpm_query:
            found = 0
            for package in rpm_query:
                self.assertIn('name', package, "Could not get 'name' in package?")
                if DEBUG:
                    print(f"name={package['name']}, version={package['version']}")
                found += 1
        self.assertGreater(found, 0, f"Could not find a single package with name {package_name}")

if __name__ == '__main__':
    unittest.main()

I strongly recommend reading the official unittest documentation for more details, as there's so much more to unit testing than this simple code conveys. For example, there is mock testing, which is particularly useful for complex system-dependency scenarios.
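
As a taste of mock testing, the sketch below replaces the database with a unittest.mock.Mock. The function under test (largest_package) and the fake header dictionaries are hypothetical and exist only for this example. The mock's dbMatch attribute merely mirrors the method name used earlier and never touches the real rpm module.

```python
import unittest
from unittest import mock


def largest_package(transaction_set) -> str:
    """Hypothetical helper: name of the biggest package in a transaction set."""
    return max(transaction_set.dbMatch(), key=lambda item: item["size"])["name"]


class LargestPackageTestCase(unittest.TestCase):

    def test_picks_biggest(self):
        fake_ts = mock.Mock()
        fake_ts.dbMatch.return_value = [
            {"name": "small", "size": 10},
            {"name": "big", "size": 99},
        ]
        self.assertEqual("big", largest_package(fake_ts))
        fake_ts.dbMatch.assert_called_once_with()

    def test_simulated_failure(self):
        # side_effect makes the mock raise, simulating a broken database
        fake_ts = mock.Mock()
        fake_ts.dbMatch.side_effect = OSError("cannot open database")
        with self.assertRaises(OSError):
            largest_package(fake_ts)


# Run the tests programmatically so the outcome can be inspected
suite = unittest.defaultTestLoader.loadTestsFromTestCase(LargestPackageTestCase)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```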

Improve the script with argparse

You can write a small CLI using the QueryHelper class created earlier, and to make it easier to customize, use argparse.

The argparse module allows you to:

  • Validate user input and restrict the number of options: For example, to check that the number of results is a positive integer, you can write a type validator in the reporter/__init__.py module. You could also do it in the class constructor (but I wanted to show this feature). You can use it to add extra logic not present in the original code:
def __is_valid_limit__(limit: str) -> int:
    # int() raises ValueError for non-numeric input; argparse reports it as a usage error
    int_limit = int(limit)
    if int_limit <= 0:
        raise ValueError(f"Invalid limit!: {limit}")
    return int_limit
  • Combine multiple options to make the program easier to use. I can even mark some required if needed.
  • Provide contextualized help on the program usage (help=, --help flag).
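
To see validation and help text working together, here is a self-contained sketch. The validator name positive_int and the parser below are assumptions made for this example, not the tutorial's code. Raising argparse.ArgumentTypeError lets argparse show your own error message to the user.

```python
import argparse


def positive_int(value: str) -> int:
    """Hypothetical validator: accept only whole numbers greater than zero."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"not a whole number: {value!r}")
    if number <= 0:
        raise argparse.ArgumentTypeError(f"must be greater than zero: {value!r}")
    return number


parser = argparse.ArgumentParser(description="Validator demo")
parser.add_argument("--limit", type=positive_int, default=10,
                    help="Maximum number of results (default: %(default)s)")

# parse_args accepts an explicit list, which is handy for experiments and tests
good = parser.parse_args(["--limit", "20"])
print(good.limit)

try:
    parser.parse_args(["--limit", "0"])
    rejected = False
except SystemExit:
    # argparse prints a usage message to stderr and exits on invalid input
    rejected = True
print(rejected)
```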

Using this class to query the RPM database becomes very easy by parsing the options and then calling QueryHelper:

#!/usr/bin/env python
"""
# rpmq_simple.py - A simple CLI to query RPM sizes on your system
Author: Jose Vicente Nunez
"""
import argparse
import textwrap

from reporter import __is_valid_limit__
from reporter.rpm_query import QueryHelper

if __name__ == "__main__":

    parser = argparse.ArgumentParser(description=textwrap.dedent(__doc__))
    parser.add_argument(
        "--limit",
        type=__is_valid_limit__, # Custom limit validator
        action="store",
        default=QueryHelper.MAX_NUMBER_OF_RESULTS,
        help="By default results are unlimited but you can cap the results"
    )
    parser.add_argument(
        "--name",
        type=str,
        action="store",
        help="You can filter by a package name."
    )
    parser.add_argument(
        "--sort",
        action="store_false",
        help="Sorted results are enabled by default, but you can turn them off with this flag"
    )
    args = parser.parse_args()

    with QueryHelper(
        name=args.name,
        limit=args.limit,
        sorted_val=args.sort
    ) as rpm_query:
        current = 0
        for package in rpm_query:
            if current >= args.limit:
                break
            print(f"{package['name']}-{package['version']}: {package['size']:,.0f}")
            current += 1
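
A note on the --sort option above: with action="store_false", the attribute defaults to True, and passing the flag sets it to False. The following sketch shows that behavior in isolation, alongside argparse.BooleanOptionalAction (available in Python 3.9 and later) as one possible alternative. Both parsers are throwaway examples, not part of the tutorial's code.

```python
import argparse

# Same shape as the tutorial's option: giving the flag switches sorting off
parser = argparse.ArgumentParser()
parser.add_argument("--sort", action="store_false")

without_flag = parser.parse_args([]).sort        # no flag: default applies
with_flag = parser.parse_args(["--sort"]).sort   # flag given

# Alternative sketch: generates both --sort and --no-sort (Python 3.9+)
alt_parser = argparse.ArgumentParser()
alt_parser.add_argument("--sort", action=argparse.BooleanOptionalAction, default=True)

alt_default = alt_parser.parse_args([]).sort
alt_disabled = alt_parser.parse_args(["--no-sort"]).sort

print(without_flag, with_flag, alt_default, alt_disabled)
```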

So how does the output look now? Ask for all installed RPMs, sorted and limited to the first 20 entries:

$ rpmq_simple.py --limit 20
linux-firmware-20210818: 395,099,476
code-1.61.2: 303,882,220
brave-browser-1.31.87: 293,857,731
libreoffice-core-7.0.6.2: 287,370,064
thunderbird-91.1.0: 271,239,962
firefox-92.0: 266,349,777
glibc-all-langpacks-2.32: 227,552,812
mysql-workbench-community-8.0.23: 190,641,403
java-11-openjdk-headless-11.0.13.0.8: 179,469,639
iwl7260-firmware-25.30.13.0: 148,167,043
docker-ce-cli-20.10.10: 145,890,250
google-noto-sans-cjk-ttc-fonts-20190416: 136,611,853
containerd.io-1.4.11: 113,368,911
ansible-2.9.25: 101,106,247
docker-ce-20.10.10: 100,134,880
ibus-1.5.23: 90,840,441
llvm-libs-11.0.0: 87,593,600
gcc-10.3.1: 84,899,923
cldr-emoji-annotation-38: 80,832,870
kernel-core-5.14.12: 79,447,964

What I've covered

There's a lot of information in this article. I covered:

  • Writing modules and classes to interact with the RPM database. This article barely scratches the surface of all the things you can do with the API.
  • Automatically testing small pieces of the application with unittest.
  • Using argparse to make an application easier to work with.

In my next article, I'll explore packaging a Python application so you can install it on another machine.
