

A Roadmap to XML Parsers in Python
source link: https://realpython.com/python-xml-parser/

Choose the Right XML Parsing Model
It turns out that you can process XML documents using a few language-agnostic strategies. Each demonstrates different memory and speed trade-offs, which can partially justify the wide range of XML parsers available in Python. In the following section, you’ll find out their differences and strengths.
Document Object Model (DOM)
Historically, the first and the most widespread model for parsing XML has been the DOM, or the Document Object Model, originally defined by the World Wide Web Consortium (W3C). You might have already heard about the DOM because web browsers expose a DOM interface through JavaScript to let you manipulate the HTML code of your websites. Both XML and HTML belong to the same family of markup languages, which makes parsing XML with the DOM possible.
The DOM is arguably the most straightforward and versatile model to use. It defines a handful of standard operations for traversing and modifying document elements arranged in a hierarchy of objects. An abstract representation of the entire document tree is stored in memory, giving you random access to the individual elements.
While the DOM tree allows for fast and omnidirectional navigation, building its abstract representation in the first place can be time-consuming. Moreover, the XML gets parsed at once, as a whole, so it has to be reasonably small to fit the available memory. This renders the DOM suitable only for moderately large configuration files rather than multi-gigabyte XML databases.
Use a DOM parser when convenience is more important than processing time and when memory is not an issue. Some typical use cases are when you need to parse a relatively small document or when you only need to do the parsing infrequently.
Simple API for XML (SAX)
To address the shortcomings of the DOM, the Java community came up with a library through a collaborative effort, which then became an alternative model for parsing XML in other languages. There was no formal specification, only organic discussions on a mailing list. The end result was an event-based streaming API that operates sequentially on individual elements rather than the whole tree.
Elements are processed from top to bottom in the same order they appear in the document. The parser triggers user-defined callbacks to handle specific XML nodes as it finds them in the document. This approach is known as “push” parsing because elements are pushed to your functions by the parser.
SAX also lets you discard elements if you’re not interested in them. This means it has a much lower memory footprint than DOM and can deal with arbitrarily large files, which is great for single-pass processing such as indexing, conversion to other formats, and so on.
However, finding or modifying random tree nodes is cumbersome because it usually requires multiple passes on the document and tracking the visited nodes. SAX is also inconvenient for handling deeply nested elements. Finally, the SAX model just allows for read-only parsing.
In short, SAX is cheap in terms of space and time but more difficult to use than DOM in most cases. It works well for parsing very large documents or parsing incoming XML data in real time.
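As a rough illustration of the push model, the sketch below tallies tag names with Python's xml.sax package while keeping no elements in memory. The TagCounter class, the count_tags() helper, and the sample markup are hypothetical names made up for this example:

```python
import xml.sax
from collections import Counter
from xml.sax.handler import ContentHandler


class TagCounter(ContentHandler):
    """Hypothetical handler that tallies tag names and stores no elements."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def startElement(self, name, attrs):
        # The parser pushes every opening tag to this callback in document order.
        self.counts[name] += 1


def count_tags(xml_bytes):
    """Parse the bytes in a single pass and return the tag tally."""
    handler = TagCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.counts


sample = b"<svg><g><ellipse/><ellipse/></g><text>Hi</text></svg>"
counts = count_tags(sample)
print(counts["ellipse"])  # 2
```

Because the handler only keeps a counter, its memory use depends on the number of distinct tag names rather than on the size of the document.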
Streaming API for XML (StAX)
Although somewhat less popular in Python, this third approach to parsing XML builds on top of SAX. It extends the idea of streaming but uses a “pull” parsing model instead, which gives you more control. You can think of StAX as an iterator advancing a cursor object through an XML document, where custom handlers call the parser on demand and not the other way around.
Note: It’s possible to combine more than one XML parsing model. For example, you can use SAX or StAX to quickly find an interesting piece of data in the document and then build a DOM representation of only that particular branch in memory.
Using StAX gives you more control over the parsing process and allows for more convenient state management. The events in the stream are only consumed when requested, enabling lazy evaluation. Other than that, its performance should be on par with SAX, depending on the parser implementation.
Learn About XML Parsers in Python’s Standard Library
In this section, you’ll take a look at Python’s built-in XML parsers, which are available to you in nearly every Python distribution. You’re going to compare those parsers against a sample Scalable Vector Graphics (SVG) image, which is an XML-based format. By processing the same document with different parsers, you’ll be able to choose the one that suits you best.
The sample image, which you’re about to save in a local file for reference, depicts a smiley face. It consists of the following XML content:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd" [
<!ENTITY custom_entity "Hello">
]>
<svg xmlns="http://www.w3.org/2000/svg"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
viewBox="-105 -100 210 270" width="210" height="270">
<inkscape:custom x="42" inkscape:z="555">Some value</inkscape:custom>
<defs>
<linearGradient id="skin" x1="0" x2="0" y1="0" y2="1">
<stop offset="0%" stop-color="yellow" stop-opacity="1.0"/>
<stop offset="75%" stop-color="gold" stop-opacity="1.0"/>
<stop offset="100%" stop-color="orange" stop-opacity="1"/>
</linearGradient>
</defs>
<g id="smiley" inkscape:groupmode="layer" inkscape:label="Smiley">
<!-- Head -->
<circle cx="0" cy="0" r="50"
fill="url(#skin)" stroke="orange" stroke-width="2"/>
<!-- Eyes -->
<ellipse cx="-20" cy="-10" rx="6" ry="8" fill="black" stroke="none"/>
<ellipse cx="20" cy="-10" rx="6" ry="8" fill="black" stroke="none"/>
<!-- Mouth -->
<path d="M-20 20 A25 25 0 0 0 20 20"
fill="white" stroke="black" stroke-width="3"/>
</g>
<text x="-40" y="75">&custom_entity; &lt;svg&gt;!</text>
<script>
<![CDATA[
console.log("CDATA disables XML parsing: <svg>")
const smiley = document.getElementById("smiley")
const eyes = document.querySelectorAll("ellipse")
const setRadius = r => e => eyes.forEach(x => x.setAttribute("ry", r))
smiley.addEventListener("mouseenter", setRadius(2))
smiley.addEventListener("mouseleave", setRadius(8))
]]>
</script>
</svg>
It starts with an XML declaration, followed by a Document Type Definition (DTD) and the <svg> root element. The DTD is optional, but it can help validate your document structure if you decide to use an XML validator. The root element specifies the default namespace xmlns as well as a prefixed namespace xmlns:inkscape for editor-specific elements and attributes. The document also contains:
- Nested elements
- Attributes
- Comments
- Character data (CDATA)
- Predefined and custom entities
Go ahead, save the XML in a file named smiley.svg, and open it using a modern web browser, which will run the JavaScript snippet present at the end.
The code adds an interactive component to the image. When you hover the mouse over the smiley face, it blinks its eyes. If you want to edit the smiley face using a convenient graphical user interface (GUI), then you can open the file using a vector graphics editor such as Adobe Illustrator or Inkscape.
Note: Unlike JSON or YAML, some features of XML can be exploited by hackers. The standard XML parsers available in the xml package in Python are insecure and vulnerable to an array of attacks. To safely parse XML documents from an untrusted source, prefer secure alternatives. You can jump to the last section in this tutorial for more details.
It’s worth noting that Python’s standard library defines abstract interfaces for parsing XML documents while letting you supply concrete parser implementation. In practice, you rarely do that because Python bundles a binding for the Expat library, which is a widely used open-source XML parser written in C. All of the following Python modules in the standard library use Expat under the hood by default.
Unfortunately, while the Expat parser can tell you if your document is well-formed, it can’t validate the structure of your documents against an XML Schema Definition (XSD) or a Document Type Definition (DTD). For that, you’ll have to use one of the third-party libraries discussed later.
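To make the distinction between well-formed and valid more concrete, here is a minimal sketch that talks to Expat directly through the xml.parsers.expat module. The is_well_formed() helper and the sample strings are hypothetical and exist only for illustration:

```python
from xml.parsers.expat import ExpatError, ParserCreate


def is_well_formed(xml_text):
    """Return True if Expat parses the text without a syntax error."""
    parser = ParserCreate()
    try:
        parser.Parse(xml_text, True)  # True marks the final chunk of data
    except ExpatError:
        return False
    return True


well_formed = "<svg><g></g></svg>"
broken = "<svg><g></svg>"  # The <g> element is never closed

# Well-formed, yet invalid: the DTD only allows <to> inside <note>.
invalid_but_well_formed = (
    "<!DOCTYPE note [<!ELEMENT note (to)>]><note><from/></note>"
)

print(is_well_formed(well_formed))
print(is_well_formed(broken))
print(is_well_formed(invalid_but_well_formed))
```

The last document passes the check even though it contradicts its own DTD, which is the gap that a validating third-party parser would close.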
xml.dom.minidom: Minimal DOM Implementation
Considering that parsing XML documents using the DOM is arguably the most straightforward, you won’t be that surprised to find a DOM parser in the Python standard library. What is surprising, though, is that there are actually two DOM parsers.
The xml.dom package houses two modules to work with DOM in Python:
- xml.dom.minidom
- xml.dom.pulldom
The first is a stripped-down implementation of the DOM interface conforming to a relatively old version of the W3C specification. It provides common objects defined by the DOM API such as Document, Element, and Attr. This module is poorly documented and has quite limited usefulness, as you’re about to find out.
The second module has a slightly misleading name because it defines a streaming pull parser, which can optionally produce a DOM representation of the current node in the document tree. You’ll find more information about the pulldom parser later.
There are two functions in minidom that let you parse XML data from various data sources. One accepts either a filename or a file object, while another one expects a Python string:
>>> from xml.dom.minidom import parse, parseString
>>> # Parse XML from a filename
>>> document = parse("smiley.svg")
>>> # Parse XML from a file object
>>> with open("smiley.svg") as file:
...     document = parse(file)
...
>>> # Parse XML from a Python string
>>> document = parseString("""\
... <svg viewBox="-105 -100 210 270">
... <!-- More content goes here... -->
... </svg>
... """)
The triple-quoted string helps embed a multiline string literal without using the continuation character (\) at the end of each line. In any case, you’ll end up with a Document instance, which exhibits the familiar DOM interface, letting you traverse the tree.
Apart from that, you’ll be able to access the XML declaration, DTD, and the root element:
>>> document = parse("smiley.svg")
>>> # XML Declaration
>>> document.version, document.encoding, document.standalone
('1.0', 'UTF-8', False)
>>> # Document Type Definition (DTD)
>>> dtd = document.doctype
>>> dtd.entities["custom_entity"].childNodes
[<DOM Text node "'Hello'">]
>>> # Document Root
>>> document.documentElement
<DOM Element: svg at 0x7fc78c62d790>
As you can see, even though the default XML parser in Python can’t validate documents, it still lets you inspect .doctype, the DTD, if it’s present. Note that the XML declaration and DTD are optional. If the XML declaration or a given XML attribute is missing, then the corresponding Python attributes will be None.
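The following sketch shows what those missing pieces look like in practice. It parses two tiny documents made up for this example, one without any XML declaration and one whose declaration only states the version:

```python
from xml.dom.minidom import parseString

# No XML declaration and no DTD at all.
no_declaration = parseString('<svg width="210"/>')

# A declaration that only specifies the version.
partial_declaration = parseString('<?xml version="1.0"?><svg/>')

print(no_declaration.version, no_declaration.doctype)
print(partial_declaration.version, partial_declaration.encoding)
```

Checking for None before using these attributes avoids surprises when a document omits its prolog.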
To find an element by ID, you must use the Document instance rather than a specific parent Element. The sample SVG image has two nodes with an id attribute, but you can’t find either of them:
>>> document.getElementById("skin") is None
True
>>> document.getElementById("smiley") is None
True
That may be surprising for someone who has only worked with HTML and JavaScript but hasn’t worked with XML before. While HTML defines the semantics for certain elements and attributes such as <body> or id, XML doesn’t attach any meaning to its building blocks. You need to mark an attribute as an ID explicitly using DTD or by calling .setIdAttribute() in Python, for example:
DTD: <!ATTLIST linearGradient id ID #IMPLIED>
Python: linearGradient.setIdAttribute("id")
However, using a DTD isn’t enough to fix the problem if your document has a default namespace, which is the case for the sample SVG image. To address this, you can visit all elements recursively in Python, check whether they have the id attribute, and indicate it as their ID in one go:
>>> from xml.dom.minidom import parse, Node
>>> def set_id_attribute(parent, attribute_name="id"):
...     if parent.nodeType == Node.ELEMENT_NODE:
...         if parent.hasAttribute(attribute_name):
...             parent.setIdAttribute(attribute_name)
...     for child in parent.childNodes:
...         set_id_attribute(child, attribute_name)
...
>>> document = parse("smiley.svg")
>>> set_id_attribute(document)
Your custom set_id_attribute() function takes a parent element and an optional name for the identity attribute, which defaults to "id". When you call that function on your SVG document, all child elements that have an id attribute become accessible through the DOM API:
>>> document.getElementById("skin")
<DOM Element: linearGradient at 0x7f82247703a0>
>>> document.getElementById("smiley")
<DOM Element: g at 0x7f8224770940>
Now, you’re getting the expected XML element corresponding to the id attribute’s value.
Using an ID allows for finding at most one unique element, but you can also find a collection of similar elements by their tag name. Unlike the .getElementById() method, you can call .getElementsByTagName() on the document or a particular parent element to reduce the search scope:
>>> document.getElementsByTagName("ellipse")
[
<DOM Element: ellipse at 0x7fa2c944f430>,
<DOM Element: ellipse at 0x7fa2c944f4c0>
]
>>> root = document.documentElement
>>> root.getElementsByTagName("ellipse")
[
<DOM Element: ellipse at 0x7fa2c944f430>,
<DOM Element: ellipse at 0x7fa2c944f4c0>
]
Notice that .getElementsByTagName() always returns a list of elements instead of a single element or None. Forgetting about it when you switch between both methods is a common source of errors.
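One way to guard against that mix-up is to wrap the lookup in a small helper that always returns either one element or None. The first_by_tag() function below is a hypothetical convenience and not part of minidom:

```python
from xml.dom.minidom import parseString

document = parseString("<svg><ellipse/><ellipse/></svg>")


def first_by_tag(node, tag_name):
    """Hypothetical helper: return the first matching element or None."""
    matches = node.getElementsByTagName(tag_name)
    # An empty result is still a list, so test its truthiness explicitly.
    return matches[0] if matches else None


print(first_by_tag(document, "ellipse"))
print(first_by_tag(document, "circle"))
```

With this helper, a missing tag yields None just like .getElementById() does, instead of an empty list that fails later on indexing.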
Unfortunately, elements like <inkscape:custom> that are prefixed with a namespace identifier won’t be included. They must be searched using .getElementsByTagNameNS(), which expects different arguments:
>>> document.getElementsByTagNameNS(
... "http://www.inkscape.org/namespaces/inkscape",
... "custom"
... )
...
[<DOM Element: inkscape:custom at 0x7f97e3f2a3a0>]
>>> document.getElementsByTagNameNS("*", "custom")
[<DOM Element: inkscape:custom at 0x7f97e3f2a3a0>]
The first argument must be the XML namespace, which typically has the form of a domain name, while the second argument is the tag name. Notice that the namespace prefix is irrelevant! To search all namespaces, you can provide a wildcard character (*).
Note: To find the namespaces declared in your XML document, you can check out the root element’s attributes. In theory, they could be declared on any element, but the top-level one is where you’d usually find them.
Once you locate the element you’re interested in, you may use it to walk over the tree. However, another jarring quirk with minidom is how it handles whitespace characters between elements:
>>> element = document.getElementById("smiley")
>>> element.parentNode
<DOM Element: svg at 0x7fc78c62d790>
>>> element.firstChild
<DOM Text node "'\n '">
>>> element.lastChild
<DOM Text node "'\n '">
>>> element.nextSibling
<DOM Text node "'\n '">
>>> element.previousSibling
<DOM Text node "'\n '">
The newline characters and leading indentation are captured as separate tree elements, which is what the specification requires. Some parsers let you ignore these, but not the Python one. What you can do, however, is collapse whitespace in such nodes manually:
>>> def remove_whitespace(node):
...     if node.nodeType == Node.TEXT_NODE:
...         if node.nodeValue.strip() == "":
...             node.nodeValue = ""
...     for child in node.childNodes:
...         remove_whitespace(child)
...
>>> document = parse("smiley.svg")
>>> set_id_attribute(document)
>>> remove_whitespace(document)
>>> document.normalize()
Note that you also have to .normalize() the document to combine adjacent text nodes. Otherwise, you could end up with a bunch of redundant XML elements with just whitespace. Again, recursion is the only way to visit tree elements since you can’t iterate over the document and its elements with a loop. Finally, this should give you the expected result:
>>> element = document.getElementById("smiley")
>>> element.parentNode
<DOM Element: svg at 0x7fc78c62d790>
>>> element.firstChild
<DOM Comment node "' Head '">
>>> element.lastChild
<DOM Element: path at 0x7f8beea0f670>
>>> element.nextSibling
<DOM Element: text at 0x7f8beea0f700>
>>> element.previousSibling
<DOM Element: defs at 0x7f8beea0f160>
>>> element.childNodes
[
<DOM Comment node "' Head '">,
<DOM Element: circle at 0x7f8beea0f4c0>,
<DOM Comment node "' Eyes '">,
<DOM Element: ellipse at 0x7fa2c944f430>,
<DOM Element: ellipse at 0x7fa2c944f4c0>,
<DOM Comment node "' Mouth '">,
<DOM Element: path at 0x7f8beea0f670>
]
Elements expose a few helpful methods and properties to let you query their details:
>>> element = document.getElementsByTagNameNS("*", "custom")[0]
>>> element.prefix
'inkscape'
>>> element.tagName
'inkscape:custom'
>>> element.attributes
<xml.dom.minidom.NamedNodeMap object at 0x7f6c9d83ba80>
>>> dict(element.attributes.items())
{'x': '42', 'inkscape:z': '555'}
>>> element.hasChildNodes()
True
>>> element.hasAttributes()
True
>>> element.hasAttribute("x")
True
>>> element.getAttribute("x")
'42'
>>> element.getAttributeNode("x")
<xml.dom.minidom.Attr object at 0x7f82244a05f0>
>>> element.getAttribute("missing-attribute")
''
For instance, you can check an element’s namespace, tag name, or attributes. If you ask for a missing attribute, then you’ll get an empty string ('').
Dealing with namespaced attributes isn’t much different. You just have to remember to prefix the attribute name accordingly or provide the domain name:
>>> element.hasAttribute("z")
False
>>> element.hasAttribute("inkscape:z")
True
>>> element.hasAttributeNS(
... "http://www.inkscape.org/namespaces/inkscape",
... "z"
... )
...
True
>>> element.hasAttributeNS("*", "z")
False
Strangely enough, the wildcard character (*) doesn’t work here as it did with the .getElementsByTagNameNS() method before.
Since this tutorial is only about XML parsing, you’ll need to check the minidom documentation for methods that modify the DOM tree. They mostly follow the W3C specification.
As you can see, the minidom module isn’t terribly convenient. Its main advantage comes from being part of the standard library, which means you don’t have to install any external dependencies in your project to work with the DOM.
xml.sax: The SAX Interface for Python
To start working with SAX in Python, you can use the same parse() and parseString() convenience functions as before, but from the xml.sax package instead. You also have to provide at least one more required argument, which must be a content handler instance. In the spirit of Java, you provide one by subclassing a specific base class:
from xml.sax import parse
from xml.sax.handler import ContentHandler

class SVGHandler(ContentHandler):
    pass

parse("smiley.svg", SVGHandler())
The content handler receives a stream of events corresponding to elements in your document as it’s being parsed. Running this code won’t do anything useful yet because your handler class is empty. To make it work, you’ll need to override one or more callback methods from the superclass.
Fire up your favorite editor, type the following code, and save it in a file named svg_handler.py:
# svg_handler.py
from xml.sax.handler import ContentHandler

class SVGHandler(ContentHandler):
    def startElement(self, name, attrs):
        print(f"BEGIN: <{name}>, {attrs.keys()}")

    def endElement(self, name):
        print(f"END: </{name}>")

    def characters(self, content):
        if content.strip() != "":
            print("CONTENT:", repr(content))
This modified content handler prints out a few events onto the standard output. The SAX parser will call these three methods for you in response to finding the start tag, end tag, and some text between them. When you open an interactive session of the Python interpreter, import your content handler and give it a test drive. It should produce the following output:
>>> from xml.sax import parse
>>> from svg_handler import SVGHandler
>>> parse("smiley.svg", SVGHandler())
BEGIN: <svg>, ['xmlns', 'xmlns:inkscape', 'viewBox', 'width', 'height']
BEGIN: <inkscape:custom>, ['x', 'inkscape:z']
CONTENT: 'Some value'
END: </inkscape:custom>
BEGIN: <defs>, []
BEGIN: <linearGradient>, ['id', 'x1', 'x2', 'y1', 'y2']
BEGIN: <stop>, ['offset', 'stop-color', 'stop-opacity']
END: </stop>
⋮
That’s essentially the observer design pattern, which lets you translate XML into another hierarchical format incrementally. Say you wanted to convert that SVG file into a simplified JSON representation. First, you’ll want to store your content handler object in a separate variable to extract information from it later:
>>> from xml.sax import parse
>>> from svg_handler import SVGHandler
>>> handler = SVGHandler()
>>> parse("smiley.svg", handler)
Since the SAX parser emits events without providing any context about the element it’s found, you need to keep track of where you are in the tree. Therefore, it makes sense to push and pop the current element onto a stack, which you can simulate through a regular Python list. You may also define a helper property .current_element that will return the last element placed on the top of the stack:
# svg_handler.py
# ...
class SVGHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self.element_stack = []

    @property
    def current_element(self):
        return self.element_stack[-1]
# ...
When the SAX parser finds a new element, you can immediately capture its tag name and attributes while making placeholders for child elements and the value, both of which are optional. For now, you can store every element as a dict object. Replace your existing .startElement() method with a new implementation:
# svg_handler.py
# ...
class SVGHandler(ContentHandler):
    # ...
    def startElement(self, name, attrs):
        self.element_stack.append({
            "name": name,
            "attributes": dict(attrs),
            "children": [],
            "value": ""
        })
The SAX parser gives you attributes as a mapping that you can convert to a plain Python dictionary with a call to the dict() function. The element value is often spread over multiple pieces that you can concatenate using the plus operator (+) or a corresponding augmented assignment statement:
# svg_handler.py
# ...
class SVGHandler(ContentHandler):
    # ...
    def characters(self, content):
        self.current_element["value"] += content
Aggregating text in such a way will ensure that multiline content ends up in the current element. For example, the <script> tag in the sample SVG file contains six lines of JavaScript code, which trigger separate calls to the characters() callback.
Finally, once the parser stumbles on a closing tag, you can pop the current element from the stack and append it to its parent’s children. If there’s only one element left, then it will be your document’s root that you should keep for later. Other than that, you might want to clean the current element by removing keys with empty values:
# svg_handler.py
# ...
class SVGHandler(ContentHandler):
    # ...
    def endElement(self, name):
        clean(self.current_element)
        if len(self.element_stack) > 1:
            child = self.element_stack.pop()
            self.current_element["children"].append(child)

def clean(element):
    element["value"] = element["value"].strip()
    for key in ("attributes", "children", "value"):
        if not element[key]:
            del element[key]
Note that clean() is a function defined outside of the class body. Cleaning must be done at the end since there’s no way of knowing up front how many text pieces there might be to concatenate.
Now, it’s time to put everything to the test by parsing the XML, extracting the root element from your content handler, and dumping it to a JSON string:
>>> from xml.sax import parse
>>> from svg_handler import SVGHandler
>>> handler = SVGHandler()
>>> parse("smiley.svg", handler)
>>> root = handler.current_element
>>> import json
>>> print(json.dumps(root, indent=4))
{
"name": "svg",
"attributes": {
"xmlns": "http://www.w3.org/2000/svg",
"xmlns:inkscape": "http://www.inkscape.org/namespaces/inkscape",
"viewBox": "-105 -100 210 270",
"width": "210",
"height": "270"
},
"children": [
{
"name": "inkscape:custom",
"attributes": {
"x": "42",
"inkscape:z": "555"
},
"value": "Some value"
},
⋮
It’s worth noting that this implementation has no memory gain over DOM because it builds an abstract representation of the whole document just as before. The difference is that you’ve made a custom dictionary representation instead of the standard DOM tree. However, you could imagine writing directly to a file or a database instead of memory while receiving SAX events. That would effectively lift your computer memory limit.
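The idea of writing records out while the events arrive can be sketched roughly as follows. The StreamingHandler class, its one-JSON-line-per-element output format, and the in-memory io.StringIO stand-in for a real file are all assumptions made for this example:

```python
import io
import json
import xml.sax
from xml.sax.handler import ContentHandler


class StreamingHandler(ContentHandler):
    """Hypothetical handler that writes one JSON line per opening tag."""

    def __init__(self, output):
        super().__init__()
        self.output = output
        self.path = []  # Only the names of currently open elements are kept.

    def startElement(self, name, attrs):
        self.path.append(name)
        record = {"path": "/".join(self.path), "attributes": dict(attrs)}
        # Write the record right away instead of building a tree in memory.
        self.output.write(json.dumps(record) + "\n")

    def endElement(self, name):
        self.path.pop()


output = io.StringIO()  # Stand-in for a real file or database connection
sample = b'<svg><g id="smiley"><path d="M0 0"/></g></svg>'
xml.sax.parseString(sample, StreamingHandler(output))
lines = output.getvalue().splitlines()
print(len(lines))  # 3
```

Here the handler’s memory use grows with the nesting depth of the document rather than with its total size, at the cost of a flat output format.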
If you want to parse XML namespaces, then you’ll need to create and configure the SAX parser yourself with a bit of boilerplate code and also implement slightly different callbacks:
# svg_handler.py
from xml.sax.handler import ContentHandler

class SVGHandler(ContentHandler):
    def startPrefixMapping(self, prefix, uri):
        print(f"startPrefixMapping: {prefix=}, {uri=}")

    def endPrefixMapping(self, prefix):
        print(f"endPrefixMapping: {prefix=}")

    def startElementNS(self, name, qname, attrs):
        print(f"startElementNS: {name=}")

    def endElementNS(self, name, qname):
        print(f"endElementNS: {name=}")
These callbacks receive additional parameters about the element’s namespace. To make the SAX parser actually trigger those callbacks instead of some of the earlier ones, you must explicitly enable XML namespace support:
>>> from xml.sax import make_parser
>>> from xml.sax.handler import feature_namespaces
>>> from svg_handler import SVGHandler
>>> parser = make_parser()
>>> parser.setFeature(feature_namespaces, True)
>>> parser.setContentHandler(SVGHandler())
>>> parser.parse("smiley.svg")
startPrefixMapping: prefix=None, uri='http://www.w3.org/2000/svg'
startPrefixMapping: prefix='inkscape', uri='http://www.inkscape.org/namespaces/inkscape'
startElementNS: name=('http://www.w3.org/2000/svg', 'svg')
⋮
endElementNS: name=('http://www.w3.org/2000/svg', 'svg')
endPrefixMapping: prefix='inkscape'
endPrefixMapping: prefix=None
Setting this feature turns the element name into a tuple consisting of the namespace URI and the tag name.
The xml.sax package offers a decent event-based XML parser interface modeled after the original Java API. It’s somewhat limited compared to the DOM but should be enough to implement a basic XML streaming push parser without resorting to third-party libraries. That said, there’s a less verbose pull parser available in Python, which you’ll explore next.
xml.dom.pulldom: Streaming Pull Parser
The parsers in the Python standard library often work together. For example, the xml.dom.pulldom module wraps the parser from xml.sax to take advantage of buffering and read the document in chunks. At the same time, it uses the default DOM implementation from xml.dom.minidom for representing document elements. However, those elements are processed one at a time without bearing any relationship until you ask for it explicitly.
Note: The XML namespace support is enabled by default in xml.dom.pulldom.
While the SAX model follows the observer pattern, you can think of StAX as the iterator design pattern, which lets you loop over a flat stream of events. Once again, you can call the familiar parse() or parseString() functions imported from the module to parse the SVG image:
>>> from xml.dom.pulldom import parse
>>> event_stream = parse("smiley.svg")
>>> for event, node in event_stream:
...     print(event, node)
...
START_DOCUMENT <xml.dom.minidom.Document object at 0x7f74f9283e80>
START_ELEMENT <DOM Element: svg at 0x7f74fde18040>
CHARACTERS <DOM Text node "'\n'">
⋮
END_ELEMENT <DOM Element: script at 0x7f74f92b3c10>
CHARACTERS <DOM Text node "'\n'">
END_ELEMENT <DOM Element: svg at 0x7f74fde18040>
It takes only a few lines of code to parse the document. The most striking difference between xml.sax and xml.dom.pulldom is the lack of callbacks since you drive the whole process. You have a lot more freedom in structuring your code, and you don’t need to use classes if you don’t want to.
Notice that the XML nodes pulled from the stream have types defined in xml.dom.minidom. But if you were to check their parents, siblings, and children, then you’d find out they know nothing about each other:
>>> from xml.dom.pulldom import parse, START_ELEMENT
>>> event_stream = parse("smiley.svg")
>>> for event, node in event_stream:
...     if event == START_ELEMENT:
...         print(node.parentNode, node.previousSibling, node.childNodes)
...
<xml.dom.minidom.Document object at 0x7f90864f6e80> None []
None None []
None None []
None None []
⋮
The relevant attributes are empty. Anyway, the pull parser can help in a hybrid approach to quickly look up some parent element and build a DOM tree only for the branch rooted in it:
from xml.dom.pulldom import parse, START_ELEMENT

def process_group(parent):
    left_eye, right_eye = parent.getElementsByTagName("ellipse")
    # ...

event_stream = parse("smiley.svg")
for event, node in event_stream:
    if event == START_ELEMENT:
        if node.tagName == "g":
            event_stream.expandNode(node)
            process_group(node)
By calling .expandNode() on the event stream, you essentially move the iterator forward and parse XML nodes recursively until finding the matching closing tag of the parent element. The resulting node will have children with properly initialized attributes. Moreover, you’ll be able to use the DOM methods on them.
The pull parser offers an interesting alternative to DOM and SAX by combining the best of both worlds. It’s efficient, flexible, and straightforward to use, leading to more compact and readable code. You could also use it to process multiple XML files at the same time more easily. That said, none of the XML parsers mentioned so far can match the elegance, simplicity, and completeness of the last one to arrive in Python’s standard library.
xml.etree.ElementTree: A Lightweight, Pythonic Alternative
The XML parsers you’ve come to know so far get the job done. However, they don’t fit Python’s philosophy very well, and that’s no accident. While DOM follows the W3C specification and SAX was modeled after a Java API, neither feels particularly Pythonic.
Even worse, both DOM and SAX parsers feel antiquated as some of their code in the CPython interpreter hasn’t changed for more than two decades! At the time of writing this, their implementation is still incomplete and has missing typeshed stubs, which breaks code completion in code editors.
Meanwhile, Python 2.5 brought a fresh perspective on parsing and writing XML documents: the ElementTree API. It’s a lightweight, efficient, elegant, and feature-rich interface that even some third-party libraries build on. To get started with it, you must import the xml.etree.ElementTree module, which is a bit of a mouthful. Therefore, it’s customary to define an alias like this:
import xml.etree.ElementTree as ET
In slightly older code, you may have seen the cElementTree module imported instead. It was an implementation of the same interface written in C, which made it several times faster than the pure-Python version. Today, the regular module uses the fast implementation whenever possible, so you don’t need to bother anymore.
You can use the ElementTree API by employing different parsing strategies:
- ET.parse(): Non-incremental
- ET.fromstring(): Non-incremental
- ET.iterparse(): Incremental (Blocking)
- ET.XMLPullParser: Incremental (Non-blocking)
The non-incremental strategy loads up the entire document into memory in a DOM-like fashion. There are two appropriately named functions in the module that allow for parsing a file or a Python string with XML content:
>>> import xml.etree.ElementTree as ET
>>> # Parse XML from a filename
>>> ET.parse("smiley.svg")
<xml.etree.ElementTree.ElementTree object at 0x7fa4c980a6a0>
>>> # Parse XML from a file object
>>> with open("smiley.svg") as file:
...     ET.parse(file)
...
<xml.etree.ElementTree.ElementTree object at 0x7fa4c96df340>
>>> # Parse XML from a Python string
>>> ET.fromstring("""\
... <svg viewBox="-105 -100 210 270">
... <!-- More content goes here... -->
... </svg>
... """)
<Element 'svg' at 0x7fa4c987a1d0>
Parsing a file object or a filename with parse() returns an instance of the ET.ElementTree class, which represents the whole element hierarchy. On the other hand, parsing a string with fromstring() will return the specific root ET.Element.
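To make the difference between the two return types concrete, here’s a minimal sketch that parses the same markup both ways. The inline XML string and the use of io.StringIO as a stand-in for a real file are assumptions made to keep the example self-contained:

```python
import io
import xml.etree.ElementTree as ET

# Illustrative markup; any well-formed XML would do.
XML = '<svg viewBox="-105 -100 210 270"><g/></svg>'

# fromstring() hands back the root Element directly.
root = ET.fromstring(XML)

# parse() takes a filename or a file-like object and returns an ElementTree.
tree = ET.parse(io.StringIO(XML))

# .getroot() unwraps the tree, while ET.ElementTree() wraps an existing element.
unwrapped = tree.getroot()
wrapped = ET.ElementTree(root)

print(type(root).__name__, type(tree).__name__)
print(unwrapped.tag, wrapped.getroot() is root)
```

Wrapping a root element in ET.ElementTree is handy when you started from a string but later want tree-level operations, such as writing the document back to a file.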
Alternatively, you can read the XML document incrementally with a streaming pull parser, which yields a sequence of events and elements:
>>> for event, element in ET.iterparse("smiley.svg"):
...     print(event, element.tag)
...
end {http://www.inkscape.org/namespaces/inkscape}custom
end {http://www.w3.org/2000/svg}stop
end {http://www.w3.org/2000/svg}stop
end {http://www.w3.org/2000/svg}stop
end {http://www.w3.org/2000/svg}linearGradient
⋮
By default, iterparse() emits only the end events associated with the closing XML tag. However, you can subscribe to other events as well by passing their names as strings, such as "comment":
>>> import xml.etree.ElementTree as ET
>>> for event, element in ET.iterparse("smiley.svg", ["comment"]):
...     print(element.text.strip())
...
Head
Eyes
Mouth
Here’s a list of all the available event types:
start: Start of an element
end: End of an element
comment: Comment element
pi: Processing instruction, as in XSL
start-ns: Start of a namespace
end-ns: End of a namespace
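The event types listed above carry different payloads, which the following sketch tries to illustrate. It assumes a small in-memory document wrapped in io.BytesIO instead of the smiley.svg file. For start and end, the second item is an element, for start-ns it’s a (prefix, uri) tuple, and for end-ns it’s None:

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in document with a default namespace declaration.
XML = b"""<?xml version="1.0"?>
<svg xmlns="http://www.w3.org/2000/svg">
  <g><circle r="5"/></g>
</svg>"""

events = ("start-ns", "start", "end", "end-ns")
log = []
for event, payload in ET.iterparse(io.BytesIO(XML), events):
    if event == "start-ns":
        prefix, uri = payload  # A (prefix, uri) tuple
        log.append((event, f"{prefix!r} -> {uri}"))
    elif event == "end-ns":
        log.append((event, None))  # The payload is None here
    else:
        log.append((event, payload.tag))  # The payload is an Element

for entry in log:
    print(entry)
```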
The downside of iterparse() is that it uses blocking calls to read the next chunk of data, which might be unsuitable for asynchronous code running on a single thread of execution. To alleviate that, you can look into XMLPullParser, which is a little bit more verbose:
import xml.etree.ElementTree as ET

async def receive_data(url):
    """Download chunks of bytes from the URL asynchronously."""
    yield b"<svg "
    yield b"viewBox=\"-105 -100 210 270\""
    yield b"></svg>"

async def parse(url, events=None):
    parser = ET.XMLPullParser(events)
    async for chunk in receive_data(url):
        parser.feed(chunk)
        for event, element in parser.read_events():
            yield event, element
This hypothetical example feeds the parser with chunks of XML that can arrive a few seconds apart. Once there’s enough content, you can iterate over a sequence of events and elements buffered by the parser. This non-blocking incremental parsing strategy allows for a truly concurrent parsing of multiple XML documents on the fly while you download them.
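If you’d like to run something similar from start to finish, the sketch below drives XMLPullParser from an event loop. The fake_chunks() coroutine is a hypothetical stand-in for a real download, and the events are drained once more after close() in case the parser was still holding on to buffered input:

```python
import asyncio
import xml.etree.ElementTree as ET

async def fake_chunks():
    """Hypothetical stand-in for chunks arriving over the network."""
    for chunk in (b"<svg ", b'viewBox="0 0 1 1">', b"<g/>", b"</svg>"):
        await asyncio.sleep(0)  # Give other tasks a chance to run
        yield chunk

async def collect_events():
    parser = ET.XMLPullParser(events=("start", "end"))
    seen = []

    def drain():
        # read_events() only returns what the parser has buffered so far.
        for event, element in parser.read_events():
            seen.append((event, element.tag))

    async for chunk in fake_chunks():
        parser.feed(chunk)
        drain()
    parser.close()  # Signal the end of the stream
    drain()         # Pick up anything that became available on close
    return seen

seen = asyncio.run(collect_events())
print(seen)
```

Because feed() never waits for more input, several such coroutines could take turns on the same thread, each with its own parser instance.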
Elements in the tree are mutable, iterable, and indexable sequences. They have a length corresponding to the number of their immediate children:
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse("smiley.svg")
>>> root = tree.getroot()
>>> # The length of an element equals the number of its children.
>>> len(root)
5
>>> # The square brackets let you access a child by an index.
>>> root[1]
<Element '{http://www.w3.org/2000/svg}defs' at 0x7fe05d2e8860>
>>> root[2]
<Element '{http://www.w3.org/2000/svg}g' at 0x7fa4c9848400>
>>> # Elements are mutable. For example, you can swap their children.
>>> root[2], root[1] = root[1], root[2]
>>> # You can iterate over an element's children.
>>> for child in root:
...     print(child.tag)
...
{http://www.inkscape.org/namespaces/inkscape}custom
{http://www.w3.org/2000/svg}g
{http://www.w3.org/2000/svg}defs
{http://www.w3.org/2000/svg}text
{http://www.w3.org/2000/svg}script
Tag names might be prefixed with an optional namespace enclosed in a pair of curly braces ({}). The default XML namespace appears there, too, when defined. Notice how the swap assignment, root[2], root[1] = root[1], root[2], made the <g> element come before <defs>. This shows the mutable nature of the sequence.
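Swapping isn’t the only mutation available. As a rough illustration, the following sketch applies a few list-like operations to a simplified document without namespaces, which is an assumption made to keep the tag names short:

```python
import xml.etree.ElementTree as ET

# A simplified document without namespaces, used for illustration only.
root = ET.fromstring("<svg><defs/><g/><text/></svg>")

# Swap the first two children, just like with a list.
root[0], root[1] = root[1], root[0]

# Add children at the end and at a specific position.
root.append(ET.Element("script"))
root.insert(0, ET.Element("title"))

# Remove a child by passing the element itself.
root.remove(root.find("text"))

tags = [child.tag for child in root]
print(len(root), tags)
```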
Here are a few more element attributes and methods that are worth mentioning:
>>> element = root[0]
>>> element.tag
'{http://www.inkscape.org/namespaces/inkscape}custom'
>>> element.text
'Some value'
>>> element.attrib
{'x': '42', '{http://www.inkscape.org/namespaces/inkscape}z': '555'}
>>> element.get("x")
'42'
One of the benefits of this API is how it uses Python’s native data types. Above, it uses a Python dictionary for the element’s attributes. In the previous modules, those were wrapped in less convenient adapters. Unlike the DOM, the ElementTree API doesn’t expose methods or properties for walking over the tree in any direction, but there are a couple of better alternatives.
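Since attributes live in a plain dictionary, the usual dictionary habits mostly carry over. Here’s a minimal sketch, assuming an inline element modeled on the <inkscape:custom> element from the example above, that reads attributes with and without a default, looks up a namespaced attribute, and sets a new one:

```python
import xml.etree.ElementTree as ET

INK = "http://www.inkscape.org/namespaces/inkscape"

# An inline element modeled on the one shown above (illustrative only).
element = ET.fromstring(
    f'<inkscape:custom xmlns:inkscape="{INK}" x="42" inkscape:z="555">'
    "Some value</inkscape:custom>"
)

x = element.get("x")              # Existing attribute
missing = element.get("y", "0")   # Fallback for a missing attribute
z = element.get(f"{{{INK}}}z")    # Namespaced attributes use the {uri}name form

element.set("y", "7")             # Attribute values are always strings
print(x, missing, z, element.attrib)
```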
As you’ve seen before, instances of the Element class implement the sequence protocol, letting you iterate over their direct children with a loop:
>>> for child in root:
...     print(child.tag)
...
{http://www.inkscape.org/namespaces/inkscape}custom
{http://www.w3.org/2000/svg}defs
{http://www.w3.org/2000/svg}g
{http://www.w3.org/2000/svg}text
{http://www.w3.org/2000/svg}script
You get the sequence of the root’s immediate children. To go deeper into nested descendants, however, you’ll have to call the .iter() method on the ancestor element:
>>> for descendant in root.iter():
...     print(descendant.tag)
...
{http://www.w3.org/2000/svg}svg
{http://www.inkscape.org/namespaces/inkscape}custom
{http://www.w3.org/2000/svg}defs
{http://www.w3.org/2000/svg}linearGradient
{http://www.w3.org/2000/svg}stop
{http://www.w3.org/2000/svg}stop
{http://www.w3.org/2000/svg}stop
{http://www.w3.org/2000/svg}g
{http://www.w3.org/2000/svg}circle
{http://www.w3.org/2000/svg}ellipse
{http://www.w3.org/2000/svg}ellipse
{http://www.w3.org/2000/svg}path
{http://www.w3.org/2000/svg}text
{http://www.w3.org/2000/svg}script
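The root element has only five children but thirteen descendants in total. The listing above has fourteen lines because .iter() also yields the element it’s called on, here the root <svg>. It’s also possible to narrow down the descendants by filtering only specific tag names using an optional tag argument: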
>>> tag_name = "{http://www.w3.org/2000/svg}ellipse"
>>> for descendant in root.iter(tag_name):
...     print(descendant)
...
<Element '{http://www.w3.org/2000/svg}ellipse' at 0x7f430baa03b0>
<Element '{http://www.w3.org/2000/svg}ellipse' at 0x7f430baa0450>
This time, you only got two <ellipse> elements. Remember to include the XML namespace, such as {http://www.w3.org/2000/svg}, in your tag name, as long as it’s been defined. Otherwise, if you only provide the tag name without the right namespace, you could end up with fewer or more descendant elements than initially anticipated.
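To see how a missing namespace can change the result, consider this sketch. It assumes a contrived document in which one <ellipse> opts out of the default namespace, and qname() is a hypothetical helper for building qualified tag names:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Two namespaced ellipses plus one that resets the default namespace.
root = ET.fromstring(
    f'<svg xmlns="{SVG_NS}">'
    "<g><ellipse/><ellipse/></g>"
    '<ellipse xmlns=""/>'
    "</svg>"
)

def qname(local_name):
    """Hypothetical helper returning a tag in the {uri}name form."""
    return f"{{{SVG_NS}}}{local_name}"

bare = list(root.iter("ellipse"))              # Only the namespace-less one
qualified = list(root.iter(qname("ellipse")))  # Only the SVG ones

print(len(bare), len(qualified))
```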
Dealing with namespaces is more convenient when using .iterfind(), which accepts an optional mapping of prefixes to namespace URIs. To indicate the default namespace, you can leave the key blank or assign an arbitrary prefix, which must be used in the tag name later:
>>> namespaces = {
...     "": "http://www.w3.org/2000/svg",
...     "custom": "http://www.w3.org/2000/svg"
... }
>>> for descendant in root.iterfind("g", namespaces):
...     print(descendant)
...
<Element '{http://www.w3.org/2000/svg}g' at 0x7f430baa0270>
>>> for descendant in root.iterfind("custom:g", namespaces):
...     print(descendant)
...
<Element '{http://www.w3.org/2000/svg}g' at 0x7f430baa0270>
The namespace mapping lets you refer to the same element with different prefixes. Surprisingly, if you try to find those nested <ellipse> elements like before, then .iterfind() won’t return anything because it expects an XPath expression rather than a simple tag name:
>>> for descendant in root.iterfind("ellipse", namespaces):
...     print(descendant)
...
>>> for descendant in root.iterfind("g/ellipse", namespaces):
...     print(descendant)
...
<Element '{http://www.w3.org/2000/svg}ellipse' at 0x7f430baa03b0>
<Element '{http://www.w3.org/2000/svg}ellipse' at 0x7f430baa0450>
By coincidence, the string "g" happens to be a valid path relative to the current root element, which is why the method returned a non-empty result before. However, to find the ellipses nested one level deeper in the XML hierarchy, you need a more verbose path expression.
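When you don’t know how deep an element sits, the supported XPath subset also includes the .// descendant axis and simple attribute predicates. The following sketch assumes a reduced document with made-up rx values and an explicit svg prefix in the namespace mapping:

```python
import xml.etree.ElementTree as ET

# A reduced document; the rx values are made up for illustration.
root = ET.fromstring(
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<g><ellipse rx="5"/><ellipse rx="9"/></g>'
    "</svg>"
)
ns = {"svg": "http://www.w3.org/2000/svg"}

direct = root.findall("svg:ellipse", ns)          # Direct children only
nested = root.findall("svg:g/svg:ellipse", ns)    # Explicit path
anywhere = root.findall(".//svg:ellipse", ns)     # Any depth below root
wide = root.findall(".//svg:ellipse[@rx='9']", ns)  # Attribute predicate

print(len(direct), len(nested), len(anywhere), len(wide))
```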
ElementTree has limited syntax support for the XPath mini-language, which you can use to query elements in XML, similar to CSS selectors in HTML. There are other methods that accept such an expression:
>>> namespaces = {"": "http://www.w3.org/2000/svg"}
>>> root.iterfind("defs", namespaces)
<generator object prepare_child.<locals>.select at 0x7f430ba6d190>
>>> root.findall("defs", namespaces)
[<Element '{http://www.w3.org/2000/svg}defs' at 0x7f430ba09e00>]
>>> root.find("defs", namespaces)
<Element '{http://www.w3.org/2000/svg}defs' at 0x7f430ba09e00>
While .iterfind() yields matching elements lazily, .findall() returns a list, and .find() returns only the first matching element. Similarly, you can extract text enclosed between the opening and closing tags of elements using .findtext() or get the inner text of the entire document with .itertext():
>>> namespaces = {"i": "http://www.inkscape.org/namespaces/inkscape"}
>>> root.findtext("i:custom", namespaces=namespaces)
'Some value'
>>> for text in root.itertext():
...     if text.strip() != "":
...         print(text.strip())
...
Some value
Hello <svg>!
console.log("CDATA disables XML parsing: <svg>")
⋮
You look for text embedded in a specific XML element first, then everywhere in the whole document. Searching by text is a powerful feature of the ElementTree API. It’s possible to replicate it using other built-in parsers, but at the cost of increased code complexity and less convenience.
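As a self-contained variation on the text lookups above, this sketch uses an inline document without namespaces, which is an assumption made for brevity. It also shows the default argument of .findtext(), which is returned when no element matches:

```python
import xml.etree.ElementTree as ET

# An inline document loosely modeled on the smiley example.
root = ET.fromstring(
    "<svg>"
    "<custom>Some value</custom>"
    "<text>Hello &lt;svg&gt;!</text>"
    "<script><![CDATA[console.log('<svg>')]]></script>"
    "</svg>"
)

first = root.findtext("custom")                      # Text of the first match
fallback = root.findtext("missing", default="n/a")   # No match, use default

# Collect all non-blank text fragments in document order.
chunks = [text.strip() for text in root.itertext() if text.strip()]

print(first, fallback)
print(chunks)
```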
The ElementTree API is probably the most intuitive one of them all. It’s Pythonic, efficient, robust, and universal. Unless you have a specific reason to use DOM or SAX, this should be your default choice.