

Python Library xml.dom.minidom Howto (7)
source link: http://siongui.github.io/2012/05/27/python-xml-dom-minidom-howto-7/

PARSE XML/HTML FROM A FILE
This post gives a real-world example of how to parse and retrieve data from an XML/HTML file using Python's xml.dom.minidom library. The following XML file contains explanations of the Pāli word abbhāna. We want to parse the file and extract the information.
example.xml
<?xml version="1.0" encoding="utf-8"?>
<cd>
  <item>
    <dict>◎ 《汉译パーリ语辞典》 黃秉榮譯 词数 7735.</dict>
    <word>abbhāna</word>
    <explain>%3a%6e%2e%20%5b%61%62%68%69%2d%c4%81%79%c4%81%6e%61%5d%20%e5%87%ba%e7%bd%aa%2c%20%e5%ae%b9%e8%a8%b1%2c%20%e5%be%a9%e6%ad%b8%28%e6%81%a2%e5%be%a9%e5%8e%9f%e7%8b%80%29%2e</explain>
  </item>
  <item>
    <dict>◎ 《パーリ语辞典》 日本水野弘元教授 词数 13772.</dict>
    <word>abbhāna</word>
    <explain>%3a%6e%2e%20%5b%61%62%68%69%2d%c4%81%79%c4%81%6e%61%5d%20%e5%87%ba%e7%bd%aa%2c%20%e8%a8%b1%e5%ae%b9%2c%20%e5%be%a9%e5%b8%b0%2e</explain>
  </item>
  <item>
    <dict>◎ 《巴汉词典》 明法尊者增订</dict>
    <word>Abbhāna</word>
    <explain>%2c%20%28%61%62%68%69%20%2b%20%c4%81%79%61%6e%61%20%6f%66%20%c4%81%20%2b%20%79%c4%81%20%28%69%29%29%2c%e3%80%90%e4%b8%ad%e3%80%91%e5%a4%8d%e5%bd%92%28%e6%af%94%e4%b8%98%e8%ba%ab%e4%bb%bd%29%28%63%6f%6d%69%6e%67%20%62%61%63%6b%2c%20%72%65%68%61%62%69%6c%69%74%61%74%69%6f%6e%20%6f%66%20%61%20%62%68%69%6b%6b%68%75%20%77%68%6f%20%68%61%73%20%75%6e%64%65%72%67%6f%6e%65%20%61%20%70%65%6e%61%6e%63%65%20%66%6f%72%20%61%6e%20%65%78%70%69%61%62%6c%65%20%6f%66%66%65%6e%63%65%29%e3%80%82</explain>
  </item>
  <item>
    <dict>◎ 《PTS Pali-English dictionary》 The Pali Text Society's Pali-English dictionary</dict>
    <word>Abbhāna</word>
    <explain>%2c%28%6e%74%2e%29%20%5b%61%62%68%69%20%2b%20%c4%81%79%61%6e%61%20%6f%66%20%c4%81%20%2b%20%3c%65%6d%3e%79%c4%81%3c%2f%65%6d%3e%3c%69%3e%20%28%3c%2f%69%3e%3c%65%6d%3e%69%3c%2f%65%6d%3e%3c%69%3e%29%3c%2f%69%3e%5d%20%63%6f%6d%69%6e%67%20%62%61%63%6b%2c%20%72%65%68%61%62%69%6c%69%74%61%74%69%6f%6e%20%6f%66%20%61%20%62%68%69%6b%6b%68%75%20%77%68%6f%20%68%61%73%20%75%6e%64%65%72%67%6f%6e%65%20%61%20%70%65%6e%61%6e%63%65%20%66%6f%72%20%61%6e%20%65%78%70%69%61%62%6c%65%20%6f%66%66%65%6e%63%65%20%56%69%6e%2e%49%2c%34%39%20%28%c2%b0%c3%a2%72%61%68%61%29%2c%20%35%33%20%28%69%64%2e%29%2c%20%31%34%33%2c%20%33%32%37%3b%20%49%49%2c%33%33%2c%20%34%30%2c%20%31%36%32%3b%20%41%2e%49%2c%39%39%2e%20%2d%2d%20%43%70%2e%20%3c%69%3e%61%62%62%68%65%74%69%3c%2f%69%3e%2e%20%28%50%61%67%65%20%36%30%29</explain>
  </item>
</cd>
The following Python script parses the XML file above. In main(), it first parses the file with xml.dom.minidom.parse, then collects every item element by calling getElementsByTagName and passes each one to decodeItem. decodeItem looks up the dict, word, and explain child elements of the item, reads the content of each element's text node through childNodes[0].data, and prints the three values.
minidom-howto-7.py
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import xml.dom.minidom

def decodeItem(item):
    dict_elem = item.getElementsByTagName("dict")[0]  # avoid shadowing the built-in dict
    word_elem = item.getElementsByTagName("word")[0]
    explain_elem = item.getElementsByTagName("explain")[0]

    dictstr = dict_elem.childNodes[0].data  # first child is the text node
    wordstr = word_elem.childNodes[0].data
    explainstr = explain_elem.childNodes[0].data

    print("dict: %s" % dictstr)
    print("word: %s" % wordstr)
    print("explain: %s" % explainstr)


def main():
    dom = xml.dom.minidom.parse("example.xml")  # parse the whole file into a DOM tree

    items = dom.getElementsByTagName("item")
    for item in items:
        decodeItem(item)


if __name__ == '__main__':
    main()
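The script above indexes childNodes[0] directly, which raises IndexError when an element has no text at all. The following is a minimal sketch of a more defensive variant, assuming Python 3. The get_text and decode_items helpers and the inline SAMPLE document are hypothetical names introduced for illustration; they are not part of the original script or of example.xml.

```python
import xml.dom.minidom

# Hypothetical two-item sample (not the article's example.xml);
# the second <explain> is deliberately empty.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<cd>
  <item><word>abbhāna</word><explain>%3a%6e%2e</explain></item>
  <item><word>Abbhāna</word><explain></explain></item>
</cd>"""


def get_text(element):
    """Join every text node directly under element ('' if there are none)."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def decode_items(xml_bytes):
    """Return a list of (word, explain) tuples, one per <item>."""
    dom = xml.dom.minidom.parseString(xml_bytes)
    results = []
    for item in dom.getElementsByTagName("item"):
        word = item.getElementsByTagName("word")[0]
        explain = item.getElementsByTagName("explain")[0]
        results.append((get_text(word), get_text(explain)))
    return results


entries = decode_items(SAMPLE.encode("utf-8"))
for word, explain in entries:
    print("word: %s" % word)
    print("explain: %s" % explain)
```

Joining all direct text nodes, rather than reading only the first child, also covers the case where an element's text ends up split across several nodes.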
Running minidom-howto-7.py prints:
dict: ◎ 《汉译パーリ语辞典》 黃秉榮譯 词数 7735.
word: abbhāna
explain: %3a%6e%2e%20%5b%61%62%68%69%2d%c4%81%79%c4%81%6e%61%5d%20%e5%87%ba%e7%bd%aa%2c%20%e5%ae%b9%e8%a8%b1%2c%20%e5%be%a9%e6%ad%b8%28%e6%81%a2%e5%be%a9%e5%8e%9f%e7%8b%80%29%2e
dict: ◎ 《パーリ语辞典》 日本水野弘元教授 词数 13772.
word: abbhāna
explain: %3a%6e%2e%20%5b%61%62%68%69%2d%c4%81%79%c4%81%6e%61%5d%20%e5%87%ba%e7%bd%aa%2c%20%e8%a8%b1%e5%ae%b9%2c%20%e5%be%a9%e5%b8%b0%2e
dict: ◎ 《巴汉词典》 明法尊者增订
word: Abbhāna
explain: %2c%20%28%61%62%68%69%20%2b%20%c4%81%79%61%6e%61%20%6f%66%20%c4%81%20%2b%20%79%c4%81%20%28%69%29%29%2c%e3%80%90%e4%b8%ad%e3%80%91%e5%a4%8d%e5%bd%92%28%e6%af%94%e4%b8%98%e8%ba%ab%e4%bb%bd%29%28%63%6f%6d%69%6e%67%20%62%61%63%6b%2c%20%72%65%68%61%62%69%6c%69%74%61%74%69%6f%6e%20%6f%66%20%61%20%62%68%69%6b%6b%68%75%20%77%68%6f%20%68%61%73%20%75%6e%64%65%72%67%6f%6e%65%20%61%20%70%65%6e%61%6e%63%65%20%66%6f%72%20%61%6e%20%65%78%70%69%61%62%6c%65%20%6f%66%66%65%6e%63%65%29%e3%80%82
dict: ◎ 《PTS Pali-English dictionary》 The Pali Text Society's Pali-English dictionary
word: Abbhāna
explain: %2c%28%6e%74%2e%29%20%5b%61%62%68%69%20%2b%20%c4%81%79%61%6e%61%20%6f%66%20%c4%81%20%2b%20%3c%65%6d%3e%79%c4%81%3c%2f%65%6d%3e%3c%69%3e%20%28%3c%2f%69%3e%3c%65%6d%3e%69%3c%2f%65%6d%3e%3c%69%3e%29%3c%2f%69%3e%5d%20%63%6f%6d%69%6e%67%20%62%61%63%6b%2c%20%72%65%68%61%62%69%6c%69%74%61%74%69%6f%6e%20%6f%66%20%61%20%62%68%69%6b%6b%68%75%20%77%68%6f%20%68%61%73%20%75%6e%64%65%72%67%6f%6e%65%20%61%20%70%65%6e%61%6e%63%65%20%66%6f%72%20%61%6e%20%65%78%70%69%61%62%6c%65%20%6f%66%66%65%6e%63%65%20%56%69%6e%2e%49%2c%34%39%20%28%c2%b0%c3%a2%72%61%68%61%29%2c%20%35%33%20%28%69%64%2e%29%2c%20%31%34%33%2c%20%33%32%37%3b%20%49%49%2c%33%33%2c%20%34%30%2c%20%31%36%32%3b%20%41%2e%49%2c%39%39%2e%20%2d%2d%20%43%70%2e%20%3c%69%3e%61%62%62%68%65%74%69%3c%2f%69%3e%2e%20%28%50%61%67%65%20%36%30%29
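The explain values are printed exactly as they are stored in the file, which is percent-encoded UTF-8. As a minimal sketch, assuming Python 3, the standard library function urllib.parse.unquote can turn such a value back into readable text. The encoded string below is only the beginning of the first explain value, shortened for readability; the variable names are illustrative.

```python
from urllib.parse import unquote

# Beginning of the first <explain> value from example.xml (truncated on purpose).
encoded = (
    "%3a%6e%2e%20%5b%61%62%68%69%2d%c4%81%79%c4%81%6e%61%5d"
    "%20%e5%87%ba%e7%bd%aa"
)

# unquote decodes %xx escapes and interprets the bytes as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)  # :n. [abhi-āyāna] 出罪
```

The last entry also contains percent-encoded HTML tags (for example %3c%65%6d%3e is <em>), so its decoded text is an HTML fragment rather than plain text.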
Python Library xml.dom.minidom Howto series:
[1]Python Library xml.dom.minidom Howto (1)
[2]Python Library xml.dom.minidom Howto (2)
[3]Python Library xml.dom.minidom Howto (3)
[4]Python Library xml.dom.minidom Howto (4)
[5]Python Library xml.dom.minidom Howto (5)
[6]Python Library xml.dom.minidom Howto (6)
[7]Python Library xml.dom.minidom Howto (7)
Reference: MiniDom - Python Wiki