7

Perl and XML in 2021: A few lessons learned

 4 years ago
source link: https://phoenixtrap.com/2021/03/27/perl-and-xml-in-2021-a-few-lessons-learned/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Post navigation

It’s been years since I’ve had to hack on any­thing XML-relat­ed, but a recent project at work has me once again jump­ing into the waters of gen­er­at­ing, pars­ing, and mod­i­fy­ing this 90s-era doc­u­ment for­mat. Most devel­op­ers these days like­ly only know of it as part of the curi­ous­ly-named XML­HTTPRe­quest object in web browsers used to retrieve data in JSON for­mat from servers, and as the “X” in AJAX. But here we are in 2021, and there are still plen­ty of APIs and doc­u­ments using XML to get their work done.

In my par­tic­u­lar case, the task is to update the API calls for a new ver­sion of Vir­tuoz­zo Automa­tor. Its API is a bit unusu­al in that it does­n’t use HTTP, but rather relies on open­ing a TLS-encrypt­ed sock­et to the serv­er and exchang­ing doc­u­ments delim­it­ed with a null char­ac­ter. The pre­vi­ous ver­sion of our code is in 1990s-sysad­min-style Perl, with man­u­al blessing of objects and pars­ing the XML using reg­u­lar expres­sions. I’ve decid­ed to update it to use the Moo object sys­tem and a prop­er XML pars­er. But which pars­er and mod­ule to use?

Selecting a parser

There are sev­er­al gener­ic XML mod­ules for pars­ing and gen­er­at­ing XML on CPAN, each with its own advan­tages and dis­ad­van­tages. I’d like to say that I did a com­pre­hen­sive sur­vey of each of them, but this project is pressed for time (aren’t they all?) and I did­n’t want to cre­ate too many extra depen­den­cies in our Perl stack. Luck­i­ly, XML::LibXML is already avail­able, I’ve had some pre­vi­ous expe­ri­ence with it, and it’s a good choice for per­for­mant stan­dards-based XML pars­ing (using either DOM or SAX) and generation.

Giv­en more time and lee­way in adding depen­den­cies, I might use some­thing else. If the Vir­tuoz­zo API had an XML Schema or used SOAP, I would con­sid­er XML::Compile as I’ve had some suc­cess with that in oth­er projects. But even that uses XML::LibXML under the hood, so I’d still be using that. Your mileage may vary.

Generating XML

Depend­ing on the size and com­plex­i­ty of the XML doc­u­ments to gen­er­ate, you might choose to build them up node by node using XML::LibXML::Node and XML::LibXML::Element objects. Most of the mes­sages I’m send­ing to Vir­tuoz­zo Automa­tor are short and have eas­i­ly-inter­po­lat­ed val­ues, so I’m using here-doc­u­ment islands of XML inside my Perl code. This also has the advan­tage of being eas­i­ly val­i­dat­ed against the exam­ples in the documentation.

Where the inter­po­lat­ed val­ues in the mes­sages are a lit­tle com­pli­cat­ed, I’m using this idiom inside the here-docs:

@{[ ... ]}

This allows me to put an arbi­trary expres­sion in the … part, which is then put into an anony­mous array ref­er­ence, which is then imme­di­ate­ly deref­er­enced into its string result. It’s a cheap and cheer­ful way to do min­i­mal tem­plat­ing inside Perl strings with­out load­ing a full tem­plat­ing library; I’ve also had suc­cess using this tech­nique when gen­er­at­ing SQL for data­base queries.

Parser as an object attribute

Rather than instan­ti­ate a new XML::LibXML in every method that needs to parse a doc­u­ment, I cre­at­ed a pri­vate attribute:

package Local::API::Virtozzo::Agent {
    use Moo;
    use XML::LibXML;
    use Types::Standard qw(InstanceOf);
    ...
    has _parser => (
        is      => 'ro',
        isa     => InstanceOf['XML::LibXML'],
        default => sub { XML::LibXML->new() },
    );
    sub foo {
        my $self = shift;
        my $send_doc = $self->_parser
          ->parse_string(<<"END_XML");
            <foo/>
END_XML
        ...
    }
...
}

Boilerplate

XML doc­u­ments can be ver­bose, with ele­ments that rarely change in every doc­u­ment. In the Vir­tuoz­zo API’s case, every doc­u­ment has a <packet> ele­ment con­tain­ing a version attribute and an id attribute to match requests to respons­es. I wrote a sim­ple func­tion to wrap my doc­u­ments in this ele­ment that pulled the ver­sion from a con­stant and always increased the id by one every time it’s called:

sub _wrap_packet {
    state $send_id = 1;
    return qq(<packet version="$PACKET_VERSION" id=")
      . $send_id++ . '">' . shift . '</packet>';
}

If I need to add more attrib­ut­es to the <packet> ele­ment (for instance, name­spaces for attrib­ut­es in enclosed ele­ments, I can always use XML::LibXML::Element::setAttribute after pars­ing the doc­u­ment string.

Parsing responses with XPath

Rather than using brit­tle reg­u­lar expres­sions to extract data from the response, I use the shared pars­er object from above and then the full pow­er of XPath:

use English;
...
sub get_sampleID {
    my ($self, $sample_name) = @_;
    ...
    # used to separate documents
    local $INPUT_RECORD_SEPARATOR = "\0";
    # $self->_sock is the IO::Socket::SSL connection
    my $get_doc = $self->_parser( parse_string(
      $self->_sock->getline(),
    ) );
    my $sample_id = $get_doc->findvalue(
        qq(//ns3:id[following-sibling::ns3:name="$sample_name"]),
    );
    return $sample_id;
}

This way, even if the order of ele­ments change or more ele­ments are intro­duced, the XPath pat­terns will con­tin­ue to find the right data.

Conclusion… so far

I’m only about halfway through updat­ing these API calls, and I’ve left out some non-XML-relat­ed details such as set­ting up the TLS sock­et con­nec­tion. Hope­ful­ly this arti­cle has giv­en you a taste of what’s involved in XML pro­cess­ing these days. Please leave me a com­ment if you have any sug­ges­tions or questions.

Like this:

Load­ing…

Recommend

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK