How to index XML books?

 3 years ago
source link: https://www.codesd.com/item/how-to-index-xml-books.html

We have a collection of books stored as XML files, each about 20 MB in size. They all share the same regular structure, roughly as follows:

<book>
<volume id="vI"><title>PRIMARY CARE MEDICINE</title>
    <part id="vIpA"><title>General Issues and Approach to Disease in Primary Care Medicine</title>
        <section id="vIpAs1"><title>Core Issues and Special Groups in Primary Care</title>
            <chapter id="vIpAs1ca"><title>Core Issues in Primary Care</title>
                <subchapter id="vIpAs1casc1"><title>Introduction</title>
                    <para>Praesent et venenatis ipsum.</para>
                    …
                </subchapter>
            </chapter>
            <chapter id="vIpAs1cb"><title>Other Issues</title>
                <para>Etiam maximus orci orci, eu aliquam nunc pretium id.</para>
                …
            </chapter>
        </section>
    …
    </part>
…
</volume>
</book>

We want to make them full-text searchable with Lucene. Search results should show the titles of the elements in which the words occur.

  1. What are the appropriate tools to index such content? I came across several names like Solr, Tika and Digester, but it is not clear to me what they do.
  2. What if we now want to constrain the search to certain element types (e.g. titles)? Do the same tools apply?

To extract content from your XML files, you have a couple of options. The first is an XML parsing library: Java has many of them, and they are of course usable from Clojure, Scala or any other JVM-based language.
The second option is the one you mentioned, Apache Tika.
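
As a rough illustration of the extraction step, here is a minimal sketch that uses Python's standard-library `xml.etree.ElementTree.iterparse` instead of a Java library (the streaming idea is the same in either). It assumes the structure shown in the question. The function name `extract_paragraphs` and the record layout are made up for illustration: each `<para>` becomes one record that carries the titles of its enclosing elements, which is what the search results are supposed to display.

```python
import io
import xml.etree.ElementTree as ET

# Structural elements from the question's sample that carry an id and a <title>.
CONTAINERS = {"volume", "part", "section", "chapter", "subchapter"}

def extract_paragraphs(source):
    """Yield one record per <para>, with the titles of its enclosing elements."""
    stack = []  # one [id, title] entry per currently open container
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag in CONTAINERS:
                stack.append([elem.get("id"), None])
        elif elem.tag == "title":
            if stack:
                stack[-1][1] = "".join(elem.itertext()).strip()
        elif elem.tag == "para":
            yield {
                "id": stack[-1][0] if stack else None,
                "titles": [title for _, title in stack if title],
                "text": "".join(elem.itertext()).strip(),
            }
            elem.clear()  # release text we no longer need
        elif elem.tag in CONTAINERS:
            stack.pop()
            elem.clear()

# Shortened version of the sample from the question.
SAMPLE = b"""<book>
<volume id="vI"><title>PRIMARY CARE MEDICINE</title>
  <part id="vIpA"><title>General Issues</title>
    <chapter id="vIpAs1ca"><title>Core Issues in Primary Care</title>
      <subchapter id="vIpAs1casc1"><title>Introduction</title>
        <para>Praesent et venenatis ipsum.</para>
      </subchapter>
    </chapter>
  </part>
</volume>
</book>"""

records = list(extract_paragraphs(io.BytesIO(SAMPLE)))
for record in records:
    print(record["id"], " > ".join(record["titles"]))
```

Each record could then be sent to the indexer as one document, with the title path and the paragraph text stored in separate fields.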
The core of Apache Solr (and Elasticsearch, by the way) is Apache Lucene. If you use Lucene directly, a Java API is your only option. But what if you want to use PHP, Python or Erlang, for example?
Put very simply, what Apache Solr (and Elasticsearch) provide is an HTTP interface to the Lucene API (and more, of course).
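
To make the "HTTP interface" point concrete, here is a small sketch of querying Solr from Python with nothing but the standard library. The core name `books` and the base URL (Solr's default port on localhost) are assumptions, as are the helper names `select_url` and `search`; only the `/select` handler and the `q`, `rows` and `wt` parameters are standard Solr.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def select_url(query, core="books", base="http://localhost:8983/solr", rows=10):
    """Build a URL for Solr's /select handler (core name and base are assumed)."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base}/{core}/select?{params}"

def search(query, **kwargs):
    """Run the query against a live Solr instance and return the matching docs."""
    with urlopen(select_url(query, **kwargs)) as response:
        return json.load(response)["response"]["docs"]

print(select_url("primary care"))
```

Calling `search("primary care")` only works with a running Solr instance that has such a core; `select_url` on its own just shows what the request looks like.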

What if we now want to constrain search to certain element types (e.g. titles)? Do the same tools apply?

Yes. Lucene, Solr and Elasticsearch all support this: index the titles in a field of their own and restrict the query to that field.
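
For example, in Lucene query syntax (which Solr's standard query parser also accepts) a search is limited to one field by prefixing the term or phrase with the field name. The sketch below assumes the titles were indexed in a field called `title`; that field name and the helper `field_query` are hypothetical.

```python
from urllib.parse import urlencode

def field_query(field, phrase):
    """Return a Lucene phrase query restricted to one field, e.g. title:"primary care"."""
    # Inside a quoted phrase only the backslash and the double quote need escaping.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'

query = field_query("title", "primary care")
print(query)                    # the query string itself
print(urlencode({"q": query}))  # as it would appear in a Solr request URL
```

The same string can be handed to Lucene's query parser directly or sent as the `q` parameter of a Solr request.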

