How to index XML books?


We have a collection of books stored as XML files, each about 20 MB in size. They all share the same regular structure, which is roughly as follows:

<book>
<volume id="vI"><title>PRIMARY CARE MEDICINE</title>
    <part id="vIpA"><title>General Issues and Approach to Disease in Primary Care Medicine</title>
        <section id="vIpAs1"><title>Core Issues and Special Groups in Primary Care</title>
            <chapter id="vIpAs1ca"><title>Core Issues in Primary Care</title>
                <subchapter id="vIpAs1casc1"><title>Introduction</title>
                    <para>Praesent et venenatis ipsum.</para>
                    …
                </subchapter>
            </chapter>
            <chapter id="vIpAs1cb"><title>Other Issues</title>
                <para>Etiam maximus orci orci, eu aliquam nunc pretium id.</para>
                …
            </chapter>
        </section>
    …
    </part>
…
</volume>
</book>

We want to make them full-text searchable with Lucene. Search results would show the titles under which the words occur.

What are the appropriate tools for indexing such content? I came across several names, such as Solr, Tika and Digester, but it is not clear to me what they do.
What if we now want to constrain search to certain element types (e.g. titles)? Do the same tools apply?

To extract the content from your XML files, you have a couple of options. The first is an XML processing library: Java has many of them, and they are of course usable from Clojure, Scala or any other JVM-based language.
The second option is the one you mentioned, Apache Tika.
The core of Apache Solr (and Elasticsearch, by the way) is Apache Lucene. If you use Lucene directly, the Java API is your only option. But what if you want to use PHP, Python or Erlang, for example?
In very simple terms, what Apache Solr (and Elasticsearch) provide is an HTTP interface to the Lucene API (and more, of course).
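As an illustration of the extraction step, here is a minimal sketch in Python using only the standard library. It assumes the element names from the sample in the question (`volume`, `part`, `section`, `chapter`, `subchapter`, `para`, `title`); the function name `extract_documents` and the dictionary keys are hypothetical and chosen only for this example. It turns every `<para>` into a flat record that carries the chain of enclosing titles, which is roughly the shape a Lucene-based index expects.

```python
import xml.etree.ElementTree as ET

# Elements that carry a <title> child in the sample structure (assumption).
CONTAINERS = {"volume", "part", "section", "chapter", "subchapter"}

SAMPLE = """<book>
<volume id="vI"><title>PRIMARY CARE MEDICINE</title>
  <part id="vIpA"><title>General Issues</title>
    <chapter id="vIpAs1ca"><title>Core Issues in Primary Care</title>
      <para>Praesent et venenatis ipsum.</para>
    </chapter>
  </part>
</volume>
</book>"""


def extract_documents(xml_text):
    """Return one dict per <para>, together with the titles that enclose it."""
    root = ET.fromstring(xml_text)
    docs = []

    def walk(element, titles, container_id):
        if element.tag in CONTAINERS:
            # Extend the title chain and remember the nearest container id.
            title = element.findtext("title", default="").strip()
            titles = titles + [title]
            container_id = element.get("id")
        for child in element:
            if child.tag == "para":
                docs.append({
                    "id": f"{container_id}-{len(docs)}",  # hypothetical id scheme
                    "container_id": container_id,
                    "titles": titles,
                    "text": "".join(child.itertext()).strip(),
                })
            elif child.tag != "title":
                walk(child, titles, container_id)

    walk(root, [], None)
    return docs


docs = extract_documents(SAMPLE)
for doc in docs:
    print(doc["container_id"], " > ".join(doc["titles"]))
```

This version loads the whole file into memory, which should be acceptable for files of about 20 MB; a streaming parser such as `xml.etree.ElementTree.iterparse` would be an alternative if memory becomes a concern. The resulting records still have to be sent to whichever indexer you choose.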

What if we now want to constrain search to certain element types (e.g. titles)? Do the same tools apply?

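If we are talking about Lucene, Solr or Elasticsearch, then yes, of course. Each of them indexes a document as a set of named fields, so you can keep titles in a field of their own and restrict a query to that field.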
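To make the field restriction concrete, here is a small sketch that builds a Solr query URL limited to one field, using the standard `field:(terms)` query syntax and the `/select` handler. The host, the core name `books`, and the field names `titles` and `id` are assumptions for illustration and would have to match your own schema. The helper `solr_select_url` is hypothetical and does not escape special query characters in the search words.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, field, words):
    """Build a Solr /select URL whose query is restricted to a single field."""
    # field:(w1 w2) applies every word to the named field only.
    query = f"{field}:({' '.join(words)})"
    params = {
        "q": query,
        "fl": "id,titles",  # fields to return (assumed schema)
        "wt": "json",
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical local Solr instance and core name.
url = solr_select_url("http://localhost:8983", "books", "titles", ["primary", "care"])
print(url)
```

Searching the full text instead would only mean naming a different field, for example one that holds the paragraph text.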
