Using Your Vector Database as a JSON (Or Relational) Datastore

source link: https://zilliz.com/blog/using-your-vector-database-as-JSON-or-relational-datastore

Mar 04, 2024 (7 min read)

By Frank Liu

TL;DR: Vector databases support CRUD over "traditional" data formats such as JSON. If you're a solo developer or a small team and don't want to manage many different pieces of data infrastructure, you can use Milvus or Zilliz Cloud (the managed Milvus) as your only datastore and easily migrate vectorless collections to different databases as you scale.

Powered by the popularity of ChatGPT and other autoregressive language models, vector search has exploded in popularity in the past year. As a result, we've seen many companies and organizations hop on the vector search bandwagon, from NoSQL database providers such as MongoDB (via Atlas Vector Search) to traditional relational databases such as Postgres (via pgvector). The general messaging I hear around these vector search plugins is largely the same and goes something like this: developers should stick with us since you can store tables/JSON in addition to vectors, so there is no need to manage multiple pieces of infrastructure!

This kind of statement always cracks me up, as it's clearly crafted by unsophisticated marketing teams. Not only is the technology behind vector search vastly different from storage and querying strategies in relational & NoSQL databases, but it's fairly well-known now that vector databases can store relations, JSON documents, and other structured data sources. The first point is difficult to illustrate concisely without deep prior knowledge of database management systems, but the second point is fairly easy to show through some short sample code snippets. That's what this blog post is dedicated to.

Setting Up

Milvus stores data in units known as collections, analogous to tables in relational databases. Each collection has its own schema, and each schema includes a vector field of fixed dimensionality, e.g. 768 for vector embeddings based on e5-base. Let's create a collection to store JSON documents rather than vector data; the vector field below is just a one-dimensional placeholder. To keep the example short, I've left out some of the earlier and later steps, such as calling connections.connect:

from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="_unused", dtype=DataType.FLOAT_VECTOR, dim=1)
]

schema = CollectionSchema(fields, "NoSQL data", enable_dynamic_field=True)
collection = Collection("json_data", schema)
# load() expects an index on the vector field; see create_index in the full script
collection.load()

Here, we've configured the collection to use Milvus' dynamic schema capability, which lets it accept JSON payloads as "extra" data associated with each row.

I hope you weren't expecting more, 'cause that's it - it's really that simple.

CRUD Operations

You can interact with the collection we've created above as you would do with any other database. As an example, let's download the list of current US senators (in JSON format) from govtrack:

import requests
r = requests.get("https://www.govtrack.us/api/v2/role?current=true&role_type=senator")
data = r.json()["objects"]
len(data)

We can store these documents directly in Milvus with just a tiny bit of data wrangling:

rows = [{"_unused": [0]} | d for d in data]
collection.insert(rows)
(insert count: 100, delete count: 0, upsert count: 0, ...
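
The wrangling step above can be wrapped in a small helper. This is a minimal sketch, assuming the placeholder vector field is named _unused as in the schema above; to_rows is a hypothetical name and not part of pymilvus.

```python
from typing import Any

PLACEHOLDER_FIELD = "_unused"  # assumed to match the vector field in the schema


def to_rows(documents: list[dict[str, Any]], dim: int = 1) -> list[dict[str, Any]]:
    """Attach a zero-filled placeholder vector to each JSON document."""
    rows = []
    for doc in documents:
        if PLACEHOLDER_FIELD in doc:
            raise ValueError(f"document already has a {PLACEHOLDER_FIELD!r} key")
        # Build a new dict so the caller's documents are left untouched.
        rows.append({PLACEHOLDER_FIELD: [0.0] * dim, **doc})
    return rows
```

The result can be passed to collection.insert in place of the list comprehension; raising on a key clash avoids silently overwriting a document field that happens to share the placeholder's name.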

From here, we can perform queries directly over that data:

collection.query(
    expr="party like 'Dem%'",
    limit=1,
    output_fields=["person"]
)
[{'person': {'bioguideid': 'C000127',
   'birthday': '1958-10-13',
   'cspanid': 26137,
   'fediverse_webfinger': None,
   'firstname': 'Maria',
   'gender': 'female',
   'gender_label': 'Female',
   'lastname': 'Cantwell',
   'link': 'https://www.govtrack.us/congress/members/maria_cantwell/300018',
   'middlename': '',
   'name': 'Sen. Maria Cantwell [D-WA]',
   'namemod': '',
   'nickname': '',
   'osid': 'N00007836',
   'pvsid': None,
   'sortname': 'Cantwell, Maria (Sen.) [D-WA]',
   'twitterid': 'SenatorCantwell',
   'youtubeid': 'SenatorCantwell'},
  'id': 447376465724036249}]

Without specifying any vector field, we've queried our database for the first result where the party field in the accompanying JSON payload has "Dem" (Democrat) as a prefix.

This step hopefully demonstrates the capability to perform structured data searches in Milvus, but it's not a particularly useful query. Let's find all senators from my home state of Oregon:

collection.query(
    expr="state in ['OR']",
    limit=10,
    output_fields=["person"]
)
[{'person': {'bioguideid': 'M001176',
   'birthday': '1956-10-24',
   'cspanid': 1029842,
   'fediverse_webfinger': None,
   'firstname': 'Jeff',
   'gender': 'male',
   'gender_label': 'Male',
   'lastname': 'Merkley',
   'link': 'https://www.govtrack.us/congress/members/jeff_merkley/412325',
   'middlename': '',
   'name': 'Sen. Jeff Merkley [D-OR]',
   'namemod': '',
   'nickname': '',
   'osid': 'N00029303',
   'pvsid': None,
   'sortname': 'Merkley, Jeff (Sen.) [D-OR]',
   'twitterid': 'SenJeffMerkley',
   'youtubeid': 'SenatorJeffMerkley'},
  'id': 447376465724036286},
 {'person': {'bioguideid': 'W000779',
   'birthday': '1949-05-03',
   'cspanid': 1962,
   'fediverse_webfinger': None,
   'firstname': 'Ron',
   'gender': 'male',
   'gender_label': 'Male',
   'lastname': 'Wyden',
   'link': 'https://www.govtrack.us/congress/members/ron_wyden/300100',
   'middlename': '',
   'name': 'Sen. Ron Wyden [D-OR]',
   'namemod': '',
   'nickname': '',
   'osid': 'N00007724',
   'pvsid': None,
   'sortname': 'Wyden, Ron (Sen.) [D-OR]',
   'twitterid': 'RonWyden',
   'youtubeid': 'senronwyden'},
  'id': 447376465724036331}]

Even though we specified limit=10, only two documents were returned (since each state has only two senators). Let's narrow down our query even more to get only the senior senator:

collection.query(
    expr="state in ['OR'] and senator_rank in ['senior']",
    limit=10,
    output_fields=["person"]
)
[{'person': {'bioguideid': 'W000779',
   'birthday': '1949-05-03',
   'cspanid': 1962,
   'fediverse_webfinger': None,
   'firstname': 'Ron',
   'gender': 'male',
   'gender_label': 'Male',
   'lastname': 'Wyden',
   'link': 'https://www.govtrack.us/congress/members/ron_wyden/300100',
   'middlename': '',
   'name': 'Sen. Ron Wyden [D-OR]',
   'namemod': '',
   'nickname': '',
   'osid': 'N00007724',
   'pvsid': None,
   'sortname': 'Wyden, Ron (Sen.) [D-OR]',
   'twitterid': 'RonWyden',
   'youtubeid': 'senronwyden'},
  'id': 447376465724036331}]
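
Filter expressions like the ones above are plain strings, so they can be assembled programmatically. Here's a minimal sketch, assuming string-valued fields and exact matching only; build_expr is a hypothetical helper rather than a pymilvus function, and it rejects values containing quotes or backslashes instead of trying to escape them.

```python
def build_expr(**conditions: str) -> str:
    """Join `field in ['value']` clauses with `and` into one filter string."""
    clauses = []
    for field, value in conditions.items():
        # Refuse quotes and backslashes rather than guess at escaping rules.
        if any(ch in value for ch in "'\"\\"):
            raise ValueError(f"unsupported character in value for {field!r}")
        clauses.append(f"{field} in ['{value}']")
    return " and ".join(clauses)
```

Calling build_expr(state="OR", senator_rank="senior") produces the same string as the hand-written expression in the last query.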

Perhaps I would like to update Senator Ron Wyden's profile with a bit more info. We can easily do so by retrieving the entire document with output_fields=["*"], updating the resulting document, and inserting it back into our database without the old primary key:

expr = "state in ['OR'] and senator_rank in ['senior']"
res = collection.query(
    expr=expr,
    limit=10,
    output_fields=["*"]
)
res[0].update({
    "elected_in": "1996",
    "college": "Stanford University",
    "college_major": "Political Science"
})
del res[0]["id"]
collection.delete(expr)
collection.insert(res)
(insert count: 1, delete count: 0, upsert count: 0, ...

Let's see if this worked as expected.

collection.query(
    expr=expr,
    limit=10,
    output_fields=["elected_in", "college", "college_major"]
)
[{'elected_in': '1996',
  'college': 'Stanford University',
  'college_major': 'Political Science',
  'id': 447376465724036353}]

The data indeed matches the updates we made.
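
The read-modify-write steps above can be collected into one function. This is a minimal sketch, assuming a collection object that exposes the query, delete, and insert methods used above and an auto-generated primary key named id; update_documents is a hypothetical helper.

```python
from typing import Any


def update_documents(collection: Any, expr: str, changes: dict[str, Any]) -> int:
    """Apply `changes` to every row matching `expr`; return the row count."""
    rows = collection.query(expr=expr, output_fields=["*"])
    if not rows:
        return 0
    updated = []
    for row in rows:
        new_row = {**row, **changes}
        # Drop the old primary key; this assumes auto_id assigns a new one.
        new_row.pop("id", None)
        updated.append(new_row)
    # Delete and insert are two separate calls, so the update is not atomic.
    collection.delete(expr)
    collection.insert(updated)
    return len(updated)
```

Because the old rows are deleted before the new ones are inserted, a failure between the two calls would lose data; a more cautious variant could insert first and delete the old primary keys afterwards.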

The Full Script

Here's the whole script from front to back for convenience without the extraneous queries. I used our Zilliz Cloud free tier instead of milvus-lite:

import os

from milvus import default_server
from pymilvus import connections
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema
import requests


# Uncomment this if you're using `milvus-lite`
#default_server.start()
#connections.connect(host="127.0.0.1", port=default_server.listen_port)


# Uncomment this if you're using Zilliz Cloud
connections.connect(
    uri=os.environ["ZILLIZ_URI"],
    token=os.environ["ZILLIZ_TOKEN"]
)


# Create the schema for our new collection. We turn on Milvus' dynamic
# schema capability in order to store arbitrarily large (or small) JSON blobs
# in each row.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="_unused", dtype=DataType.FLOAT_VECTOR, dim=1)
]
schema = CollectionSchema(fields, "Milvus as a JSON datastore", enable_dynamic_field=True)


# Now let's create the collection.
collection = Collection("json_data", schema)
index_params = {
    "index_type": "AUTOINDEX",
    "metric_type": "L2",
    "params": {}
}
collection.create_index(
    field_name="_unused",
    index_params=index_params
)
collection.load()


# Insert US senator JSON data into our newly formed collection.
r = requests.get("https://www.govtrack.us/api/v2/role?current=true&role_type=senator")
data = r.json()["objects"]
rows = [{"_unused": [0]} | d for d in data]
collection.insert(rows)


# Fetch the senior senator from Oregon
top_k = 1
collection.query(
    expr="state in ['OR'] and senator_rank in ['senior']",
    limit=top_k,
    output_fields=["person"]
)
[{'person': {'bioguideid': 'W000779',
   'birthday': '1949-05-03',
   'cspanid': 1962,
   'fediverse_webfinger': None,
   'firstname': 'Ron',
   'gender': 'male',
   'gender_label': 'Male',
   'lastname': 'Wyden',
   'link': 'https://www.govtrack.us/congress/members/ron_wyden/300100',
   'middlename': '',
   'name': 'Sen. Ron Wyden [D-OR]',
   'namemod': '',
   'nickname': '',
   'osid': 'N00007724',
   'pvsid': None,
   'sortname': 'Wyden, Ron (Sen.) [D-OR]',
   'twitterid': 'RonWyden',
   'youtubeid': 'senronwyden'},
  'id': 447376465724036331}]

Try it for yourself!

pymongo -> milvusmongo

One last note before wrapping up - I've created a small Python package called milvusmongo that implements pymongo's most basic CRUD functionality across collections. It uses Milvus as the underlying database rather than MongoDB. Like pymongo, milvusmongo supports both dictionary and attribute-style access and abstracts away extra logic needed for CRUD calls (i.e. insert_one, update_one, and delete_one). You can install it with pip install milvusmongo:

% pip install milvus
% pip install milvusmongo

Once that's done, you can start up an embedded Milvus instance and use it in conjunction with milvusmongo to perform queries over JSON data. For example:

from milvus import default_server
default_server.start()
from milvusmongo import MongoClient
client = MongoClient("127.0.0.1", 19530)
client.insert_one(my_document)

Please note that this library is meant to demonstrate Milvus' flexibility as a datastore rather than serve as something you should use in large-scale production environments.

Closing Words

Milvus isn't here to replace NoSQL databases or lexical text search engines; we're here to provide you with the best possible vector/filtered search experience. More importantly, we're here to help accelerate the adoption of vector search as a technology - it's why open source is such a core part of our ethos.

But that doesn't mean we don't support other types of data. As a solo developer or a small startup, you're free to use Milvus as your only data store. You can always optimize your infrastructure usage later as you grow. Milvus will provide you with best-in-class vector search capabilities, while other databases are there to store, index, and search other forms of data. Once your application starts requiring more complex workloads (such as joins or aggregations), that's when you'll want to contemplate using different data stores.
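
Moving a vectorless collection to another datastore mostly means dropping the Milvus-specific fields. Here's a minimal sketch, assuming the rows were fetched with output_fields=["*"] and that the placeholder vector and auto-generated key are named _unused and id as in this post; export_jsonl is a hypothetical helper.

```python
import json
from typing import Any, Iterable

# Assumed names of the fields that only exist for Milvus' benefit.
MILVUS_ONLY_FIELDS = ("id", "_unused")


def export_jsonl(rows: Iterable[dict[str, Any]]) -> str:
    """Serialize rows as JSON Lines, without the Milvus-specific fields."""
    lines = []
    for row in rows:
        doc = {k: v for k, v in row.items() if k not in MILVUS_ONLY_FIELDS}
        lines.append(json.dumps(doc, sort_keys=True))
    return "\n".join(lines)
```

The resulting text has one JSON document per line, a format that many document stores and bulk loaders can ingest.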

  • Frank Liu

    Frank Liu is the Director of Operations & ML Architect at Zilliz, where he serves as a maintainer for the Towhee open-source project. Prior to Zilliz, Frank co-founded Orion Innovations, an ML-powered indoor positioning startup based in Shanghai and worked as an ML engineer at Yahoo in San Francisco. In his free time, Frank enjoys playing chess, swimming, and powerlifting. Frank holds MS and BS degrees in Electrical Engineering from Stanford University.

