Exporting Amazon Neptune Data to a Gremlin Server

Original link: https://spin.atomicobject.com/2021/10/13/exporting-neptune-data-gremlin-server/


Amazon Neptune is a service for hosting graph databases in the cloud. Managed services are great, but sometimes you just need to run some tests or do some other integration work without depending on a cloud. Although Amazon does not offer a way to run Neptune locally, Apache TinkerPop's Gremlin Server is a close enough approximation for testing and development purposes. It would be even better if you could clone the data from a Neptune database into a local Gremlin Server instance.

In the past I got used to the ease of something like this to replicate a remote database on my local machine:

ssh database-server mysqldump some-db | mysql some-db

The mysqldump tool was great because it produced a plain text file full of SQL statements that could be used to reconstitute the original database. And since it was a text file (not some proprietary binary serialization), the results could be edited by hand or processed by other scripts before loading into another database.

Such a simple equivalent does not appear to exist for graph databases. But there are tools to perform parts of this operation. They just require some glue to stick them all together.

Things you will need:

  • Maven
  • Python 3
  • Gremlin Server
  • Gremlin Console
  • Some shell scripting

Set up Gremlin Server.

Gremlin Server is the reference implementation of Apache TinkerPop and the Gremlin traversal language. And while Amazon Neptune is a TinkerPop-enabled graph database, it does have a few differences. So far the only difference I’ve found to be significant is the representation of IDs.

Gremlin Server uses long integers to identify vertices and edges by default. But Neptune uses string IDs, which are UUIDs unless you supply your own. While relationships in a graph database are typically modeled as edges, IDs may still be important for other reasons. So it would be best to preserve as much of the original graph as possible, including IDs.

Fortunately, Gremlin Server can be reconfigured to use UUIDs. Edit conf/tinkergraph-empty.properties, changing this:


to this:


Once this is done, you should be able to start the server using one of the scripts in bin/. We’ll come back to this in the final step.
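
If you would rather script that edit, here is a minimal sketch. It assumes the ID managers are set through TinkerGraph's gremlin.tinkergraph.vertexIdManager and gremlin.tinkergraph.edgeIdManager properties, and use_uuid_ids is a hypothetical helper name of my own, so check the property names against the file shipped with your Gremlin Server version before relying on it.

```shell
# Sketch: switch TinkerGraph's vertex and edge ID managers to UUID.
# Assumption: the two property names below match your TinkerPop version.
use_uuid_ids() {
    props="$1"
    tmp="$props.tmp"
    [ -f "$props" ] || return 1
    # Keep every line except existing vertex/edge ID manager settings.
    grep -v -E '^gremlin\.tinkergraph\.(vertex|edge)IdManager=' "$props" > "$tmp"
    # Append the UUID settings for both vertices and edges.
    printf '%s\n' \
        'gremlin.tinkergraph.vertexIdManager=UUID' \
        'gremlin.tinkergraph.edgeIdManager=UUID' >> "$tmp"
    mv "$tmp" "$props"
}
```

Call it as use_uuid_ids conf/tinkergraph-empty.properties before starting the server. TinkerGraph also accepts ANY as an ID manager value, which may be a better fit if your Neptune IDs are not all UUIDs.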

Export data from Neptune.

The next couple of steps depend largely on Amazon Neptune Tools. Clone this repository, and we’ll get started.

git clone https://github.com/awslabs/amazon-neptune-tools.git

The first tool we need is under neptune-export. The output of this tool is a collection of CSV files for all the edges and vertices in the graph. The format of the CSV files is the same as the one Neptune uses for loading data (which doesn’t matter for our purposes but could be useful).
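
For orientation, here is roughly what a vertex file in Neptune's load format looks like. The rows are invented for illustration, and header_columns is a hypothetical helper, not part of the export tool. System columns start with a tilde, and each property column carries a type suffix.

```shell
# Invented sample data in Neptune's Gremlin bulk-load CSV layout.
sample_csv=$(mktemp)
cat > "$sample_csv" <<'EOF'
~id,~label,name:String,age:Int
v1,person,Alice,34
v2,person,Bob,29
EOF

# Print the header columns, one per line.
header_columns() {
    head -n 1 "$1" | tr ',' '\n'
}

header_columns "$sample_csv"
```

Edge files follow the same idea, with ~from and ~to columns naming the vertices that each edge connects.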

To get this thing running, you’ll first need to install Maven. If you’re on macOS:

brew install maven

Then, from the neptune-export directory, run a quick build using Maven:

mvn clean install

Finally, we’re ready to export some graph data. You’ll need some way to get at your Neptune database endpoint. Whether that’s an SSH tunnel, VPN, or other AWS magic, you should have a connection string like “wss://localhost:8183/gremlin” (in my case, I’m using an SSH tunnel and forwarding port 8183, to keep it separate from the Gremlin Server running on port 8182 on my local machine).

The neptune-export tool takes the host and port from this connection string as separate command-line switches, so you’ll need to do something like this:

bin/neptune-export.sh export-pg -e localhost --port 8183 -d output-dir
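
If you keep the endpoint around as a URL, the split into host and port can be sketched in plain shell. parse_endpoint is a hypothetical helper (not something neptune-export provides), and falling back to port 8182 when the URL has none is my assumption. The last line only prints the command rather than running it.

```shell
# Sketch: split a Gremlin endpoint URL into "host port".
parse_endpoint() {
    url="$1"
    hostport="${url#*://}"      # strip the scheme, e.g. "wss://"
    hostport="${hostport%%/*}"  # strip the path, e.g. "/gremlin"
    host="${hostport%%:*}"
    port="${hostport##*:}"
    # No explicit port in the URL: assume 8182.
    if [ "$port" = "$hostport" ]; then
        port=8182
    fi
    echo "$host $port"
}

endpoint=$(parse_endpoint 'wss://localhost:8183/gremlin')
host=${endpoint% *}
port=${endpoint#* }
# Prints: bin/neptune-export.sh export-pg -e localhost --port 8183 -d output-dir
echo "bin/neptune-export.sh export-pg -e $host --port $port -d output-dir"
```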

When you run this, the tool will crawl your graph and write CSV files to a randomly-named subdirectory of the directory given by the -d flag.

Convert the data to Gremlin format.

So now we have all our graph data in CSV format. This is nice, but it can’t be directly imported into Gremlin Server. Fortunately, Amazon Neptune Tools contains another tool to help us out.

Head over to the csv-gremlin directory. This one requires Python 3 to run, and I also had to install a library:

pip3 install python-dateutil

This tool also has some important command-line switches to be aware of.

We need the -escape_dollar and -java_dates switches because we’ll be feeding the output to the Gremlin Console, which uses a language called Groovy. And since Neptune stores timestamps in UTC, you’ll want -assume_utc as well.

Putting it all together, we can process a single CSV file and get the resulting Groovy script on stdout:

python3 csv-gremlin.py -escape_dollar -java_dates -assume_utc csv-file.csv

The neptune-export tool will have produced many separate CSV files. This is because the first line of each CSV lists properties and their types, which tend to vary among vertices and edges with different labels. So you’ll want to loop over all those CSV files and collect all the output into a single Groovy script (just be sure to process vertices before edges, since an edge can only be created once both of its vertices exist). It might look something like this:

for csv in ../neptune-export/output-dir/*/{nodes,edges}/*.csv; do
    python3 csv-gremlin.py -escape_dollar -java_dates -assume_utc "$csv" >> data.groovy
done
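
The {nodes,edges} brace expansion is a bash feature, and it is what keeps vertices ahead of edges. If your shell lacks it, the ordering can be made explicit. This is a sketch under the assumption that the export directory contains nodes and edges subdirectories as in the loop above; list_csvs is a hypothetical helper name.

```shell
# Sketch: list exported CSV files, vertex files first, then edge files.
list_csvs() {
    export_dir="$1"
    for kind in nodes edges; do
        for csv in "$export_dir"/*/"$kind"/*.csv; do
            # Skip the literal pattern left behind when nothing matches.
            if [ -e "$csv" ]; then
                echo "$csv"
            fi
        done
    done
}

# Usage (start from an empty data.groovy so reruns do not append duplicates):
#   : > data.groovy
#   list_csvs ../neptune-export/output-dir | while read -r csv; do
#       python3 csv-gremlin.py -escape_dollar -java_dates -assume_utc "$csv" >> data.groovy
#   done
```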

The final step is to run data.groovy using the Gremlin Console. If you haven’t used the Gremlin Console yet, you’ll want to create an “autoexec” Groovy script for connecting to your Gremlin Server. This is what I use for connecting to my local Gremlin Server:

cluster = Cluster.build('localhost').port(8182).create()
:remote connect tinkerpop.server cluster
:remote console

Sadly, this won’t actually execute automatically, but you can tell the Gremlin Console to run scripts at startup using the -i (interactive) or -e (execute and quit) flags. So, combining this with our data script, we can finally import the graph data like so:

bin/gremlin.sh -e autoexec.groovy -e data.groovy

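Note that Gremlin Server has to be running for this import to work, so start it first using one of the scripts in bin/, as mentioned in the first step.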

Obviously this process could use some scripting! But at least all the major pieces are there, ready to be customized for any particular graph database setup.