Content that works everywhere — a history and a future of cross-use rich text formats

Jun 17th 2022 engineering

Remember when “content” looked like this?

[Image: a punched card]

Generously offered under the Creative Commons Attribution 2.0 Generic license by Pete Birkinshaw on Wikimedia.

Ah, the good old days.

Nowadays, “content” often looks like this:

[Image: content in WAV format]

How far technology has come.

Don’t get me wrong, both of these formats (punched card and WAV, respectively) have their merits and are actually quite good at their respective tasks. But the fact remains that for literally decades now, we’ve been searching for a single format to contain “content” of all types. We haven’t quite failed — many attempts have been made with varying amounts of success — but we haven’t exactly hit a home run, either.

Most attempts at this have (appropriately) narrowed the scope to just rich text. After all, it’s reasonable to drop support for clearly distinct types of content like video if we could create something superior that only worked for rich text.

So in this article, I’m going to delve into three of the best-known attempts at cross-platform, fully-featured rich text formats of the past — XML, Markdown, and JSON — and three modern takes on the problem — Notion, ExtraMark, and Sanity’s Structured Content.

When XML was created, it was forced to answer the question: “What stuff are we storing?” Comparable formats like HTML had a definitive answer to that question (in HTML’s case, website layout). XML’s answer was pretty much “whatever you want”, hence the eXtensible in the acronym.

That can be a good thing, at times. For example, there’s a spec called NITF used by the news industry to format feeds that other tools can consolidate into cross-journal article collections like Google News. That’s only possible because XML let the creators of NITF pick whatever tags they wanted to include.

That extensibility can be a bad thing, at times, too. That lack of specificity in the format means learning XML doesn’t actually teach you a whole lot about using it. You still need to worry about the details of the specification because that’s what will tell you how to actually access the data stored in the XML.

That extensibility also makes XML files unnecessarily long! Say I want to create a heading for my content. In XML, I need to arbitrarily define the <heading> element, or whatever I’d like to call it, and make sure that the existence of that element is clear to whoever will read this data later on. Often that’s done with a Document Type Definition. Then, we’ve got to somehow convey the meaning of this information, because besides the rather self-explanatory element title, the element doesn’t really tell us much. What exactly is a heading? Two different people reading this data may interpret <heading> differently. All of this takes up a lot of space — it’s so verbose that later efforts ended up swinging to the opposite extreme.
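
To make that concrete, here’s a minimal sketch using the XML parser from Python’s standard library. The two documents and their tag names (heading and ArticleTitle) are made up for illustration; the point is only that a generic parser will read both, while the code consuming them still has to know each vocabulary.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies describing the same thing: a title.
doc_a = "<article><heading>Punched cards</heading></article>"
doc_b = "<article><ArticleTitle>Punched cards</ArticleTitle></article>"


def find_title(xml_text, tag_name):
    """Return the text of the first child element named tag_name, or None."""
    root = ET.fromstring(xml_text)
    element = root.find(tag_name)
    return element.text if element is not None else None


# The parser handles both documents, but the caller has to know which tag
# each vocabulary uses; nothing in XML itself says either one means "title".
title_a = find_title(doc_a, "heading")
title_b = find_title(doc_b, "heading")            # wrong vocabulary: no match
title_b_fixed = find_title(doc_b, "ArticleTitle")
print(title_a, title_b, title_b_fixed)
```

Knowing XML got us a parsed tree in both cases, but only knowing the specific spec told us where the title lives.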

Markdown

The creators of Markdown saw the mess that was defining something as simple as a heading in XML, thought it was ridiculous, and came up with this:

# Heading

I’d call that the opposite extreme.

Wrapped up in that tiny little hashtag (pound sign? just hash? sharp? octothorpe?) is the entire concept of the highest-order heading. It posits that the topic of the content that the heading is for is irrelevant, so it doesn’t need to have different markers for <ArticleTitle>, or <ProductName>, or <CompanyName>. A single mark for the highest-order heading does just fine.
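
As a rough illustration of how little there is to interpret, here’s a small Python sketch that turns a hash-prefixed line into an HTML heading. It’s a toy that assumes one heading or paragraph per line, not a real Markdown parser, and the function name render_line is my own.

```python
import html
import re

# A heading line: one to six '#' characters, whitespace, then the text.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.*?)\s*$")


def render_line(line):
    """Turn a Markdown heading line into an HTML heading; else a paragraph."""
    match = HEADING.match(line)
    if match:
        level = len(match.group(1))   # the number of hashes is the heading level
        return f"<h{level}>{html.escape(match.group(2))}</h{level}>"
    return f"<p>{html.escape(line)}</p>"


top = render_line("# Punched cards")
sub = render_line("## Ah, the good old days")
not_heading = render_line("#hashtag")   # no space after the hash: plain text
print(top, sub, not_heading, sep="\n")
```

Notice that the only thing the hash carries is the heading level. There’s nowhere to say what the heading is a heading of.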

That leads to the opposite problem from the one XML had! Where XML shone in its flexibility, Markdown doesn’t. In fact, when you want to store a product that can be sold through your website, it’d seem rather unhelpful to have a field in your database known just as heading. No, you’d likely want that to be more descriptive, especially because your product’s name will sometimes show up as a heading, or sometimes as a list item, or sometimes in plain text on the page. So in the end, while Markdown makes presentation on the Web easier by catering to actual website layout design, it doesn’t actually label the information it stores usefully, which is the point of a rich text format in the first place. It’s convenient in some use cases, but at the cost of being pointless in almost every other use case.

But even using Markdown where it was intended has an issue or two. For example, you’d expect that if it’s going to home in on representing elements in web design, it’d go all in. Yet Markdown has no analog for a significant chunk of the ways HTML can represent rich text. Add to that the conundrum that there technically isn’t one single standard for Markdown, so which features a given implementation supports is a bit up in the air. The positive is that this can be easily fixed! Let’s put a pin in this for now and come back later to the most recent attempts to fix it.

In 2001, a developer decided to serialize a JavaScript object so that it could be sent as a payload in a protocol that could only send strings, like HTTP. Because it was familiar (starting out as a subset of JavaScript), but also cross-environment (parsers are available for pretty much every language), JSON quickly became one of the leading data transmission formats. Because it could store and transport nearly anything, though, it wasn’t long before folks started trying to use it for rich text.
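
That round trip is easy to sketch. Here’s a minimal Python version using the standard json module; the product object is a made-up example, and the “channel” is just a variable standing in for whatever text-only transport you have.

```python
import json

# A hypothetical in-memory object we want to send over a text-only channel.
product = {
    "name": "Punched card",
    "price": 4.5,
    "tags": ["retro", "paper"],
    "sold_out": False,
}

payload = json.dumps(product)    # object -> string, ready to send
restored = json.loads(payload)   # string -> object, on the receiving side

print(payload)
```

The receiving side gets back the same structure with the same types, which is exactly what made the format so easy to adopt outside JavaScript.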

But as seems to be the recurring theme of this article so far: there was a fatal flaw. We had swung back to the infinite extensibility of XML, so we’re back to where learning the format doesn’t teach us anything about using it. That said, properties in JSON are a little clearer than in XML (this is my highly subjective opinion, but given their relative popularities, I’d say a lot of people agree with me here): even the “content” of the block that an object describes is just another property. That might sound a smidgeon confusing spelled out, but here’s an example:

XML


<product-description
	sold-out="true"
>
	This is the product description.
</product-description>

JSON


{
	"product-description": {
		"sold-out": true,
		"content": "This is the product description."
	}
}

For comparison, the XML example uses two different ways of storing data (split out for demonstration, not because this is best practice). You could either store simple data as an attribute on the parent element itself, or as child nodes of that parent element. When to use each of these methods can get a little confusing, especially if you don’t know the eventual development direction of the program that will consume this XML. On the other hand, the JSON example treats all properties equally. They can be strings, booleans, numbers, null, arrays, or sub-objects, all of which are datatypes most developers are familiar with. We don’t need to treat them all like strings for the sake of the format (see sold-out in the XML example).
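
Here’s a small sketch of that difference, parsing both examples above with Python’s standard library. Nothing beyond the two documents is assumed; the variable names are mine.

```python
import json
import xml.etree.ElementTree as ET

xml_source = """
<product-description sold-out="true">
    This is the product description.
</product-description>
"""

json_source = """
{
    "product-description": {
        "sold-out": true,
        "content": "This is the product description."
    }
}
"""

xml_root = ET.fromstring(xml_source.strip())
xml_sold_out = xml_root.get("sold-out")   # attribute values always arrive as strings
xml_content = xml_root.text.strip()       # child text, minus the layout whitespace

json_root = json.loads(json_source)["product-description"]
json_sold_out = json_root["sold-out"]     # already a real boolean
json_content = json_root["content"]

print(type(xml_sold_out).__name__, type(json_sold_out).__name__)
```

With XML, turning "true" into an actual boolean is a decision the consuming code has to make. With JSON, the format has already made it.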

I personally am a fan of JSON, not because of some objective measure, but because I’m a JavaScript developer and it feels nice to use. Interestingly though, JSON seems to be one of the most common ways forward in storing rich text. While formats are still needed for offline documents (think DOCX, an XML spec, for Microsoft Word), the vast majority of documents created today are for sharing publicly on the World Wide Web, which we typically access with JavaScript-capable browsers. So XML does have a place, but there’s a growing argument to be made that since rich text is primarily for the browser, JSON is about as native of a cross-environment (read: not HTML) format to store it in as we’re going to find.

Notion

We’ve talked a fair bit now about attempts of the past — let’s talk a bit about the present and the future. I originally got the notion (see what I did there?) to write an article like this after examining my own workflow and how it’s progressed from what technical authors used years ago. And I found that the biggest improvement was Notion, the note-taking app that I use to write these articles. I dug into Notion’s API to figure out how they store the content of the rich text I’m writing right now. Take a look at what I get when I query for the heading of this section:

{
	"object": "list",
	"results": [
		{
			"object": "block",
			"id": "00000000-this-long-uuid-000000000000",
			"created_time": "2022-05-24T02:55:00.000Z",
			"last_edited_time": "2022-05-24T02:55:00.000Z",
			"created_by": {
				"object": "user",
				"id": "0another-very-long-uuid-000000000000"
			},
			"last_edited_by": {
				"object": "user",
				"id": "0another-very-long-uuid-000000000000"
			},
			"has_children": false,
			"archived": false,
			"type": "heading_2",
			"heading_2": {
				"rich_text": [
					{
						"type": "text",
						"text": {
							"content": "Notion",
							"link": null
						},
						"annotations": {
							"bold": false,
							"italic": false,
							"strikethrough": false,
							"underline": false,
							"code": false,
							"color": "default"
						},
						"plain_text": "Notion",
						"href": null
					}
				],
				"color": "default"
			}
		}
	],
	"next_cursor": null,
	"has_more": false
}

Okay, so definitely JSON.

The question is, how did they use the underlying technology to match the use case? Well, I noticed a couple of things:

1. Every object of data is identified with a UUID. Things like the actual content of the block don’t count; I’m talking about discrete objects like blocks themselves, users, pages, etc. This alone takes away one of the biggest disadvantages of building complex data structures in JSON by allowing you to reference other objects without repeating their content. It’s second only to how many query languages like GraphQL do it. How convenient it is then that NotionQL exists.
2. Maybe this is a small thing, but I liked that they didn’t just put a string at the annotations property that lists some set of annotation options, like bold-underline-italic. JSON inherits JavaScript’s simple, easy-to-understand booleans, so Notion has gone with the option of creating an annotations object, with each annotation option being its own boolean. That means they don’t have to worry about the order that the annotations are given in, and they don’t have to worry about future changes (like adding a superscript annotation option, for example) breaking everything.
3. One benefit of XML is that it requires tag names! It’s easy to gloss over that, but those tag names help define what each element actually is. Notion, here, has made sure that their JSON objects aren’t unlabelled. They actually have a consistent property on each discrete object (see #1 of this list) called object, which tells you what the object you’re reading actually is. Line 5 of the response above ("object": "block") tells you “you’re reading information about a block of content”, and line 10 ("object": "user") says “you’re now looking at an object that represents a user”. In XML, those would be <block> and <user> tags, so it’d be super clear, but Notion’s consistent application of this simple scheme gives JSON the same advantage.
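
To show how those three choices play out for a consumer, here’s a rough Python sketch that reads a trimmed copy of the response above. It’s not Notion’s client library; describe_block is a helper I made up, and it only relies on fields that appear in that response.

```python
import json

# A trimmed copy of the response shown above; only the fields used here are kept.
response = json.loads("""
{
  "object": "list",
  "results": [
    {
      "object": "block",
      "id": "00000000-this-long-uuid-000000000000",
      "created_by": {"object": "user", "id": "0another-very-long-uuid-000000000000"},
      "type": "heading_2",
      "heading_2": {
        "rich_text": [
          {
            "type": "text",
            "annotations": {
              "bold": false, "italic": false, "strikethrough": false,
              "underline": false, "code": false, "color": "default"
            },
            "plain_text": "Notion"
          }
        ]
      }
    }
  ]
}
""")


def describe_block(block):
    """Summarize one block using its `object` label, its `type`, and its text."""
    payload = block[block["type"]]          # the `type` value names the payload key
    spans = payload.get("rich_text", [])
    text = "".join(span["plain_text"] for span in spans)
    # Collect every boolean annotation that is switched on. A flag added later
    # (say, a hypothetical "superscript") would be picked up without code changes.
    active = sorted({
        name
        for span in spans
        for name, value in span["annotations"].items()
        if value is True
    })
    return {"object": block["object"], "type": block["type"],
            "text": text, "annotations": active}


summaries = [describe_block(block) for block in response["results"]]
# The author is a reference by UUID, not an embedded copy of the user.
author_ids = {block["created_by"]["id"] for block in response["results"]}
print(summaries, author_ids)
```

The helper never has to guess what it’s looking at: object says it’s a block, type says which kind, and the annotation flags can be read in any order.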

With these advantages, Notion has created a system that other tools might do well to adopt and even expand upon until it becomes a complete specification! I’m struggling to find any “illogical” parts here, per se, even if I personally could see the benefits of taking a different approach. For example, Notion expanded on JSON — an apt decision, given the block-based nature of the program — but they still take Markdown as input and output, so they’re still somewhat limited by what Markdown can support. I personally use a lot of toggle lists and side-by-side elements, neither of which are supported in Markdown but are in Notion. Regardless, Notion has set up an excellent system, and I’m really excited to see it evolve.

ExtraMark

Let’s take out the pin we put in Markdown earlier — and to recap, I was just complaining that Markdown specializes in representing rich text on the web, but doesn’t have enough marks to represent all the types of rich text on the web, limiting its use for its intended purpose. But the situation drastically improves when we just add some of those missing elements. The usefulness of Markdown off the web doesn’t change, but some new flavors of Markdown can make it perfect for mapping directly to HTML rich text.

Take for example, ExtraMark. It’s a superset of CommonMark, one of the most popular and generally agreed-upon flavors of Markdown. But ExtraMark took it a step further and started adding in other highly useful features. Here’s the list from their GitHub README:

Automatic typographic replacements
Tables
Anchors for headings (up to heading level 3)
Definition lists
Superscript
Subscript
Abbreviations
Footnotes
Critic Markup

Amazing. I can’t count how many times — despite personally despising working with tables — I’ve googled how to make one with Markdown and found myself stuck. Now, it’s possible! Definition lists are another feature that probably should be used more often (it’s semantically quite valuable, it’s just a bit obscure), and now that can be used in Markdown too!

I should add a little footnote to that last statement, though. We can technically use footnotes and subscript in Markdown, but whatever will be parsing or displaying that Markdown needs to support ExtraMark, and I’ve never actually seen that implemented. This repo has 4 GitHub stars; it’s not a common tool. And that’s a shame! ExtraMark is the most logical, yet still powerful, proposed spec for Markdown that I’ve seen yet! Because it’s completely compatible with CommonMark, if you need a Markdown parser for your next project, I recommend choosing this one! Everything you’ve written so far will still work, but now you’ll have all of these features at your fingertips.
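
To illustrate what “superset” means in practice, here’s a toy Python sketch of two extra inline marks layered on top of ordinary text. A loud caveat: I haven’t confirmed ExtraMark’s actual syntax, so this borrows Pandoc’s convention (^text^ for superscript, ~text~ for subscript) purely as an assumption, and render_extensions is a made-up helper, not part of any library.

```python
import re

# Assumed syntax, borrowed from Pandoc: ^text^ for superscript, ~text~ for subscript.
SUPERSCRIPT = re.compile(r"\^([^\s^]+)\^")
SUBSCRIPT = re.compile(r"(?<!~)~([^\s~]+)~(?!~)")   # leaves ~~double tildes~~ alone


def render_extensions(text):
    """Apply the two extra inline marks; everything else passes through untouched."""
    text = SUPERSCRIPT.sub(r"<sup>\1</sup>", text)
    return SUBSCRIPT.sub(r"<sub>\1</sub>", text)


plain = render_extensions("Plain *CommonMark* text is unchanged.")
formula = render_extensions("H~2~O and 2^10^")
doubled = render_extensions("~~old~~ new")
print(plain, formula, doubled, sep="\n")
```

Text that doesn’t use the new marks comes out exactly as it went in, which is the property that lets an extended flavor stay compatible with what you’ve already written.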

Structured Content

Let’s jump back to JSON now — personally, I think there’s only one format that has outdone Notion. I’d like to introduce you to Sanity’s Structured Content.

While it isn’t necessary to use Sanity to use Structured Content, it’s linked closely enough with their data storage platform to be an optional benefit instead of a mandatory burden. If you need somewhere to store that data that you’re transporting around in Structured Content, you can put it in Sanity knowing they’ll handle the formatting and everything for you.

Structured Content also includes a lot more than just rich text! It has built-in tools for (as the name suggests) structuring content — the models that your content follows live right alongside everything else! You can store logic and custom algorithms to modify your data when needed (side note, as wild as this sounds, this part is actually pretty common in rich text — think JavaScript dynamically modifying HTML, or those silly programmable animations in PowerPoint), you can easily loop in external services (also not unheard of — there are even workarounds for this in Markdown), and images are automatically customized when they’re displayed (this was always possible, but used to require an external service — Sanity bakes it into your rich text).

The best part (in my opinion) is that you can strictly use the rich text specification without all of the Sanity-specific features (where much of that logic actually runs). By itself, the spec is called Portable Text, and it’s incredibly well-supported by drivers and parsers developed by the Sanity team. They’ve put quite a bit of effort into not locking you into their platform. As of May 2022, they have publicly-available libraries that take Portable Text as input and spit out pure HTML, Markdown, Vue components, React components, Svelte components, and Hyperscript (I haven’t seen this until now, but it looks amazing). If you’re working in other programming languages, they’ve got you covered too — not that I ever want to work in PHP again, but if I did, at least I’d be comforted knowing I could use Portable Text.
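
To give a feel for the format, here’s a minimal Python sketch that turns a couple of Portable Text blocks into HTML. This is not one of Sanity’s libraries, just a hand-rolled illustration. The sample blocks are invented, and the sketch only covers a small subset I’m assuming for the demo: a few block styles and decorator marks, with link annotations, lists, and custom types ignored.

```python
import html

# Decorator marks mapped to HTML tags (a small, assumed subset).
DECORATORS = {"strong": "strong", "em": "em", "code": "code"}
# Block styles mapped to HTML tags; "normal" is the plain paragraph style.
STYLES = {"normal": "p", "h1": "h1", "h2": "h2", "h3": "h3", "blockquote": "blockquote"}


def render_span(span):
    """Escape a span's text and wrap it in a tag for each known decorator."""
    out = html.escape(span.get("text", ""))
    for mark in span.get("marks", []):
        tag = DECORATORS.get(mark)
        if tag:                      # marks this sketch doesn't know are skipped
            out = f"<{tag}>{out}</{tag}>"
    return out


def render_block(block):
    """Render one text block using its style and its child spans."""
    tag = STYLES.get(block.get("style", "normal"), "p")
    inner = "".join(render_span(child) for child in block.get("children", []))
    return f"<{tag}>{inner}</{tag}>"


def to_html(blocks):
    """Render every text block, one per line; other block types are ignored."""
    return "\n".join(render_block(b) for b in blocks if b.get("_type") == "block")


# Invented sample content in the Portable Text shape.
blocks = [
    {"_type": "block", "_key": "a1", "style": "h2", "markDefs": [], "children": [
        {"_type": "span", "_key": "a1-0", "text": "Structured ", "marks": []},
        {"_type": "span", "_key": "a1-1", "text": "Content", "marks": ["strong"]},
    ]},
    {"_type": "block", "_key": "b2", "style": "normal", "markDefs": [], "children": [
        {"_type": "span", "_key": "b2-0", "text": "Plain & simple.", "marks": []},
    ]},
]

rendered = to_html(blocks)
print(rendered)
```

Because the content is just labelled data, swapping the two lookup tables for Markdown syntax or component names would give you a different output format from the same input.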

Looking back

Well, this was quite an in-depth article. I’m already at nearly three thousand words, but I think we spent them well; diving into the history of rich text formats is a useful exercise, given that it helps us understand the unsolved problems and shortcomings so that we can work to improve on them.

Realistically, many of us may not be involved in creating one of the rich text specs of the future, but in all likelihood, we’ll have to choose one to use, and the cursory understanding we reviewed here will help inform those decisions. Maybe plain-old XML, Markdown, and JSON are fading away, but what format will you use to pass around rich text in your next project? If you want my advice — and you read this far, so why not — reach for ExtraMark if you’re going the Markdown route or Portable Text for everything else.

Thanks for reading, and I’m looking forward to seeing what you create.
