
In Defense of YAML


If you follow me on Twitter, you may think I hate YAML.

I'm not against YAML, just against abuse of YAML. I want to help prevent people from abusing YAML and being cruel to themselves and their coworkers in the process.

YAML's strength is as a structured data format. Yes, it has issues. Whitespace is a minefield. Its syntax is surprisingly complex. It has gotchas: "Anyone who uses YAML long enough will eventually get burned when attempting to abbreviate Norway." But YAML is human readable and supports comments: two key benefits that drive its popularity.
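
To make that gotcha concrete: under YAML 1.1 rules, which many parsers still follow, an unquoted NO is resolved as a boolean rather than a string.

countries:
  - GB      # the string "GB"
  - IE      # the string "IE"
  - NO      # surprise: the boolean false under YAML 1.1; quote it as "NO" to keep the country code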

Where it goes wrong is when we use YAML to describe behavior.

Consider some examples from the CI domain. This isn't the only domain in which YAML is abused this way, but it's among the worst offenders.

Take GitLab's pipeline definition for delivering itself: a 1,170-line(!) YAML file rife with sections like this:

gitlab:assets:compile:
  <<: *dedicated-no-docs-pull-cache-job
  image: dev.gitlab.org:5005/gitlab/gitlab-build-images:ruby-2.5.3-git-2.18-chrome-71.0-node-8.x-yarn-1.12-graphicsmagick-1.3.29-docker-18.06.1
  dependencies:
    - setup-test-env
  services:
    - docker:stable-dind
  variables:
    NODE_ENV: "production"
    RAILS_ENV: "production"
    SETUP_DB: "false"
    SKIP_STORAGE_VALIDATION: "true"
    WEBPACK_REPORT: "true"
    # we override the max_old_space_size to prevent OOM errors
    NODE_OPTIONS: --max_old_space_size=3584
    DOCKER_DRIVER: overlay2
    DOCKER_HOST: tcp://docker:2375
  script:
    - node --version
    - yarn install --frozen-lockfile --production --cache-folder .yarn-cache
    - free -m
    - bundle exec rake gitlab:assets:compile
    - time scripts/build_assets_image
    - scripts/clean-old-cached-assets
  artifacts:
    name: webpack-report
    expire_in: 31d
    paths:
      - webpack-report/
      - public/assets/
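
The <<: line at the top of that job is a YAML merge key: it splices in every key of a mapping anchored elsewhere in the file (the real contents of dedicated-no-docs-pull-cache-job live hundreds of lines away). The mechanism in miniature, with made-up values:

.defaults: &defaults
  retry: 2
  tags:
    - docker

build:
  <<: *defaults        # merges retry and tags into this job
  script:
    - make build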

Note the script block containing a list of shell commands. Does this look like data? Is this the right model for specifying execution?

There are many similar cases. Here is a fragment from an example of Tekton, a newish Kubernetes-based delivery solution:

apiVersion: tekton.dev/v1alpha1
kind: Task
metadata:
  name: build-push
spec:
  inputs:
    resources:
    - name: workspace
      type: git
    params:
    - name: pathToDockerFile
      description: The path to the dockerfile to build
      default: /workspace/workspace/Dockerfile
    - name: pathToContext
      description: The build context used by Kaniko (https://github.com/GoogleContainerTools/kaniko#kaniko-build-contexts)
      default: /workspace/workspace
  outputs:
    resources:
    - name: builtImage
      type: image
  steps:
  - name: build-and-push
    image: gcr.io/kaniko-project/executor
    command:
    - /kaniko/executor
    args:
    - --dockerfile=${inputs.params.pathToDockerFile}
    - --destination=${outputs.resources.builtImage.url}
    - --context=${inputs.params.pathToContext}

Ouch. Variables. Qualified names. Arguments. This is not structured data. This is programming masquerading as configuration.

Haven't we met concepts like variables and successive instructions before? Why clumsily reinvent imperative programming? What about modularity and testability? What about toolability, which we'd get for free with a programming language? Why reinvent exception handling, which is rigorously defined in modern languages? What about logical operations, let alone more advanced and elegant FP or OOP concepts?
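
For contrast, here is a rough sketch of the same build-and-push step as ordinary code. This is not any real tool's API: the run helper and its signature are assumptions made purely for illustration. The point is that parameters, defaults, string interpolation and reuse come for free from the language.

// Hypothetical sketch only: "run" is an assumed helper that executes a
// command inside a container image; it is not a real Tekton or Kaniko API.
declare function run(image: string, command: string, args: string[]): Promise<void>;

async function buildAndPush(
  destination: string,
  pathToDockerFile = "/workspace/workspace/Dockerfile",
  pathToContext = "/workspace/workspace",
): Promise<void> {
  // Ordinary variables and interpolation replace the ${inputs.params.*} plumbing.
  await run("gcr.io/kaniko-project/executor", "/kaniko/executor", [
    `--dockerfile=${pathToDockerFile}`,
    `--destination=${destination}`,
    `--context=${pathToContext}`,
  ]);
}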

The best argument in favor of such YAML-based syntax is that it's an external DSL, enforcing a beneficial structure. However, even this doesn't stack up, for several reasons:

  • The prescriptive structure is largely an illusion. The bulk of the work is pushed into shell scripts, like those invoked from the GitLab example, which have no structure beyond the environment they run in. In practice it's the Wild West.
  • If a step is missing in the design of the DSL, you hit a wall. For example, CI tools typically model delivery phases as YAML stanzas. If you need a phase the tool didn't anticipate, you're probably out of luck.
  • YAML is a poor format for an external DSL, just as XML was. The popular configuration format du jour is always misused this way.

You probably don't want an external DSL, anyway: something we learnt the hard way at Atomist.

External DSLs...are like puppies, they all start out cute and happy, but without exception turn into vicious beasts as they grow up.

Modern programming languages are flexible enough to make internal DSLs more and more compelling, with far superior tooling and extensibility.
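
As a rough illustration of what that can look like, and emphatically a sketch rather than a real library, pipeline stages can be plain values and functions in the host language, so they get types, tests, refactoring and debugging for free:

// Hypothetical internal-DSL sketch: stages are ordinary values and the
// "pipeline" is an ordinary function, so normal tooling applies throughout.
interface Stage {
  name: string;
  run: () => Promise<void>;
}

const stage = (name: string, run: () => Promise<void>): Stage => ({ name, run });

async function pipeline(...stages: Stage[]): Promise<void> {
  for (const s of stages) {
    console.log(`running ${s.name}`);
    await s.run(); // real exception handling, conditionals and loops, no reinvention
  }
}

// The "configuration" is just a program: compose, parameterize and unit test it.
async function main(): Promise<void> {
  await pipeline(
    stage("compile-assets", async () => { /* call your build code here */ }),
    stage("push-image", async () => { /* call your deploy code here */ }),
  );
}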

Trying to use a data format as a programming language is wrong. Pointing that out says nothing about the merits of the data format for its intended purpose.

YAML as a data format is defensible. YAML as a programming language is not. If you're programming, use a programming language. You owe it to Turing, Hopper, Dijkstra and the countless other computer scientists and practitioners who've built our discipline. And you owe it to yourself.

