Protecting Node.js Source Code with Bytecode: How It Works

Technical Director, 上海和今信息科技有限公司

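For commercial software, preventing reverse engineering and tampering at release time is a common requirement. As a scripting-language runtime, Node.js is inherently weak in this area. This article explores one approach: protecting the source by shipping the bytecode that the V8 engine produces after compilation.

This article was inspired by the bytenode project, with thanks.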

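Basics

1. What is bytecode?

Qualitatively, bytecode is an intermediate representation produced by compiling source code. It resembles assembly, but it runs inside a language-specific virtual machine, so its instructions are not machine instructions; they are platform-independent virtual machine instructions implemented by the VM.

In one sentence: bytecode is a more abstract assembly that runs inside a virtual machine.

2. Why can bytecode protect source code, and is it fundamentally different from code obfuscation?

Bytecode has gone through the full compilation pipeline, which strips away the extra semantic information carried by the source. Its reverse-engineering difficulty is comparable to that of traditional compiled languages.

Obfuscation is merely a smokescreen, and it cannot stop a tamperer from inserting probes and hooks into the obfuscated source.

The reverse-engineering difficulty of the two differs by orders of magnitude.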

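Main Topic

1. Bytecode in V8

How V8 executes JS code and what V8 bytecode looks like will not be repeated here; see this translated article by the tireless 火狐哥哥:

From that article we learn:

During V8's execution pipeline, code is compiled into bytecode.
Bytecode is not a publicly exposed feature; its format is not standardized and is tightly coupled to the V8 version.
If bytecode is fed directly to V8, the Parser and AST stages can be skipped, which yields some performance gain.

2. Hello, Byte Code

In this part we compile an ordinary piece of JavaScript into bytecode, dump it to disk, and then run it without the source.

The code in this article can be found in this repo; the Node version is v12.

Let's first get to know the vm module that Node provides.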

The vm module enables compiling and running code within V8 Virtual Machine contexts. JavaScript code can be compiled and run immediately or compiled, saved, and run later.

In short, the vm module provides a set of APIs for driving V8's compiler and V8's virtual machine. Documentation link
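To make the vm API concrete before using it on files, here is a minimal sketch of the documented code-cache round trip: compile a script, serialize its cache with createCachedData(), and hand that cache back when compiling the same source again. The source string and the file name demo.js are placeholders invented for this illustration, not part of the original article.

```javascript
const vm = require('vm');

// Hypothetical snippet used only for this demonstration.
const source = '[1, 2, 3].map((n) => n * 2).join(",")';

// Compile once, run it, and serialize V8's code cache into a Buffer.
const script = new vm.Script(source, { filename: 'demo.js' });
const firstResult = script.runInThisContext();
const cachedData = script.createCachedData();

// Recompile the same source while handing the cache back to V8.
const cachedScript = new vm.Script(source, { filename: 'demo.js', cachedData });
const secondResult = cachedScript.runInThisContext();

// cachedDataRejected tells us whether V8 accepted the cache.
console.log(firstResult, secondResult, cachedScript.cachedDataRejected);
```

In this round trip the real source is still supplied on the second compile, so V8 has no reason to reject the cache. The rest of the article is about what happens when the source is withheld.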

First, write a plain and simple hello.js:

function sayHello(more = []) {
  console.log(['Hello', 'Byte Code', ...more].join(', '));
}

sayHello();

Then write compile.js, which calls vm.Script to generate the bytecode and dumps it to disk:

const vm = require('vm');
const fs = require('fs').promises;
const v8 = require('v8');
v8.setFlagsFromString('--no-lazy');

async function compileFile(filePath) {
  const code = await fs.readFile(filePath, 'utf-8');
  const script = new vm.Script(code);
  const bytecode = script.createCachedData();
  await fs.writeFile(filePath.replace(/\.js$/i, '.bytecode'), bytecode);
}

compileFile(process.argv[2]);

$ node compile.js hello.js

We now have hello.bytecode, but node cannot run it directly; we still need to write a loader for it.

loader.js

const fs = require('fs').promises;
const vm = require('vm');
const v8 = require('v8');
v8.setFlagsFromString('--no-flush-bytecode');

async function loadBytecode(filePath) {
  const script = new vm.Script('', {
    cachedData: await fs.readFile(filePath, null)
  });

  if (script.cachedDataRejected) {
    throw new Error('something is wrong');
  }
  return script;
}

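if (process.mainModule.filename === __filename) {
  // loadBytecode is async, so wait for it before running the script.
  loadBytecode(process.argv[2])
    .then((script) => script.runInThisContext())
    .catch((err) => console.error(err.message));
}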

module.exports.loadBytecode = loadBytecode;

Looks good. Run it.

$ node loader.js hello.bytecode

Oops, it prints "something is wrong".

3. Dive into V8

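Now comes the most troublesome part. V8 actually validates the bytecode it is handed, so the approach above results in cachedDataRejected: V8 considers this bytecode invalid.

Let's look into the V8 source to find out why.

In code-serializer.h, a comment describes what the header of the bytecode contains.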

  // The data header consists of uint32_t-sized entries:
  // [0] magic number and (internally provided) external reference count
  // [1] version hash
  // [2] source hash
  // [3] flag hash
  // [4] number of reservation size entries
  // [5] payload length
  // [6] payload checksum part A
  // [7] payload checksum part B
  // ...  reservations
  // ...  code stub keys
  // ...  serialized payload

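We need to read the actual code to work out what these fields mean.

After some searching, we land on this line.

Let's see what this SanityCheck does.

Reading the surrounding code, we can list the fields that may affect validity.

Field          Meaning
magic number   (none)
version hash   V8 version
source hash    Length of the source code string
flag hash      V8 startup flags

The other fields are easy to understand. The special one is flag hash: it depends on the flags of the running node process, so it is not a constant, and this too causes the bytecode we dumped to disk to be rejected.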
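The field list can be checked empirically. The following is a minimal sketch, assuming the first four header entries sit at byte offsets 0, 4, 8 and 12 as in the header comment quoted earlier, and assuming a little-endian host. The helper names readHeader and headerOf are made up for this illustration. It compiles two unrelated scripts in the same process and compares their headers.

```javascript
const vm = require('vm');

// Read the first four uint32 header entries of a cachedData buffer.
// Assumption: little-endian layout, as on x86-64 and most ARM hosts.
function readHeader(cachedData) {
  return {
    magic: cachedData.readUInt32LE(0),
    versionHash: cachedData.readUInt32LE(4),
    sourceHash: cachedData.readUInt32LE(8),
    flagHash: cachedData.readUInt32LE(12),
  };
}

// Hypothetical helper: compile a source string and return its header.
function headerOf(source) {
  return readHeader(new vm.Script(source).createCachedData());
}

const a = headerOf('1 + 1');
const b = headerOf('console.log("a longer, different script");');

console.log(a);
console.log(b);
// Both were produced by the same V8, so these two fields should match.
console.log('same magic:', a.magic === b.magic);
console.log('same version hash:', a.versionHash === b.versionHash);
```

If the analysis above holds for your Node version, the sourceHash values should track the lengths of the two source strings, while magic and versionHash stay the same for every script compiled by the same binary.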

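That leaves three problems:

(1) We need to know the length of the source that the bytecode was compiled from, and pass an arbitrary string of that length when loading it

This one is relatively easy: we can read source hash straight from the bytecode header and fabricate a string of that length.

(2) The flag hash of the bytecode must match that of the current process

Compile any code to bytecode in the current process, take the flag hash from the result, and patch it into the bytecode we want to load.

(3) We need utility functions that can read and write the bytecode header

Luckily the header itself is not covered by a checksum :D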

So the new loader.js looks like this:

const fs = require('fs')
const vm = require('vm');
const v8 = require('v8');
v8.setFlagsFromString('--no-flush-bytecode');

const HeaderOffsetMap = {
  'magic': 0,
  'version_hash': 4,
  'source_hash': 8,
  'flag_hash': 12
};

let _flag_buf;

function getFlagBuf() {
  if (!_flag_buf) {
    const script = new vm.Script("");
    _flag_buf = getHeader(script.createCachedData(), 'flag_hash');
  }
  return _flag_buf;
}

function getHeader(buffer, type) {
  const offset = HeaderOffsetMap[type];
  return buffer.slice(offset, offset + 4);
}

function setHeader(buffer, type, vBuffer) {
  vBuffer.copy(buffer, HeaderOffsetMap[type]);
}

function buf2num(buf) {
  // Mind the byte order: this assembles the value as little-endian
  let ret = 0;
  ret |= buf[3] << 24;
  ret |= buf[2] << 16;
  ret |= buf[1] << 8;
  ret |= buf[0];

  return ret;
}

function loadBytecode(filePath) {
  const bytecode = fs.readFileSync(filePath, null);

  setHeader(bytecode, 'flag_hash', getFlagBuf());

  const sourceHash = buf2num(getHeader(bytecode, 'source_hash'));
  const script = new vm.Script(' '.repeat(sourceHash), {
    cachedData: bytecode
  });

  if (script.cachedDataRejected) {
    throw new Error('something is wrong');
  }
  return script;
}

if (process.mainModule && process.mainModule.filename === __filename) {
  const script = loadBytecode(process.argv[2]);
  script.runInThisContext();
}

module.exports.loadBytecode = loadBytecode;

Run it, and it works.

One detail:

Node's v8 module actually exports a function, v8.cachedDataVersionTag(), documented as:

Returns an integer representing a "version tag" derived from the V8 version, command line flags and detected CPU features.
This is useful for determining whether a vm.Script cachedData buffer is compatible with this instance of V8.

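Its implementation is:

SanityCheck does not seem to validate any CPU-related field, and the bytecode header defines none either. My guess is that this is reserved for Optimized Machine Code; I will look into it when I have time.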
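One way the version tag could be put to use is as a cheap compatibility check before attempting to load a bytecode file. This is a sketch only, not something the original loader does. The 4-byte prefix format and the helper names packWithTag and unpackIfCompatible are assumptions invented for this illustration.

```javascript
const v8 = require('v8');

// Hypothetical helper: prefix a bytecode buffer with the tag of the V8
// instance that produced it (4 bytes, little-endian).
function packWithTag(bytecode) {
  const tagBuf = Buffer.alloc(4);
  tagBuf.writeUInt32LE(v8.cachedDataVersionTag() >>> 0, 0);
  return Buffer.concat([tagBuf, bytecode]);
}

// Hypothetical helper: return the bytecode only if the stored tag matches
// the tag of the V8 instance we are running on, otherwise null.
function unpackIfCompatible(packed) {
  const savedTag = packed.readUInt32LE(0);
  if (savedTag !== (v8.cachedDataVersionTag() >>> 0)) {
    return null; // different V8 version, flags, or CPU features
  }
  return packed.subarray(4);
}

const packed = packWithTag(Buffer.from([1, 2, 3]));
console.log(unpackIfCompatible(packed)); // the original three bytes
```

Because the tag also depends on command line flags, a mismatch here is stricter than the header patching used in loader.js: it would refuse a file whose flag hash the loader could otherwise have patched.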

4. Last Piece

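In real-world scenarios an application is not a single js file but a complex system wired together by require. Once it is compiled to bytecode, how do we handle those relationships?

(1) Handling CommonJS modules

When a module's code is loaded, Node automatically wraps it in a wrapper function; see the documentation for details.

For .js files, the wrapping happens at require time by manipulating the module source string that was read in. For bytecode, it is more appropriate to do it at compile time.

So compile.js needs an upgrade: add the wrapper function manually so that module exports are supported.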

const vm = require('vm');
const fs = require('fs').promises;
const _module = require('module');

async function compileFile(filePath) {
  const code = await fs.readFile(filePath, 'utf-8');
  const script = new vm.Script(_module.wrap(code));
  const bytecode = script.createCachedData();
  await fs.writeFile(filePath.replace(/\.js$/i, '.bytecode'), bytecode);
}

compileFile(process.argv[2]);

Note: the bytecode of hello.js produced this way will not print Hello Byte Code when loaded directly, because it is equivalent to the following code:

(function(exports, require, module, __filename, __dirname) {
  function sayHello(more = []) {
    console.log(['Hello', 'Byte Code', ...more].join(', '));
  }
  sayHello();
})

The CommonJS export/import mechanism involved here can be confusing; 朴老师's book 《深入浅出Node.js》 is a recommended reference.
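The effect of the wrapper can be seen without any bytecode at all. Below is a minimal sketch in which the module body, the fake module object, and the path /tmp/demo.js are all placeholders made up for this illustration. Running the wrapped script yields only a function; the body executes when that function is called with the five CommonJS arguments.

```javascript
const vm = require('vm');
const Module = require('module');

// Hypothetical module body used only for this demonstration.
const body = 'module.exports.answer = 42;';

// Module.wrap surrounds the body with the CommonJS wrapper function.
const wrapped = Module.wrap(body);

// Running the wrapped script merely evaluates a function expression...
const wrapperFn = new vm.Script(wrapped).runInThisContext();

// ...so the body runs only once that function is invoked.
const fakeModule = { exports: {} };
wrapperFn.call(
  fakeModule.exports, // `this` inside the module
  fakeModule.exports, // exports
  require,            // require
  fakeModule,         // module
  '/tmp/demo.js',     // __filename (placeholder)
  '/tmp'              // __dirname (placeholder)
);

console.log(typeof wrapperFn, fakeModule.exports.answer);
```

This is why a loader for wrapped bytecode has to do two things: run the script to obtain the wrapper, then call the wrapper with a real module object.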

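(2) Handling require

We extend require so that bytecode files are loaded transparently.

There is one premise: file extensions should be omitted, as in require('./foobar'). If you already use TS, you should be very familiar with this convention.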

hook-require.js

const _module = require('module');
const path = require('path');

const { loadBytecode } = require('./loader');

_module._extensions['.bytecode'] = function (module, filename) {
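  const script = loadBytecode(filename);
  const wrapperFn = script.runInThisContext({
    filename: filename
  });

  // Resolve nested require() calls relative to the loaded module,
  // not relative to hook-require.js.
  const moduleRequire = (id) => module.require(id);

  // The argument list matches the parameters of the wrapper function one-to-one.
  wrapperFn.bind(module.exports)(module.exports, moduleRequire, module, filename, path.dirname(filename));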
}

We also prepare a new test-require.js

require('./hook-require');
const hello = require('./hello.bytecode');

console.log(hello);

hello.sayHello(['required']);

and a new hello.js

function sayHello(more = []) {
  console.log(['Hello', 'Byte Code', ...more].join(', '));
}

module.exports.sayHello = sayHello;
module.exports.stringExport = "foobar";

sayHello();

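At this point we have the basic flow working for shipping V8 bytecode instead of source code.

I will write a follow-up article, which will roughly cover

Why we chose the bytecode approach, with a comparative analysis against pkg and similar solutions
Pitfalls of using bytecode in real engineering projects, and their solutions
If I don't flake, some thoughts on what to consider when making similar technology choices

Thanks for reading.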
