CodeQL from 0 to 0.1

March 2022, #code-audit #codeql #static analysis

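This post records what I have picked up while learning and using CodeQL. It starts with general CodeQL knowledge, such as how to write the different kinds of query and how to use the CodeQL CLI, and then covers the CodeQL for JavaScript libraries.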

CodeQL General

Metadata for CodeQL queries

Metadata is a comment block at the very top of a ql file that describes the query. (Note: every character in @id must be lowercase.)

/**
 * @name backExtractBlockerFromPvn
 * @description backExtractBlockerFromPvn
 * @kind path-problem 
 * @problem.severity warning
 * @tags security
 * @id js/back-extract-blocker-from-pvn
 */

query kind

@kind indicates the type of the query. Two kinds are common:

Alert queries: queries that highlight issues in specific locations in your code.

Path queries: queries that describe the flow of information between a source and a sink in your code.

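An alert query displays the nodes found by the query together with a descriptive message. A path query displays the complete path along which data flows from a source to a sink. When running queries through the CLI, an alert query must declare @kind problem in its metadata and a path query must declare @kind path-problem. In VS Code, running a path query without this metadata raises an error.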

There are also diagnostic queries (@kind diagnostic) and summary queries (@kind metric), which are used less often.

Write a query

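Each kind of query expects its results in a different format. When running queries through the CLI, the select clause must match that format, otherwise no results are produced.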

alert query

For an alert query, the select clause has two parts: an element and a string. The element is the node to report and the string is the message describing it.

For example, take the JavaScript file below and select the function parameters that may flow into exec.

const exec = require("child_process").exec

function func1(data){
    exec(data);
}

var input = userInput();
func1(input);

Alert query

/**
 * @name testForAlertQuery
 * @kind problem
 * @problem.severity warning
 * @tags correctness
 * @id js
 */

import javascript
import DataFlow

from ParameterNode pn, SourceNode exec
where exec = DataFlow::moduleMember("child_process", "exec") and 
pn.getASuccessor*() = exec.getACall().getArgument(0)
select pn, "Function " + pn.getParameter().getEnclosingFunction().getName() 
+ " parameter " + pn.getName() + " flows to exec function."

Here pn is the function parameter that flows to exec, and the string spells out which function (func1) and which parameter (data) is involved.

path query

For a path query, the select clause has four parts: element, source, sink, and string. The element and string mean the same as in an alert query, while source and sink are the PathNodes for the source and sink of the data flow analysis.

/**
 * @name testForPathQuery
 * @kind path-problem
 * @problem.severity warning
 * @tags correctness
 * @id js
 */

import javascript
import DataFlow
import PathGraph

class ExecConfiguration extends Configuration {
    ExecConfiguration() { this = "ExecConfiguration" }
    override predicate isSource(DataFlow::Node source) {
        exists(CallExpr pn |
            pn.getCalleeName() = "userInput" |
            source.asExpr() = pn
        )
    }
    override predicate isSink(DataFlow::Node sink) {
        DataFlow::moduleMember("child_process", "exec").getACall().getArgument(0) = sink
    }
  }

from ExecConfiguration cfg, PathNode source, PathNode sink
where cfg.hasFlowPath(source, sink)
select sink.getNode(), source, sink, "User input flows to exec function."

The results also include the path from source to sink.

extra element

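So far both the alert query and the path query select only one element. Extra text can be appended to the string, but often we want the elements themselves, because they carry location information and the like.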

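CodeQL allows $@ as a placeholder in the message string. For every $@ in the string, append one element/string pair to the select clause, in order. The string replaces the placeholder in the displayed message and the element is the node it refers to.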

For example, the alert query above can be changed to:

/**
 * @name testForAlertQuery
 * @kind problem
 * @problem.severity warning
 * @tags correctness
 * @id js
 */

import javascript
import DataFlow

from ParameterNode pn, SourceNode exec
where exec = DataFlow::moduleMember("child_process", "exec") and 
pn.getASuccessor*() = exec.getACall().getArgument(0)
select pn, "Function $@ parameter $@" + " flows to exec function.", 
pn.getParameter().getEnclosingFunction(), pn.getParameter().getEnclosingFunction().getName(),
pn.getParameter(), pn.getName()

The results now contain the extra elements.
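To make the pairing rule concrete, here is a toy model in JavaScript. It is not CodeQL's implementation; the function renderMessage and the element strings are made up for illustration. It only shows that the n-th $@ consumes the n-th element/string pair.

```javascript
// Toy model (NOT CodeQL's implementation): pair each $@ in a message
// with the next (element, label) pair, in order.
function renderMessage(message, pairs) {
  let index = 0;
  const links = [];
  const text = message.replace(/\$@/g, () => {
    const pair = pairs[index++];
    if (pair === undefined) {
      throw new Error("more $@ placeholders than element/string pairs");
    }
    links.push(pair.element); // the element the placeholder refers to
    return pair.label;        // the string shown in place of $@
  });
  return { text, links };
}

// Hypothetical elements standing in for the function and parameter nodes.
const rendered = renderMessage("Function $@ parameter $@ flows to exec function.", [
  { element: "function func1", label: "func1" },
  { element: "parameter data", label: "data" },
]);
console.log(rendered.text);
```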

CodeQL CLI

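The CodeQL CLI is a closed-source command-line tool that builds databases and runs QL queries against them. It can be used to create query databases and to test ql files in bulk. Installation is simple: download the build for your platform and add it to your PATH. Then download the CodeQL library ql files and place them in a directory next to the CodeQL CLI. By default the CLI searches sibling directories (and their subdirectories) for QL packs.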

Creating a database for scripting languages such as JavaScript and Python is straightforward:

codeql database create --language=javascript --source-root <folder-to-extract> databaseName

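For compiled languages such as Java, the --command option must also be given so that the source code can be built. Once a database exists, queries are run against it with: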

codeql database analyze <database> --format=<format> --output=<output> <queries>

QL Packs

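A QL pack is a collection of ql files organized in a particular structure. The official CodeQL repository provides library QL packs for C/C++, C#, Java, JavaScript, Python, and others, and we can also package useful ql files as a library for others to use.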

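A QL pack needs a qlpack.yml file in its root directory. This file describes the pack's language, its dependencies on other packs, and so on. Anything in a QL pack loaded by the CodeQL CLI can be imported by its directory path, taking the directory that contains qlpack.yml as the root, as in the official example.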

The properties allowed in qlpack.yml are listed in the official documentation.

For an ordinary query pack, libraryPathDependencies lists the packs it depends on. codeql/javascript-all is the name of the official JavaScript library pack.

name: my-queries
version: 0.0.0
libraryPathDependencies: codeql/javascript-all

QL test

When building a library we usually want tests, and CodeQL supports this: it runs the relevant ql files in bulk and compares the results with the expected files.

First create a test QL pack to hold the test files. Its qlpack.yml needs the following:

name: <name-of-test-pack>
version: 0.0.0
libraryPathDependencies: <codeql-libraries-and-queries-to-test>
extractor: <language-of-code-to-test>

Here libraryPathDependencies lists the packs that the test pack depends on, in other words the packs under test.

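Then create subdirectories in the test pack, each containing a .qlref file, a .expected file, and the code to test. The .qlref file gives the location of the ql file to run, and the .expected file holds the output expected from running it.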

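When the test command runs, codeql first builds a database from the test code in that directory, then runs the ql file referenced by the .qlref file and compares the results with the .expected file to decide whether the test passes.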

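Note that the .qlref and .expected files must share the same base name. If the ql file sits in the same directory, the .qlref file can be omitted, but the ql file must then have the same base name too.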

An example test.

CodeQL For JavaScript

Basic library for javascript

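This section introduces some commonly used parts of the official JavaScript library, mainly the data flow libraries, which are what we rely on most for security testing.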

data flow node

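Here are some nodes in the data flow graph. A DataFlow node usually corresponds to a node in the AST. The official library also offers shortcuts for obtaining global variables or imported module members as data flow nodes.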

DataFlow::globalVarRef("document")
DataFlow::moduleMember("fs", "readFile")

local data flow

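Intraprocedural (local) data flow is easy to use in CodeQL. DataFlow::Node provides two predicates, getAPredecessor and getASuccessor, which return the nodes that flow into this node and the nodes this node flows to within a single function. Since the result is again a DataFlow::Node, the calls can be chained with nd.getASuccessor*() or nd.getASuccessor+() to reach every later node (* means zero or more steps, + means one or more).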
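As a rough illustration, the hypothetical target file below marks which assignments I would expect to be local steps and where the flow leaves the function. The function names are invented, and the comments reflect my understanding of local flow rather than verified query output.

```javascript
// Hypothetical analysis target; all names are illustrative only.
function wrap(value) {
  const copy = value;  // local step: value -> copy
  const alias = copy;  // local step: copy -> alias
  return alias;        // still inside wrap, reachable via getASuccessor*()
}

function main(input) {
  const first = input;        // local step inside main
  const result = wrap(first); // argument -> parameter crosses a function boundary,
                              // so chaining getASuccessor from `first` is not
                              // expected to continue into wrap()
  return result;
}

console.log(main("payload"));
```

Following the flow from first into value and back out through the return value is the job of the global analysis described next.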

global data flow

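CodeQL for JavaScript provides the Configuration class to configure global (interprocedural) data flow analysis. The comment at the top of its definition file, semmle\javascript\dataflow\Configuration.qll, gives a rough idea of how it works: CodeQL implements a summary-based interprocedural data flow analysis that tracks flow through variables and some object properties across functions, and follows function calls by means of function summaries.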

class MyDataFlowConfiguration extends DataFlow::Configuration {
  MyDataFlowConfiguration() { this = "MyDataFlowConfiguration" }

  override predicate isSource(DataFlow::Node source) { /* ... */ }

  override predicate isSink(DataFlow::Node sink) { /* ... */ }

  // optional overrides:
  override predicate isBarrier(DataFlow::Node nd) { /* ... */ }
  override predicate isBarrierEdge(DataFlow::Node pred, DataFlow::Node succ) { /* ... */ }
  override predicate isBarrierGuard(BarrierGuardNode guard) { /* ... */ }
  override predicate isAdditionalFlowStep(DataFlow::Node pred, DataFlow::Node succ) { /* ... */ }
}

The isSource and isSink predicates constrain where the analysis starts and ends. Examples appear above.

Input is usually secured in one of two ways: sanitization or validation.

// sanitization
var data = sanitize(input);


// validation
if(checkInput(input)){
    ...
}

if(input === "whoami"){
    ...
}

var arr = ["1", 2];
if(arr.includes(input)){
    ...
}

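isBarrier handles sanitizing functions. Its parameter is the node at which flow should be cut: for a node that satisfies the predicate, data that reaches it is not propagated any further.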

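For example, to model a sanitizing function like sanitize above (the query code uses the name sanitized1), a barrier such as the following says that no call to that function passes data onward. (Note: isBarrierGuard is applied through the parent isBarrier, so overriding isBarrier without calling the parent predicate disables any isBarrierGuard configuration.)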

  override predicate isBarrier(DataFlow::Node nd) {
    super.isBarrier(nd)
    or
    nd.(CallNode).getCalleeName() = "sanitized1"
  }

For input validation, isBarrierGuard models a conditional cut in the data flow.

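The official definition of isBarrierGuard is shown below. A BarrierGuardNode instance must define a restricting predicate that tells CodeQL, when the node appears as a condition, which node stops propagating in the then or else branch.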

 /**
   * Holds if data flow node `guard` can act as a barrier when appearing
   * in a condition.
   *
   * For example, if `guard` is the comparison expression in
   * `if(x == 'some-constant'){ ... x ... }`, it could block flow of
   * `x` into the "then" branch.
   */

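Implementing a BarrierGuardNode is simple: define a class that extends BarrierGuardNode and implement a predicate named blocks. It takes two parameters: when this guard node is used as a condition and evaluates to outcome, the node corresponding to e is filtered in the branch taken for that outcome (the then branch when outcome is true).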

/**
 * if(checkInput(input)){
 *   ...
 * }
 * When a CallNode is used as a condition and the called function is named checkInput,
 * a true return value filters the call's first argument in the then branch.
 */
class CheckInputBarrierGuardNode extends BarrierGuardNode, CallNode {
  CheckInputBarrierGuardNode() { this.getCalleeName() = "checkInput" }

  override predicate blocks(boolean outcome, Expr e) {
    outcome = true and
    e = getArgument(0).asExpr()
  }
}

/**
 * if(checkInput(input)){}  
 * if(input === "whoami"){}
 * var arr = ["1", 2];
 * if(arr.includes(input)){}
 * Matches a membership test used as a condition. The membership test can be an
 * EqualityTest against a static value or an Array includes call, among others;
 * see the implementation in semmle\javascript\MembershipCandidates.qll.
 */
class StaticValueBarrierGuardNode extends BarrierGuardNode {
  MembershipCandidate candidate;

  StaticValueBarrierGuardNode() { this = candidate.getTest() }

  override predicate blocks(boolean outcome, Expr e) {
    candidate = e.flow() and candidate.getTestPolarity() = outcome
  }
}

Then register the guards in the Configuration:

  override predicate isBarrierGuard(BarrierGuardNode guard) {
    guard instanceof CheckInputBarrierGuardNode
    or
    guard instanceof StaticValueBarrierGuardNode
  }
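The role of outcome is easiest to see on a small target file. The sketch below is hypothetical: the validator follows the checkInput example from the earlier snippet, and exec is passed in as a stub so the file runs without spawning processes (in a real target it would be child_process.exec). The comments describe the behavior I would expect from a guard whose blocks predicate holds only for outcome = true.

```javascript
// Hypothetical validator in the style of the earlier checkInput example.
function checkInput(value) {
  return /^[a-z]+$/.test(value);
}

function run(input, exec) {
  if (checkInput(input)) {
    // Guard evaluated to true: with outcome = true, flow of `input`
    // into this use is expected to be blocked.
    exec(input);
  } else {
    // Guard evaluated to false: nothing is blocked here, so this use
    // should still be reported.
    exec(input);
  }
}

// Stub sink that records what it was called with.
const calls = [];
run("whoami", (cmd) => calls.push(cmd));
run("rm -rf /", (cmd) => calls.push(cmd));
console.log(calls);
```

A guard that should sanitize the else branch instead would use outcome = false in blocks.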

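Another approach is to extend AdditionalBarrierGuardNode, in which case isBarrierGuard does not need to be configured. The mechanism is simple and easy to see in the source code.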

class CheckInputBarrierGuardNode1 extends AdditionalBarrierGuardNode, CallNode {
  CheckInputBarrierGuardNode1() { this.getCalleeName() = "checkInput" }

  override predicate blocks(boolean outcome, Expr e) {
    outcome = true and
    e = getArgument(0).asExpr()
  }

  override predicate appliesTo(Configuration cfg) { any() }
}

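isBarrierEdge and isAdditionalFlowStep are counterparts: the former cuts flow from one kind of node to another, while the latter adds extra flow edges. The flowStep predicate in Configuration.qll records the steps that CodeQL's interprocedural analysis defines.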

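When analyzing complex projects, the data flow may break at functions that are resolved dynamically. In that case we can add the missing step by hand.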

let data = argFunc(input); 
exec(data);

The code below connects the two nodes pred and succ so that the data flow is joined up.

  override predicate isAdditionalFlowStep(DataFlow::Node pred, DataFlow::Node succ) {
    exists(CallNode call |
      call.getCalleeName() = "argFunc" and
      pred = call.getArgument(0) and
      succ = call
    )
  }
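For context, here is one hypothetical shape argFunc could take that might defeat static call resolution. The handlers table and the mode parameter are invented for this sketch, and whether the analysis actually loses the flow here depends on the library version.

```javascript
// Hypothetical helper table; the handler is looked up by a computed key.
const handlers = {
  raw: (text) => text,
  copy: (text) => String(text),
};

function argFunc(input, mode = "raw") {
  const transform = handlers[mode]; // callee chosen at run time
  return transform(input);          // the result still derives from input
}

const data = argFunc("ls -la");
console.log(data);
```

The additional flow step above sidesteps the lookup entirely by declaring that the first argument of any argFunc call flows to the call's result.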

The complete code follows.

/**
 * @name inter-procedural data flow analysis
 * @kind path-problem
 * @problem.severity warning
 * @tags security
 * @id js
 */

import javascript
import DataFlow
import PathGraph

class MyConfiguration extends Configuration {
  MyConfiguration() { this = "MyConfiguration" }

  override predicate isSource(DataFlow::Node source) {
    exists(CallExpr pn | pn.getCalleeName() = "easySource" | source.asExpr() = pn)
  }

  override predicate isBarrier(DataFlow::Node nd) {
    super.isBarrier(nd)
    or
    nd.(CallNode).getCalleeName() = "sanitized1"
  }

  override predicate isSink(DataFlow::Node sink) {
    DataFlow::moduleMember("child_process", "exec").getACall().getArgument(0) = sink
  }

  override predicate isBarrierGuard(BarrierGuardNode guard) {
    guard instanceof CheckInputBarrierGuardNode
    or
    guard instanceof StaticValueBarrierGuardNode
  }

  override predicate isAdditionalFlowStep(DataFlow::Node pred, DataFlow::Node succ) {
    exists(CallNode call |
      call.getCalleeName() = "argFunc" and
      pred = call.getArgument(0) and
      succ = call
    )
  }
}

class CheckInputBarrierGuardNode1 extends AdditionalBarrierGuardNode, CallNode {
  CheckInputBarrierGuardNode1() { this.getCalleeName() = "checkInput" }

  override predicate blocks(boolean outcome, Expr e) {
    outcome = true and
    e = getArgument(0).asExpr()
  }

  override predicate appliesTo(Configuration cfg) { any() }
}

class CheckInputBarrierGuardNode extends BarrierGuardNode, CallNode {
  CheckInputBarrierGuardNode() { this.getCalleeName() = "checkInput" }

  override predicate blocks(boolean outcome, Expr e) {
    outcome = true and
    e = getArgument(0).asExpr()
  }
}

class StaticValueBarrierGuardNode extends BarrierGuardNode {
  MembershipCandidate candidate;

  StaticValueBarrierGuardNode() { this = candidate.getTest() }

  override predicate blocks(boolean outcome, Expr e) {
    candidate = e.flow() and candidate.getTestPolarity() = outcome
  }
}

from MyConfiguration cfg, PathNode source, PathNode sink
where cfg.hasFlowPath(source, sink)
select sink.getNode(), source, sink, sink.toString()

global taint tracking

Taint tracking is similar to data flow. Each predicate has a counterpart, and they are used in the same way.

class MyTaintTrackingConfiguration extends TaintTracking::Configuration {
  MyTaintTrackingConfiguration() { this = "MyTaintTrackingConfiguration" }

  override predicate isSource(DataFlow::Node source) { /* ... */ }

  override predicate isSink(DataFlow::Node sink) { /* ... */ }

  // optional overrides:
  override predicate isSanitizer(DataFlow::Node nd) { /* ... */ }
  override predicate isSanitizerEdge(DataFlow::Node pred, DataFlow::Node succ) { /* ... */ }
  override predicate isSanitizerGuard(SanitizerGuardNode guard) { /* ... */ }
  override predicate isAdditionalTaintStep(DataFlow::Node pred, DataFlow::Node succ) { /* ... */ }
}

According to the official library code in semmle\javascript\dataflow\TaintTracking.qll, taint tracking builds on data flow by adding additional steps for operations on strings, arrays, and so on. It abstracts these into a SharedTaintStep class: extend it and implement predicates such as uriStep, persistentStorageStep, heapStep, arrayStep, viewComponentStep, stringConcatenationStep, stringManipulationStep, serializeStep, deserializeStep, or promiseStep to add extra taint steps conveniently. An official example is Arrays.qll, which models a number of array operations so that taint flows through them.
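The difference shows up on operations that derive a new value from the input. In the hypothetical target below, the comments mark which steps I would expect plain data flow to follow and which ones need taint tracking. The function name demo is invented, and the classification is my reading of the library, not verified output.

```javascript
// Hypothetical analysis target; the step classification in the comments
// is an assumption about how the library treats each operation.
function demo(input) {
  const same = input;           // value-preserving: data flow and taint tracking
  const joined = "ls " + input; // string concatenation: taint tracking only
  const parts = [input, "-l"];  // value stored in an array element
  const cmd = parts.join(" ");  // array operation: taint tracking only
  const trimmed = input.trim(); // string manipulation: taint tracking only
  return { same, joined, cmd, trimmed };
}

const out = demo("x");
console.log(out);
```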

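The classes that extend AdditionalSanitizerGuardNode also implement many guards. These include the MembershipTestSanitizer that corresponds to our membership guard above, an emptiness check (x.length === "0"), an in check (if(x in o)), regular expression checks, and more. Reading that code is a good way to learn how to write guards.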

flow labels

Data can be tagged with labels as it propagates through the data flow, which makes more sophisticated analyses possible.

The official library defines two flow labels: data for data flow and taint for taint tracking. In the official library, an additional step keeps the flow label of pred and succ, a barrier removes the data label from a node, and a sanitizer removes the taint label.

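Sometimes we need more labels. The documentation gives a vivid example: to prevent directory traversal when filtering a path, the input must neither be an absolute path nor contain a .. segment. Each check should act as its own guard or sanitizer on the data flow, and only both together make the value safe. We can therefore change the label each time the data passes one of the guards or sanitizers and check the label in isSink.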
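The two checks can be sketched in plain JavaScript to show why neither is enough alone. The function names are invented for this sketch, and path.posix is used only so the example behaves the same on every platform.

```javascript
const path = require("path");

// Check 1: reject absolute paths (POSIX rules, for platform independence).
function isRelative(p) {
  return !path.posix.isAbsolute(p);
}

// Check 2: reject any ".." segment.
function hasNoParentSegment(p) {
  return !p.split("/").includes("..");
}

// Only the combination of both checks makes the path safe to use.
function isSafe(p) {
  return isRelative(p) && hasNoParentSegment(p);
}

console.log(isSafe("docs/readme.txt")); // passes both checks
console.log(isSafe("../secret"));       // relative, but contains ".."
console.log(isSafe("/etc/passwd"));     // no "..", but absolute
```

In flow label terms, each check would clear one label, and the sink would accept only a value for which both labels have been cleared.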

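Below I implemented a data flow analysis for the following code. Data that passes through sanitized1 should be treated as filtered, but it is URL-decoded afterward, so from that point on it may be attacker-controlled again.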

const exec = require("child_process").exec

var input = easySource();
var data = sanitized1(input)
exec(data)
var data1 = decodeURI(data)
exec(data1)

So I defined two labels: passing through sanitized1 tags the data with the SantizedLabel label, and passing through the URL-decoding function (decodeURI) converts SantizedLabel into the UrlDecodeLabel label.

/**
 * @name inter-procedural data flow analysis with label
 * @kind path-problem
 * @problem.severity warning
 * @tags security
 * @id js
 */

import javascript
import DataFlow
import PathGraph

class UrlDecodeLabel extends DataFlow::FlowLabel {
  UrlDecodeLabel() { this = "UrlDecode" }
}

class SantizedLabel extends DataFlow::FlowLabel {
  SantizedLabel() { this = "SantizedLabel" }
}

class MyConfiguration extends TaintTracking::Configuration {
  MyConfiguration() { this = "MyConfiguration" }

  override predicate isSource(DataFlow::Node source) {
    exists(CallExpr pn | pn.getCalleeName() = "easySource" | source.asExpr() = pn)
  }

  override predicate isSanitizerEdge(DataFlow::Node pred, DataFlow::Node succ) {
    exists(CallNode call | call.getCalleeName() = "sanitized1" |
      pred = call.getArgument(0) and
      succ = call
    )
  }

  override predicate isSink(DataFlow::Node sink, FlowLabel lbl) {
    DataFlow::moduleMember("child_process", "exec").getACall().getArgument(0) = sink and
    not lbl instanceof SantizedLabel
  }

  override predicate isAdditionalFlowStep(
    DataFlow::Node src, DataFlow::Node trg, FlowLabel inlbl, FlowLabel outlbl
  ) {
    exists(CallNode call |
      call.getCalleeName() = "decodeURI" and
      src = call.getArgument(0) and
      trg = call and
      inlbl instanceof SantizedLabel and
      outlbl instanceof UrlDecodeLabel
    )
    or
    exists(CallNode call |
      call.getCalleeName() = "sanitized1" and
      src = call.getArgument(0) and
      trg = call and
      outlbl instanceof SantizedLabel
    )
  }
}

from MyConfiguration cfg, PathNode source, PathNode sink
where cfg.hasFlowPath(source, sink)
select sink.getNode(), source, sink, sink.toString()

As the code above shows, almost every Configuration predicate, such as isSource, isSink, and isAdditionalFlowStep, has a variant that takes flow label parameters. The documentation and the source code give the details.

Debug

Write small predicates and run them with quick evaluation.

The semmle.javascript.explore modules contain useful debugging tools. semmle.javascript.explore.CallGraph has three predicates, callEdge, isStartOfCallPath, and isEndOfCallPath, that help explore the call graph (without global taint tracking).

semmle.javascript.explore.ForwardDataFlow and semmle.javascript.explore.BackwardDataFlow are private classes.

For Java, partial flow can be used.

Others

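The official CodeQL for JavaScript library contains a large number of useful ql libraries. For HTTP server applications, for example, both the net library and Express are supported to some degree.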

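For instance, Express::RequestInputAccess gives the HTTP input nodes of an Express application directly, ready to use as a data flow source, and convenient router information is available too.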

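npm packages are parsed as well, so exported functions, their parameters, and the like can be obtained directly. Much more can be found by browsing the official library.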
