README.md

Description

cpu_rec is a tool that recognizes CPU instructions in an arbitrary binary file. It can be used as a standalone tool, or as a plugin for binwalk (https://github.com/devttys0/binwalk).

Installation instructions

Standalone tool

Copy cpu_rec.py and cpu_rec_corpus into the same directory.
If the lzma module is not installed for your Python (the tool works with either python3 or python2 >= 2.4), decompress the corpus files in cpu_rec_corpus with unxz.
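Whether the lzma module is available can be checked before deciding to decompress the corpus. The following is a minimal sketch, assuming Python 3.4 or later (for importlib.util.find_spec); the helper name has_module is illustrative and is not part of cpu_rec.

```python
import importlib.util


def has_module(name):
    """Return True if the named top-level module can be found by the import system."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_module("lzma"):
        print("lzma is available: the .xz corpus files can be used as they are")
    else:
        print("lzma is missing: run unxz on the files in cpu_rec_corpus")
```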
To enhance the corpus, add new data to the corpus directory. To create your own corpus, see the method build_default_corpus in the source code.

For use as a binwalk module

Same as above, but the installation directory must be the binwalk module directory: $HOME/.config/binwalk/modules.

You'll need a recent version of binwalk, one that includes the patch provided by https://github.com/devttys0/binwalk/pull/241 .

How to use the tool

As a binwalk module

Add the flag -% when using binwalk.

Be patient. Waiting a few minutes for the result is to be expected. On my laptop the tool takes 25 seconds and 1 GB of RAM to create the signatures for 70 architectures, and the analysis of a binary then takes about one minute per MB. To make the tool faster, you can remove architectures that you know your binary does not use (Cray or MMIX, for example, are not typically found in firmware).

As a standalone tool

Just run the tool with the binary file(s) to analyze as argument(s). The tool will try to match an architecture for the whole file, and then to detect the largest binary chunk that corresponds to a CPU architecture; usually it is the right answer.

If the result is not satisfactory, prepending -v twice to the arguments makes the tool very verbose; this is helpful when adding a new architecture to the corpus.

If https://github.com/LRGH/elfesteem is installed, the tool also extracts the text section from ELF, PE, Mach-O or COFF files, and outputs the architecture corresponding to this section; the ability to extract the text section is also used when building a corpus from full binary files.

Option -d followed by a directory dumps the corpus in that directory; using this option one can reconstruct the default corpus.

Examples

Running the tool as a binwalk module typically results in:

shell_prompt> binwalk -% corpus/PE/PPC/NTDLL.DLL corpus/MSP430/goodfet32.hex

Target File:   .../corpus/PE/PPC/NTDLL.DLL
MD5 Checksum:  d006a2a87a3596c744c5573aece81d77

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             None (size=0x5800, entropy=0.620536)
22528         0x5800          PPCel (size=0x4c800, entropy=0.737337)
335872        0x52000         None (size=0x1000, entropy=0.720493)
339968        0x53000         IA-64 (size=0x800, entropy=0.491011)
342016        0x53800         None (size=0x22000, entropy=0.727501)

Target File:   .../corpus/MSP430/goodfet32.hex
MD5 Checksum:  4b295284024e2b6a6257b720a7168b92

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             MSP430 (size=0x5200, entropy=0.472185)
20992         0x5200          None (size=0xe00, entropy=0.467086)

We can notice that during the analysis of PPC/NTDLL.DLL a small chunk has been identified as IA-64. This is an erroneous detection, due to the fact that the IA-64 architecture has statistical properties similar to data sections.
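Small spurious chunks such as this one can be filtered out when post-processing the output. The following is a minimal sketch that parses result rows in the format shown above and keeps the largest chunk with a recognized architecture; the names ROW, parse_rows and largest_chunk are illustrative and are not part of cpu_rec or binwalk.

```python
import re

# One result row: decimal offset, hex offset, architecture, size and entropy.
ROW = re.compile(
    r"^(\d+)\s+0x([0-9a-fA-F]+)\s+(\S+) "
    r"\(size=0x([0-9a-fA-F]+), entropy=([0-9.]+)\)$"
)


def parse_rows(text):
    """Return one dict per result row found in the given output text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append({
                "offset": int(m.group(1)),
                "arch": m.group(3),
                "size": int(m.group(4), 16),
                "entropy": float(m.group(5)),
            })
    return rows


def largest_chunk(rows):
    """Return the largest row whose architecture is not None, or None."""
    known = [r for r in rows if r["arch"] != "None"]
    return max(known, key=lambda r: r["size"]) if known else None


# Rows copied from the NTDLL.DLL example above.
SAMPLE = """\
0             0x0             None (size=0x5800, entropy=0.620536)
22528         0x5800          PPCel (size=0x4c800, entropy=0.737337)
335872        0x52000         None (size=0x1000, entropy=0.720493)
339968        0x53000         IA-64 (size=0x800, entropy=0.491011)
342016        0x53800         None (size=0x22000, entropy=0.727501)
"""

if __name__ == "__main__":
    best = largest_chunk(parse_rows(SAMPLE))
    print(best["arch"], hex(best["size"]))  # PPCel 0x4c800
```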

If the entropy value is above 0.9, the data is probably encrypted or compressed, and the result of cpu_rec is therefore likely to be meaningless.
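To illustrate why high entropy points to compressed or encrypted data, the following sketch computes a byte-level Shannon entropy scaled to the range 0 to 1. This is an assumption about what the entropy column represents: this README does not state the formula cpu_rec uses, and the function name normalized_entropy is illustrative.

```python
import math
import os
from collections import Counter


def normalized_entropy(data):
    """Shannon entropy of the byte distribution, scaled to the range [0, 1]."""
    if not data:
        return 0.0
    counts = Counter(bytearray(data))
    total = float(len(data))
    bits = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return bits / 8.0  # 8 bits per byte is the maximum possible entropy


if __name__ == "__main__":
    # Constant data carries no information; random data is close to the maximum.
    print(normalized_entropy(b"\x00" * 4096))
    print(normalized_entropy(os.urandom(65536)))
```

Random bytes, which resemble encrypted or well-compressed data, score close to 1 with this measure, whereas machine code usually has visible structure and scores lower.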

Known architectures in the default corpus

68HC08 68HC11 8051 Alpha ARcompact ARM64 ARMeb ARMel ARMhf AVR AxisCris Blackfin Cell-SPU CLIPPER CompactRISC Cray Epiphany FR-V FR30 FT32 H8-300 HP-Focus HP-PA i860 IA-64 IQ2000 M32C M32R M68k M88k MCore Mico32 MicroBlaze MIPS16 MIPSeb MIPSel MMIX MN10300 Moxie MSP430 NDS32 NIOS-II OCaml PDP-11 PIC10 PIC16 PIC18 PIC24 PPCeb PPCel RISC-V RL78 ROMP RX S-390 SPARC STM8 Stormy16 SuperH TILEPro TLCS-90 TMS320C2x TMS320C6x V850 VAX Visium WE32000 X86-64 X86 Xtensa Z80 #6502#cc65

Licence

The tool

The cpu_rec.py file is licenced under the Apache Licence, Version 2.0.

The default corpus

The files in the default corpus have been built from various sources. The corpus is a collection of compressed files. Each compressed file is dedicated to the recognition of one architecture and is made by compressing the concatenation of one or more binary chunks, which come from various origins and have various licences. Therefore, the default corpus is a composite document, each sub-document (the chunk) being redistributed under the appropriate licence.

The origin of each chunk is described in cpu_rec.py, in the function build_default_corpus. The licences are:

The files libgmp.so, libc.so and libm.so come from Debian binary distributions and are distributed under GPLv2 (and LGPLv3 for recent versions of libgmp); the source code is available from http://archive.debian.org/.
The busybox binaries come from https://busybox.net/downloads/binaries/ and are distributed under GPLv2.
The C-Kermit binaries come from ftp://kermit.columbia.edu/kermit/bin/ and are distributed under GPLv2 (according to ftp://kermit.columbia.edu/kermit/archives/COPYING but the status of each binary is not always clear).
All files identified in build_default_corpus as part of the CROSS_COMPILED subdirectory have been built by myself. The corresponding source code is zlib (from http://zlib.net/, distributed under the zlib licence), libjpeg (from http://www.ijg.org/, distributed under an unknown licence), or some other code based on public sources (e.g. https://anonscm.debian.org/cgit/pkg-games/bsdgames.git/tree/arithmetic/arithmetic.c modified to work with SDCC compilers).
The camlp4 binary is built from https://github.com/ocaml/camlp4 and distributed under LGPLv2.
The binary for TMS320C2x comes from https://github.com/slavaprokopiy/Mini-TMS320C28346/blob/master/For_user/C28346_Load_Program_to_Flash/Debug/C28346_Load_Program_to_Flash.out where it is distributed under an unknown licence.
The binary for RISC-V comes from https://riscv.org/software-tools/ where it is distributed under GPLv2, and can be downloaded at https://github.com/radare/radare2-regressions/blob/master/bins/elf/analysis/guess-number-riscv64
The binaries for PIC10 and PIC16 come from http://www.pic24.ru/doku.php/en/osa/ref/examples/intro where they are distributed under an unknown licence.
The binary for PIC18 comes from https://github.com/radare/radare2-regressions/blob/master/bins/pic18c/FreeRTOS-pic18c.hex where it seems to be distributed under GPLv3 (or later).
The binary for PIC24 comes from https://raw.githubusercontent.com/mikebdp2/Bus_Pirate/master/package_latest/BPv4/firmware/bpv4_fw7.0_opt0_18092016.hex where it is distributed under Creative Commons Zero.
