DGTEFF

This document explains in detail how to start exploring and examining file formats, with a focus on Game Resource Archives. For beginners and advanced users alike.
The definitive word in archive exploration.

Download below, or scroll on down and read it here:

DGTEFF as PDF

DGTEFF as ZIPPED PDF

Authors: Mr.Mouse and Watto

Version: 1.0 as of November 2004

Rewritten for the WIKI by Dinoguy1000 as of August 2006


Title page


THE DEFINITIVE GUIDE TO EXPLORING FILE FORMATS  

Revision 2

WATTO

(www.watto.org)

Mike Zuurman

(www.xentax.com)

Introduction

General Introduction

Computer games are vast and many, covering a wide range of genres and game styles, but there is one fundamental feature that all games require - resources. Every game has a range of resources that help make it unique - from texture images to audio soundtracks. With all these resources, there needs to be a way they can be stored so that games can use them, and the way this is typically done is to store them in a big archive file.

An archive is a single computer file that contains the data for several smaller files. A common analogy would be a cardboard box - it can be used to store a lot of different items (paper, food, objects), and each item can have different properties (size, color, shape).

The question that may arise is "why do game developers use archives to store their game resources? Wouldn’t it be easier to just store all the files normally?" The answer is yes, storing the files normally would be much easier, and certainly much better during the game development, but before the final production they are packaged into archives for several reasons…

  • An archive can store a lot of files in a single location, so it is quicker to access the files from a hard disk or CD
  • A large archive, due to it being in 1 block on the disk, can utilise features such as file buffers, further increasing read performance
  • It reduces the number of files on the disk, making the reading of the file index quicker
  • The files can be hidden away, making it harder to hack or modify the game
  • All files can be accessed using a single file stream, reducing the time required to generate file stream objects, and making the file access programming simpler
  • Files can be compressed easily, and other information such as file descriptions and ID numbers can be stored

Purpose Of This Book

Unfortunately, there is a downside to using archives - there are no real standards defined for the creation and use of archives. In order to read or write archives for a particular game, someone usually needs to analyse the file themselves, or perform other complicated and time-consuming tasks such as reverse engineering or hex editing.

Some of the more modern games produced these days recognise that they can gain extra advertising by allowing the internet community to mod their games. Due to this, some game developers have changed to supporting standard archive types, such as Zip archives, however there is still an overwhelming number of games with their own proprietary archive formats.

Mod, short for modification, refers to the alteration of a computer game by a member of the internet community, usually to support extra functionality or to generate a different game built on top of the original. Some examples include changing the sounds and textures used by a game, or creating new game maps.

This book aims to provide an insight into the way game archives are created, and how to analyse an archive to locate the files contained within. In the following pages, we will discuss some of the basic fundamentals of computer-stored numbering, common structures used by most archives, compression, encryption, and the tools that you can use to help get the job done. Hopefully, by the time you have finished reading this book, you will be able to analyse your own archives, and take the first step towards your own development and game modding.

Thanks for reading our book - we wish you the best of luck in your exploration.

Formatting Used In This Book

  • Link: A link to a website of interest, or for further information.
  • Link: A link to a different section of the document.
  • Term: An important term, or a term that is being defined.
  • Value: A value, usually in an example.
  • Caption: A caption for an image, or a reference to some information in the image.
  • Reference: A tool reference, such as a menu, button, or action in a specific program.

A general comment, or clarification of a point.

Brief descriptions of a term, related notes, or other supplementary material will be presented in a box like this. This will often accompany a term.

What is a GRAF?

The term GRAF describes the way a game archive is constructed, and in particular, the storage of the files within the archive. The format of an archive usually differs between each individual game, however occasionally a game developer will stick with a particular format for a few games of the same vintage, particularly if the games are built using the same underlying game engine.

GRAF stands for Game Resource Archive Format, which is most simply the specifications describing the format of a particular archive.

Programmers usually define their GRAFs according to the needs and structure of the game itself. For example, data on an XBOX game console is read in blocks of 2048 bytes - the GRAFs for most XBOX games are built around this block size so the game data can be loaded efficiently.

The development of a GRAF is particularly troublesome - there is a constant weigh-up between factors such as efficient storage, quick loading, and fast targeting. One of the things that has great influence is human readability - the things that make archives easy for humans to use often make them less efficient. For example, storing filenames in an archive tells humans the purpose and type of the data, however it is very inefficient and slow to read filenames from an archive - thus the weigh-up.

Efficient storage: Files need to be stored in a way that conserves space on the disk and/or in memory.

Quick Loading: When the game is loading, the required resources are loaded into memory - this needs to be done quickly, while still gathering all the required information.

Fast Targeting: When a resource is loaded into memory, it needs to be quick and easy for the game to find the file. This is usually a big weigh-up between human readability (filenames) vs. computer efficiency (hash fields and trees).

During game development, the actual resources used by the game change frequently. To make it quick and easy to adapt to these changes, the GRAF is usually structured following a common and recognisable pattern, some of which will be described in later chapters.

Tools of the Trade

Hex Editors

The generic hex editor is the main type of program used to view data in non-text files, such as archives. Similarly to the way word processors display text data, a hex editor displays the contents of a file using hex characters.

Hex characters are an alternate way to represent the byte data in a file. Whereas word processors display byte values as letters, a hex editor displays each byte as a 2-character code that can represent all possible byte values 0-255 (00-FF). The way to read and construct hex values is discussed in a later chapter.

There are literally hundreds of hex editors available for use - the one you choose is a matter of personal taste. All hex editors have the same basic functionality, but some provide other tools and features that make it quicker and easier to work with files. Most hex editors are freely available over the Internet.

Below, we provide a brief introduction to our own preferred programs, so you can see the general style and features available to you. This list is personal preference only - we encourage you to actively seek out your own preferred programs.

Hex Workshop from Breakpoint Software will be used for the examples and screenshots in this book, however the processes and screens should be similar across all hex editors. Hex Workshop includes several handy functions for analysis work, such as:

  • A hexadecimal calculator
  • Lists of the data types at the current location in the file
  • Bookmarking
  • Colour mapping.
Hex Workshop is available from http://www.bpsoft.com

While we encourage you to try many different programs, the one you ultimately choose should be based on your needs.

Hex Workshop

Here we present a brief introduction into the use of Hex Workshop. Although this is the main program that will be used for the screenshots in this book, take note that almost everything in this program can be applied to other hex editors, including the interface structure and layout.

Figure 3.1.1a: General layout of Hex Workshop

A. Hexadecimal representation of the file content

B. ASCII interpretation of the file content

C. Different representations of the data at current cursor position

D. User-assigned bookmarks and their descriptions

When you have installed Hex Workshop, a convenience link is added to the context menu of Windows Explorer. Just right-click on a file and select "Edit with Hex Workshop" to open the file in the program.

The context menu is the menu that appears when you right-click in a Windows program. It is so named because the options in the menu depend on the context of the right-click. For example, right-clicking on a file gives different choices than right-clicking on a selected piece of text.

Once you have opened a file, you will be presented with a view similar to that depicted in Figure 3.1.1a. You can examine the file's hexadecimal interpretation in section A, or the ASCII interpretation of the same bytes in section B. The column at the far left shows the offset of each line shown.

An offset is the location of the file data in relation to the start of the file. For example, an offset of value 560 means there are 560 bytes of data before you reach the current location.

In this example, we have opened one of the *.pk4 files from the game Doom 3. We will later see that these are actually generic *.zip files. For now, you can see the file starts with the characters PK. The characters at the beginning of a file are often referred to as a header, ID tag, or magic number - and are usually a reliable way to identify whether the file is a common type. For example, all *.zip archives have the characters PK at the beginning, therefore there is a strong probability that the archive in our example is a *.zip archive. A brief list of some common header tags can be found later in the book.

A header tag is simply a small group of bytes at the start of a file that helps to identify the format of the remaining data. The header tag is usually a 4-byte string, however it can also be a preset set of byte values. While it is true that a file's extension can help determine a file format, it is often unreliable and can be easily changed, whereas a header tag is harder to alter and is usually unique. In reality, the best way to determine a file's format is to use a combination of the file extension and the header tag.
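As a quick illustration of the header tag check described above, here is a small Python sketch. The filename and the 2-byte check are examples only - a real identification routine would combine the tag with the extension:

def looks_like_zip(path):
    # Read the first 2 bytes and compare them to the *.zip header tag "PK".
    with open(path, "rb") as f:
        return f.read(2) == b"PK"

print(looks_like_zip("pak000.pk4"))   # the filename here is hypothetical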

The current position of the cursor in our example is at offset 18. The Data Interpreter in section C shows the different interpretations of the data at this file position, ranging from numbers to strings. The different data interpretations are covered more completely in a later chapter.

In our example image, we have color mapped and bookmarked (as in section D) some areas of our interest. Any range of bytes can be bookmarked or color mapped - simply click and drag the cursor along your area of interest and select the appropriate option from the context menu. When you make a bookmark, you can choose the data interpretation of the selection (its value), and give a description. The bookmarks will be shown with their offset in the file and the length in bytes. This is a very useful feature, as it allows you to click on a bookmark to jump to that offset.

Color mapping: assigns a color to the selected area, to make it stand out.   Bookmarking: records the current cursor location in section D, with a user-defined description.

Hex Workshop has the ability to save the bookmarks and color maps, so that you can load them on another file and see if the pattern matches. ie, if you have solved the pattern of a GRAF, you can apply the bookmarks and colour mapping to other files that you expect to have the same format.

Hex Workshop has another handy function - GoTo. If you select a range of bytes in a file and choose GoTo from the context menu, you can jump to the location identified by the selected value.

Terms, Definitions, and Data Structures

To understand the patterns and construction of archives, we must first introduce the concept of data structures, and some of the fundamentals of computerized data.

Files

A computer file is a series of bytes stored one after the other which, when combined together, form a representation of a piece of data. If you have a file that is 12 bytes in size, it indicates that there are 12 single bytes of data that are used to represent the entire document.

The term File stems from the original metaphor of the computer as an office replacement. As in a work office, files were organised into folders, where each folder contained a group of related files.

File sizes start at the single byte, and change to a new term at every increment of 1024 (although, for ease of human use, most people refer to increments of 1000). The following table shows the increments of file size terms:

Byte (B)        1
Kilobyte (KB)   1,024 bytes (about 1 thousand bytes)
Megabyte (MB)   1,048,576 bytes (about 1 million bytes), or 1,024 KB
Gigabyte (GB)   1,073,741,824 bytes (about 1 billion bytes), or 1,048,576 KB, or 1,024 MB
Terabyte (TB)   1,099,511,627,776 bytes (about 1 trillion bytes), or 1,073,741,824 KB, or 1,048,576 MB, or 1,024 GB

In actual fact, computer data is stored using bits, not bytes. A bit is the smallest unit that a computer can deal with, however all modern file systems treat a byte as being the smallest unit, as a byte is capable of storing relatively useful information. It is impossible to store a single bit in a modern file system - the best that can be done is to store a single byte that has the same value as the bit.

When we talk about the basic structure of a file, we typically think in terms of bytes. However, at its absolute simplest, the actual underlying file structure is a sequence of bits or binary values. We don’t usually deal with this level of representation because binary values don’t have the ability to represent anything meaningful. However, when grouped into sets of 8 bits, the range of information that can be stored becomes satisfactory.

A bit, or binary value, is the language of a computer, and thus the underlying structure of everything readable by a computer. A bit only has 2 possible values – 1 or 0 – thus it is obvious why they are limited in what they represent.

The 2 possible values of a bit, 0 and 1, are also commonly referred to as being either false or true (respectively). It can therefore be said that a bit is either a true-bit or a false-bit.   Sometimes, although less common, a bit with value 0 is referred to as being disabled, and value 1 is enabled. This can sometimes help in user understanding, depending on the context of the discussion.

A byte, the fundamental building block of files, is constructed using a group of 8 bits. The combination of 8 bits allows a byte to hold any value between 0 and 255, far more than the 2 possible values available to a single bit.

So how do the grouped bits represent a larger numerical value such as that of a byte? This is achieved quite easily by referring to each of the 8 bits as an increasing power of 2.

If we take a look at a single bit, we can think of it as having either the value 1x2^0 or 0x2^0 - thus giving us the values 1 or 0 respectively. If we add a bit to the left, the power of the new bit is either 1x2^1 or 0x2^1 - either 2 or 0. By adding the values of these 2 bits together, you should be able to see that all possible combinations will give us the values 0, 1, 2, and 3, as shown in the table below:

Bit 1 (2^1)   Bit 0 (2^0)   Value
0             0             0 (0x2^1 + 0x2^0)
0             1             1 (0x2^1 + 1x2^0)
1             0             2 (1x2^1 + 0x2^0)
1             1             3 (1x2^1 + 1x2^0)

If we continue this pattern for the remaining 6 bits, our highest bit will provide the power 2^7. If all 8 bits are enabled, we end up with the number 255 (1x2^7 + 1x2^6 + … + 1x2^1 + 1x2^0). Appendix 1 provides a list of all possible byte values, and their bit values.
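The same calculation is easy to reproduce in a few lines of Python, which can be handy for checking your own working. This is simply a sketch of the bit x power sum described above:

def bits_to_value(bits):
    # The rightmost character is 2^0, the next one 2^1, and so on.
    return sum(int(b) * 2**power for power, b in enumerate(reversed(bits)))

print(bits_to_value("11"))         # 3
print(bits_to_value("11111111"))   # 255 - all 8 bits enabled
print(int("11111111", 2))          # Python's built-in base-2 conversion agrees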

Bytes

As described above, a byte is comprised of 8 bits, and can thus contain a value between 0 and 255. Bytes are the smallest unit that a modern file system can deal with, so for the majority of file format analysis you will only need to look at the byte level.

All files are stored and accessed using bytes. When you open a file in a program or game, the bytes of the file are interpreted according to the logic of the application. For example, a word processor treats all bytes as being letters or numbers, whereas a hex editor displays bytes as hex codes. Hex codes will be discussed in a later chapter.

16-bit (2-byte) numbers

From this point forward, we need to be careful when referring to particular data types. Why? Because as computers and programming languages have evolved, the terminology has changed and confusion can arise. Therefore we will primarily refer to each data type by the number of bits or bytes that comprise it.

We will also briefly introduce the terms for each group of programming languages, so you will be able to program with them.

A 16-bit value is commonly known in older programming languages as a word or an Integer. Newer programming languages call it a Short.

The term older programming languages refers to the language C++, and any language derived before it, such as C, Visual Basic (1.0 - 6.0), ASP, Perl, Pascal, etc.

The term newer programming languages refers to languages derived after C++, such as Java, Python, Delphi, and the .NET languages (C#, VB.NET, ASP.NET, J#).

A 16-bit number is just as the name suggests, a number created by 16 bits in a row. To determine the value of the 16-bit number, we follow the same process as when we wanted to get the value of a byte.

Each of the 16 bits that make up the 16-bit number represents a power of 2 - the leftmost bit represents 2^15 and the rightmost bit 2^0. Just as with bytes, we go through each bit and calculate bit value x power.

An example - let's say we have the following 16 bits…

0101111000001100

Working from left to right, we get the value…

0x2^15 + 1x2^14 + 0x2^13 + 1x2^12 + 1x2^11 + … + 1x2^2 + 0x2^1 + 0x2^0

If you work this out, you should end up with the number 24076.

If all 16 bits had the value 1, you would end up with the number 65535 – therefore the value of a 16-bit number ranges between 0 and 65535.
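You can verify both of these values with Python's built-in base-2 conversion:

print(int("0101111000001100", 2))   # 24076 - the 16-bit example above
print(int("1111111111111111", 2))   # 65535 - all 16 bits enabled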

32-bit (4-byte) numbers

A 32-bit number follows the same principles as a 16-bit number, with the exception that there are now 32 bits that represent the value. Therefore, the highest bit has the value 2^31 and the lowest bit has the value 2^0.

The bit with the highest power value is known as the high-order bit, and in the same vein, the bit with the lowest power value is the low-order bit. The same terms apply to the bytes of a multi-byte number: the byte holding the highest powers is the high-order byte, and the byte holding the lowest powers is the low-order byte. The order in which these bytes are stored in a file is called the endian ordering (little-endian or big-endian) - more information about endian ordering will be presented in a later chapter.

If all the bits for a 32-bit number were enabled, we would have the value 4,294,967,295, thus the range of values for a 32-bit number is 0 to 4,294,967,295.

A 32-bit number in older programming languages is known as a dword or a Long. In newer languages, this is known as an Integer.

Dword is an abbreviation for Double Word, meaning that a dword has double the number of bits that a word has (ie 32 = 2x16).

64-bit (8-byte) numbers

As with 32-bit numbers, 64-bit numbers can be calculated with the highest bit valued 2^63 and the lowest 2^0. Thus, the range is 0 to a massive 18,446,744,073,709,551,615. Due to the extreme size of this number, it is not expected that we will ever need to define a larger term.

64-bit numbers are not supported by some of the older programming languages - those that do support them call this type a qword. All newer programming languages refer to this data type as a Long.

64-bit numbers are relatively new concepts in the computer world, brought on by the ever-increasing size of hard drives, and technologies such as DVD. Old file systems such as FAT-32 (used by Windows 95 and Windows 98) were, as the name suggests, built around 32-bit numbers, but this inherently caused a problem with large files. Because a 32-bit number has a maximum value of 4,294,967,295, files larger than about 4.3GB were not possible, and the file system could not address storage beyond similar limits. Due to this problem, 64-bit numbers were introduced, which allow for practically unlimited amounts of storage space. 64-bit numbers are used in more modern file systems (NTFS for Windows XP), and for technologies like DVD that have large storage space.   A similar situation occurred during the transition from Windows 3.1 to Windows 95, where computer systems that were originally built on the FAT-16 16-bit file system were upgraded to FAT-32.

Strings

One of the most common tasks performed on a computer is word processing, so naturally we need some way of representing text in a document. A piece of text in a document is called a String, which more formally means a sequence of characters.

You need to be careful when using the term character, as it can be different depending on the programming language, and indeed depending on the language of your country. A character in older programming languages is usually the same as a byte, whereas in newer languages it is often the same as a 16-bit short. If the game or file was developed in a primarily English-speaking country (as most are), characters will usually be bytes regardless of the programming language used to write the game. Games from non-English speaking countries will usually be 16-bit shorts.

Although there are many languages in the world, the dominant language of early computing was English, which is written using the Latin alphabet. The English script consists of 52 letters (upper and lower case), 10 numbers, and about 30 symbols. Seeing as this adds up to about 92, it seems quite logical that we can represent each character as a different byte value (remembering that a byte supports 256 different values). This is exactly what happens when you open a text document in a word processor – the word processor reads the bytes of the file and represents each byte value as a character.

For example, when the word processor reads a byte with value 65, it displays the letter "A". The byte value 100 represents the letter "d". Therefore, you can open any file in a word processor and it will be displayed as characters, regardless of whether it is a text document or not - the word processor simply doesn’t know that it isn’t a text file. The representation of a byte as a character is defined as ASCII, for which the character associations are listed in Appendix 2.

ASCII stands for American Standard Code for Information Interchange, which was originally defined as a 7-bit character system (as all letters and numbers account for less than 128 values). As computer systems evolved, bytes became the standard unit of the computer, and as such the ASCII standard was adjusted into a full 8-bit character system. The original letters remained the same, with the 8th bit having the value 0. The newly-created 128 characters (the ones with the 8th bit equal to 1) were assigned to additional common characters such as letters with accents, foreign currency symbols, and other miscellaneous symbols like fractions and degrees.

To expand the computing world into other languages, it became apparent that there are hundreds more letters and symbols - much more than the originally-defined 256. Therefore, an alternate character scheme called Unicode was created, which uses 2 bytes to represent each character rather than the usual 1 byte. To accommodate the original ASCII coding scheme, the value for each ASCII character is the same as the value for the first byte in each Unicode character, with the second bytes having the value 0.

It is usually easy to determine whether a string is ASCII or Unicode. ASCII strings are easy to read in a hex editor, whereas the same English string represented as Unicode has a null byte between each letter (the second byte of each Unicode character.)

Here is an example string represented as ASCII and Unicode. Note that the null bytes are represented with a . symbol, as is common in many hex editors.

Original:

When I run fast, my legs get tired. ASCII:

When I run fast, my legs get tired. Unicode:

W.h.e.n. .I. .r.u.n. .f.a.s.t., . .m.y. .l.e.g.s. .g.e.t. .t.i.r.e.d.. .

As you can see, the ASCII string appears the same as it would in a word processor, whereas the Unicode string consumes 2 bytes and thus seems padded out. Note that every character in the Unicode string has 2 bytes, including the spaces and commas.
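A small Python sketch makes the size difference obvious - 'utf-16-le' is the 2-bytes-per-character encoding that matches the Unicode layout described above:

text = "When I run fast, my legs get tired."

ascii_bytes = text.encode("ascii")          # 1 byte per character
unicode_bytes = text.encode("utf-16-le")    # 2 bytes per character

print(len(ascii_bytes))     # 35
print(len(unicode_bytes))   # 70
print(unicode_bytes[:8])    # b'W\x00h\x00e\x00n\x00' - note the null byte after each letter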

Hexadecimal Numbering

Hexadecimal numbering is an alternate way to represent the byte values 0-255. In traditional numbering, you need 1-3 characters to display the possible values for a byte. For example, the number 5 requires 1 character, whereas the number 113 requires 3 characters. Hexadecimal numbering was introduced to represent all byte values using exactly 2 characters, which means that bytes can easily be arranged into neat rows and columns.

Recall that a byte contains a value between 0 and 255. To write any of these values in hexadecimal, we split it into 2 characters, each representing a power of 16. This is done in a similar way to binary numbers, where each character represents a power of 2.

A problem arises: how do we represent 16 possible values in a single character? It is obvious that the values 0-9 can be represented as normal numbers, and for the values 10-15 we assign the letters A through F respectively. For example, the letter C in hexadecimal represents the value 12.

So how do we write a number in powers of 16? As mentioned earlier, the byte value is split up into 2 characters, with the first character representing 16^1 and the second character representing 16^0. You should notice that this is the same way bits are joined together to form a byte.

The second character of the pair can take any value between 0 and 15 (labelled 0 through F), where the value represents number x 16^0. So, if the second character was 6, it would represent 6x16^0 - the value 6. If the second character was B, it would similarly represent 11x16^0 - the value 11.

The first character of the pair represents the value number x 16^1. So, if the first character was 2, it would represent 2x16^1 - the value 32.

Let's look at a full example now. If we are given the hexadecimal value 1F, what does it represent? The 1 means 1x16^1, and the F means 15x16^0. Added together, we get 16 + 15, the value 31. Similarly, the hexadecimal number E3 represents 14x16^1 + 3x16^0, the number 227.

It should be clear to you now that we can represent any byte (values 0 through 255) in the hexadecimal number system using the values 00 through FF. If we are writing a hexadecimal number in a document, we use the format &h# . For example, &hE3 means E3 in the hexadecimal coding scheme.
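These conversions are easy to check in Python - int() with base 16 parses a hexadecimal string, and hex() goes the other way (Python writes hexadecimal with a 0x prefix rather than &h):

print(int("1F", 16))   # 31
print(int("E3", 16))   # 227
print(hex(227))        # 0xe3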

Signed and Unsigned Numbers

Hopefully by now you can clearly see how numbers are stored in files, and even how strings are stored, but what about negative numbers? Luckily, negative numbers are really easy.

There are only 2 possible types of numbers – either positive or negative. This maps perfectly with a single bit of value 0 or 1 respectively.

Rather than add an extra bit to a number, we take the bit with the highest value and interpret it as a positive or negative sign. In an 8-bit number, for example, you would count all the bits from 2^0 to 2^6, and the value of the 2^7 bit will determine whether the value is positive or negative.

You should note that because the highest bit is being used for another purpose (identifying positive/negative), it cannot be used as part of the number itself. This effectively cuts the positive range of the number in half. In our example, you would normally be able to have any value between 0 and 255, however with the sign bit we now have numbers between -128 and 127. As there is no such thing as -0, the bit code 10000000 is given the value -128.

Here we need to introduce a way of knowing whether a number will be positive-only, or a positive/negative number. We therefore use the term signed to indicate that the highest bit is used as a sign, or the term unsigned to indicate the number is always positive. Therefore, if you are told a 16-bit number is unsigned, you will know the number ranges between 0 and 65535. However, if it was a signed 16-bit number, it would range between -32768 and 32767.

The type of file usually determines whether the numbers are signed or unsigned. For example, archives and images are almost solely unsigned values. 3D-related files are often signed, as it is possible to have points in the negative as well as the positive plane.

You should note that signed numbers are extremely rare for archives, and as such, you should assume all numbers used in archives and in this document are unsigned.
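If you want to see the difference in practice, Python's struct module can read the same bytes as either type - the format character 'H' means unsigned 16-bit and 'h' means signed 16-bit (the '<' prefix selects the byte order covered in the next section). The 2 bytes here are just an example:

import struct

data = b"\xFF\xFF"                      # two example bytes, all bits enabled
print(struct.unpack("<H", data)[0])     # 65535 - read as unsigned
print(struct.unpack("<h", data)[0])     # -1    - read as signed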

Big-Endian and Little-Endian

If you paid close attention, you will have noticed that whenever we calculated a number, the bit with the highest value was always on the left, and the lowest value on the right. That is how humans write numbers, but when a number larger than one byte is stored in a file, the bytes that make it up can be written in either order. So once again, we need to define some terms so that people know which order we are talking about.

In Little-Endian order, the low-order byte is stored first and the high-order byte last. This is the order used by PCs, and the one we will be using in this document - unless stated specifically, you should assume that Little-Endian order is used in any file. The alternative, where the high-order byte is stored first, is Big-Endian ordering.

So let's see an example. Take the following 2 bytes as they appear in a file:

0C 5E

If the file uses Little-Endian ordering, the first byte (0C) is the low-order byte, so the 16-bit value is &h5E0C - the number 24076.

However, in Big-Endian ordering the first byte is the high-order byte, so the same 2 bytes would be read as &h0C5E - the number 3166.

It is always important to read the numbers in the correct order, otherwise you will end up with numbers that are meaningless and incorrect. As mentioned, if you don’t know which order to use, assume Little-Endian ordering - we will be using Little-Endian order for all examples in this document.
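Python's struct module makes the ordering explicit - '<' reads Little-Endian and '>' reads Big-Endian. Using the 2 example bytes from above:

import struct

data = b"\x0C\x5E"                      # the bytes 0C 5E, as they appear in the file
print(struct.unpack("<H", data)[0])     # 24076 - Little-Endian (&h5E0C)
print(struct.unpack(">H", data)[0])     # 3166  - Big-Endian (&h0C5E)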

File Offsets

One of the most fundamental concepts in format exploring is the file offset. A file offset is the position of a certain piece of data in a file, measured from the first byte of the file. However, as with most computer programming, we start our counts at 0, not at 1. Therefore, if we are at the very beginning of the file, before we read anything, we are at offset 0. After we read 1 byte, we are at offset 1. Read another 6 bytes and we are at offset 7.

If the concept is a little hard to grasp, think of an offset as being a bar that divides a file up byte-by-byte. If we are at the beginning of a file, offset 0, we have a bar right at the beginning before the first byte

I0110001011011000001011110

If we are at offset 3, we place the bar after the 3rd byte of the file, and before the 4th byte (ie. we have read 3 bytes)

011I0001011011000001011110

Similarly, offset 16 places the bar after byte 16, and before byte 17

0110001011011000I001011110
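To make the idea concrete, here is a small Python sketch that reads some bytes and watches the offset move (the data is invented purely for illustration):

import io

f = io.BytesIO(b"PK\x03\x04 some example archive data....")

print(f.tell())   # 0 - before reading anything we are at offset 0
f.read(1)
print(f.tell())   # 1 - after reading 1 byte we are at offset 1
f.read(6)
print(f.tell())   # 7 - reading another 6 bytes puts us at offset 7
f.seek(16)        # jump straight to offset 16
print(f.tell())   # 16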

Archive Patterns

There are literally thousands of different archives out there, however most archives will conform to one of several basic patterns. Here we present the basic archive patterns so you can understand how the files are built, and how they can be read. Once you know the patterns, you can identify archive formats much faster.

Note that the samples presented here list some fields like the number of files and the header. These fields are not fixed, and indeed may be totally different to the structure of your archive. You should use the samples presented here as a guide to the overall structure, not as an exact guide to a specific format.

In these examples, the numerical value at the start of each field indicates the number of bytes used to contain the field value. For example, the line 4 - Directory Offset shows that there are 4 bytes used to store the directory offset. This field would thus be read as a 32-bit number, as described earlier.

Note that most fields will either be 2, 4, or 8 bytes in length, corresponding to the data types presented earlier (16-bit, 32-bit, and 64-bit respectively). The main exception is the filename field, which naturally could be any arbitrary length.

Directory Archives

Directory Archives are by far the most common structure in use today. As the name suggests, these archives store a directory that lists details about all the files, such as their name, offset and length. These archives are usually simple and very easy to read.

The directory can be stored anywhere in the archive, however it is typically close to the beginning or the end. If the directory is not at the start of the archive, there will typically be a field that tells the offset to the directory, so that you can find it easily. This field is called the directory offset, and is usually found in the header or at the very end of the archive.

Here is a sample graphic representation of this archive structure:

Archive Header
  4 - Header Tag (String)
  4 - Number of Files

Directory
  File Entry 1
    4 - File Offset
    4 - File Size
    X - Filename
  File Entry 2
    4 - File Offset
    4 - File Size
    X - Filename
  …
  File Entry n
    4 - File Offset
    4 - File Size
    X - Filename

File Data
  File Data 1
    X - File Data
  File Data 2
    X - File Data
  …
  File Data n
    X - File Data
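To show how such a layout is read in practice, here is a minimal Python sketch that follows the sample structure above exactly. The field order and sizes are the hypothetical ones from the diagram (a real GRAF will differ), and the filename is assumed to be a null-terminated string:

import struct

def read_directory_archive(path):
    # Returns the header tag and a list of (filename, offset, size) entries.
    entries = []
    with open(path, "rb") as f:
        tag = f.read(4)                                     # 4 - Header Tag (String)
        num_files = struct.unpack("<I", f.read(4))[0]       # 4 - Number of Files
        for _ in range(num_files):
            offset, size = struct.unpack("<II", f.read(8))  # 4 - File Offset, 4 - File Size
            name = bytearray()                              # X - Filename (assumed null-terminated)
            while (b := f.read(1)) not in (b"", b"\x00"):
                name += b
            entries.append((name.decode("ascii", "replace"), offset, size))
    return tag, entries

Each entry then gives you everything needed to extract a file: seek to its offset and read its size in bytes.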

Split Directory Archives

Split Directory Archives are similar in structure to the Directory Archive, with the main difference that there are multiple separate directories rather than a single collective directory. The way the directories are split is totally up to the individual - so here we will present 2 of the more common split types.

This first example is a split directory, where the first directory contains the offsets and lengths, and the second directory contains the filenames:

Archive Header
  4 - Header Tag (String)
  4 - Number of Files
  4 - Files Directory Offset
  4 - Filenames Directory Offset

Files Directory
  File Entry 1
    4 - File Offset
    4 - File Size
  File Entry 2
    4 - File Offset
    4 - File Size
  …
  File Entry n
    4 - File Offset
    4 - File Size

Filenames Directory
  File Entry 1
    X - Filename
  File Entry 2
    X - Filename
  …
  File Entry n
    X - Filename

File Data
  File Data 1
    X - File Data
  File Data 2
    X - File Data
  …
  File Data n
    X - File Data

This second example is a split directory, where the first directory contains the offsets, the second directory contains the lengths, and the third directory contains the filenames:

Archive Header
  4 - Header Tag (String)
  4 - Number of Files
  4 - Offsets Directory Offset
  4 - Lengths Directory Offset
  4 - Filenames Directory Offset

Offsets Directory
  File Entry 1
    4 - File Offset
  File Entry 2
    4 - File Offset
  …
  File Entry n
    4 - File Offset

Lengths Directory
  File Entry 1
    4 - File Size
  File Entry 2
    4 - File Size
  …
  File Entry n
    4 - File Size

Filenames Directory
  File Entry 1
    X - Filename
  File Entry 2
    X - Filename
  …
  File Entry n
    X - Filename

File Data
  File Data 1
    X - File Data
  File Data 2
    X - File Data
  …
  File Data n
    X - File Data

External Directory Archives

External Directory Archives have the same structure as the Directory Archive, however the directory data and the file data are stored in 2 separate files. Naturally, the file that contains the file data is very large, and the directory file very small.

Note that the 2 files both have the same name, but different extensions. The extensions of the files can be anything, however some common extensions for the directory are *.dir, *.fat, and *.idx.

Here is a sample graphic representation of this archive type, where the Example.dir file contains the directory information, and the Example.dat file contains the file data:

Example.dir

  Archive Header
    4 - Number of Files

  Directory
    File Entry 1
      4 - File Offset
      4 - File Size
      X - Filename
    File Entry 2
      4 - File Offset
      4 - File Size
      X - Filename
    …
    File Entry n
      4 - File Offset
      4 - File Size
      X - Filename

Example.dat

  File Data
    File Data 1
      X - File Data
    File Data 2
      X - File Data
    …
    File Data n
      X - File Data

Chunked Archives

Chunked Archives are a simple structure where the files are stored one after the other. Each file has its own header that gives information about the file, particularly the file size. These archives, probably the simplest of all the archive types, are examined by reading the header of the file, skipping the file data, then repeating again for the remaining files until you reach the end of the archive.

One thing to note: these archives typically don’t store filenames, rather they store a 4-byte String that can be treated like the file’s extension.

Here is an example of this archive type:

Archive Header
  4 - Header Tag (String)

Chunks
  File 1
    File Header 1
      4 - File Type (String)
      4 - File Size
    File Data 1
      X - File Data
  File 2
    File Header 2
      4 - File Type (String)
      4 - File Size
    File Data 2
      X - File Data
  …
  File n
    File Header n
      4 - File Type (String)
      4 - File Size
    File Data n
      X - File Data
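Reading a chunked archive is simply a loop of "read the file header, skip the file data" until the end of the archive is reached. Here is a minimal Python sketch following the hypothetical fields in the sample above:

import struct

def list_chunked_archive(path):
    # Returns a list of (file type, data offset, size) for each chunk.
    files = []
    with open(path, "rb") as f:
        f.read(4)                                  # 4 - Header Tag (String)
        while True:
            header = f.read(8)                     # 4 - File Type (String) + 4 - File Size
            if len(header) < 8:
                break                              # reached the end of the archive
            file_type, size = struct.unpack("<4sI", header)
            files.append((file_type.decode("ascii", "replace"), f.tell(), size))
            f.seek(size, 1)                        # skip over the file data to the next header
    return files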

Split Chunk Archives

Split Chunk Archives have the same basic structure of the Chunked Archives, however each file is also split up into chunks. Each file chunk is usually the same size (except for the last chunk in each file), which allows efficient use of buffers when reading the file.

Here is a sample graphic representation of this archive type:

Archive Header
  4 - Header Tag (String)

Chunks
  File 1
    File Header 1
      4 - File Type (String)
      4 - File Size
      4 - Number Of Chunks
      4 - Chunk Size
    File Data 1
      X - File Data Chunk 1
      X - File Data Chunk 2
      …
      X - File Data Chunk n
  File 2
    File Header 2
      4 - File Type (String)
      4 - File Size
      4 - Number Of Chunks
      4 - Chunk Size
    File Data 2
      X - File Data Chunk 1
      X - File Data Chunk 2
      …
      X - File Data Chunk n
  …
  File n
    File Header n
      4 - File Type (String)
      4 - File Size
      4 - Number Of Chunks
      4 - Chunk Size
    File Data n
      X - File Data Chunk 1
      X - File Data Chunk 2
      …
      X - File Data Chunk n

Tree Archives

Tree Archives are the most complicated of the archive types, and thankfully they are not used very often. The idea is that the archive tries to store a complete directory tree structure, such as the individual folders. This is usually done by creating a directory for each folder, and linking them together, as you will see in the example.

Here is a sample graphic representation of this archive type, however there can be many variations:

Archive Header
  4 - Header Tag (String)
  4 - Number of Folders at the root
  4 - Total Number of Files

Folder Entries
  Folder Entry 1
    X - Folder Name
    4 - Number of Sub-folders in this folder
    4 - Offset to the first Sub-folder entry for this folder
    4 - Number of Files in this folder
    4 - Offset to the first file entry for this folder
  Folder Entry 2
    X - Folder Name
    4 - Number of Sub-folders in this folder
    4 - Offset to the first Sub-folder entry for this folder
    4 - Number of Files in this folder
    4 - Offset to the first file entry for this folder
  …
  Folder Entry n
    X - Folder Name
    4 - Number of Sub-folders in this folder
    4 - Offset to the first Sub-folder entry for this folder
    4 - Number of Files in this folder
    4 - Offset to the first file entry for this folder

File Entries
  File Entry 1
    4 - File Offset
    4 - File Size
    X - Filename
  File Entry 2
    4 - File Offset
    4 - File Size
    X - Filename
  …
  File Entry n
    4 - File Offset
    4 - File Size
    X - Filename

File Data
  File Data 1
    X - File Data
  File Data 2
    X - File Data
  …
  File Data n
    X - File Data

As this archive type is quite difficult to explain, I will provide an example here. Let's pretend that our archive contains 3 files, as specified below:

\data\sounds\snd1.wav

\data\sounds\snd2.wav

\data\images\temp\pic1.bmp

The following diagram shows the structure of the archive that contains these 3 files (with the value of each field shown after the field name):

Archive Header
  4 - Header Tag (String): HEAD
  4 - Number of Folders at the root: 1
  4 - Total Number of Files: 3

Folder Entries
  Folder Entry 1
    X - Folder Name: data
    4 - Number of Sub-folders in this folder: 2
    4 - Offset to first Sub-folder: offset to Folder Entry 2
    4 - Number of Files in this folder: 0
    4 - Offset to first file entry: 0
  Folder Entry 2
    X - Folder Name: sounds
    4 - Number of Sub-folders in this folder: 0
    4 - Offset to first Sub-folder: 0
    4 - Number of Files in this folder: 2
    4 - Offset to first file entry: offset to File Entry 1
  Folder Entry 3
    X - Folder Name: images
    4 - Number of Sub-folders in this folder: 1
    4 - Offset to first Sub-folder: offset to Folder Entry 4
    4 - Number of Files in this folder: 0
    4 - Offset to first file entry: 0
  Folder Entry 4
    X - Folder Name: temp
    4 - Number of Sub-folders in this folder: 0
    4 - Offset to first Sub-folder: 0
    4 - Number of Files in this folder: 1
    4 - Offset to first file entry: offset to File Entry 3

File Entries
  File Entry 1
    4 - File Offset: offset to File Data 1
    4 - File Size: length of File Data 1
    X - Filename: snd1.wav
  File Entry 2
    4 - File Offset: offset to File Data 2
    4 - File Size: length of File Data 2
    X - Filename: snd2.wav
  File Entry 3
    4 - File Offset: offset to File Data 3
    4 - File Size: length of File Data 3
    X - Filename: pic1.bmp

File Data
  File Data 1
    X - File Data: the data for file snd1.wav
  File Data 2
    X - File Data: the data for file snd2.wav
  File Data 3
    X - File Data: the data for file pic1.bmp

Let's walk through the reading of this file, referring to the field names and values shown in the example above.

First we read the Archive Header and see that there is only 1 folder at the root. This lets us know that we now need to read a single folder entry.

We read Folder Entry 1, called data, and are told there are 2 sub-folders. The sub-folders start at a certain offset in the archive.

So we skip to the offset of the sub-folders. For each of the 2 sub-folders, we need to read a folder entry. The first folder entry read is called sounds (Folder Entry 2) and there are 2 files in it. The second entry is called images (Folder Entry 3) and there is 1 sub-folder in it.

So we jump to the offset for the first file entry for the sounds folder, and read 2 file entries, namely snd1.wav (File Entry 1) and snd2.wav (File Entry 2). After we have read these, we jump back to where we were. We have finished with everything in the sounds folder, so we move on to the offset of the sub-folders for the images folder. We read 1 folder entry, called temp (Folder Entry 4), which has 1 file in it.

We jump forward to the offset for the first file entry and read 1 file entry, called pic1.bmp (File Entry 3). We know that the total number of files is 3, so now we have finished reading the tree.

Using this method, we can build up a complex directory tree. This type of archive is usually slightly smaller in size than the plain directory archive, because the filenames don’t have to repeat the entire folder string for each entry, however the compromise is that it takes longer to read because you are jumping all over the place. For this reason, and the fact that it is a very complex structure, only a few games use this type of structure.

Nested Tree Archives

Nested Tree Archives, even though the name sounds hard, are a simpler version of the Tree Archive. The idea is the same as the Tree Archive: store a complete directory tree structure, however the Nested Tree Archive can be read more efficiently as there is no jumping around.

Here is a sample graphic representation of this archive type. The diagram is a little tricky to follow, so read the description and pseudo-code following it for a clearer explanation, then try to read the diagram:

Archive Header
  4 - Header Tag (String)
  4 - Number of Folders at the root
  4 - Total Number of Files

Entries
  Folder Entry 1
    X - Folder Name
    4 - Number of Sub-folders
    4 - Number of Files
    Sub-Folder Entries 1
      Sub-Folder Entry 1a
        X - Folder Name
        4 - Number of Sub-folders
        4 - Number of Files
        Sub-Folder Entries 1.1
          Sub-Folder Entry 1.1a
            X - Folder Name
            4 - Number of Sub-folders
            4 - Number of Files
          …
          Sub-Folder Entry 1.1n
            X - Folder Name
            4 - Number of Sub-folders
            4 - Number of Files
        File Entries 1.1
          File Entry 1.1a
            4 - File Offset
            4 - File Size
            X - Filename
          …
          File Entry 1.1n
            4 - File Offset
            4 - File Size
            X - Filename
      …
      Sub-Folder Entry 1n
        X - Folder Name
        4 - Number of Sub-folders
        4 - Number of Files
    File Entries 1
      File Entry 1a
        4 - File Offset
        4 - File Size
        X - Filename
      …
      File Entry 1n
        4 - File Offset
        4 - File Size
        X - Filename
  …
  Folder Entry n
    X - Folder Name
    4 - Number of Sub-folders
    4 - Number of Files
    Sub-Folder Entries n
    File Entries n

File Data
  File Data 1
    X - File Data
  File Data 2
    X - File Data
  …
  File Data n
    X - File Data

The diagram may seem difficult, but that is mostly due to the fact that it is nested, meaning that it can have as many directories-inside-directories as you like.

This is the way it works. You first read the Archive Header, and see how many folders at the root there are. Usually there will only be 1 folder at the root.

So you read 1 Folder Entry, and find out the number of sub-folders, and the number of files. For every sub-folder, you repeat again from Folder Entry 1. When you have read all the sub-folders, you then read all the File Entries.

If you can read pseudo-code, here is the kind of thing I am trying to describe:

method readArchive() {
    read(FolderEntry);
    for each (sub-folder) {
        readArchive();
    }
    for each (file) {
        read(FileEntry);
    }
}

So, you begin by reading a FolderEntry. If the entry has sub-folders in it, you must immediately read the entries for those sub-folders, by repeating the process from the beginning. When all the sub-folder entries for this FolderEntry have been read, you can then progress and read the FileEntries for the folder.
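Here is the same idea as a runnable Python sketch. The field sizes follow the hypothetical diagram above, and the names are assumed to be null-terminated strings - a real format will differ:

import struct

def read_string(f):
    # Read a null-terminated ASCII string (an assumption - real formats vary).
    name = bytearray()
    while (b := f.read(1)) not in (b"", b"\x00"):
        name += b
    return name.decode("ascii", "replace")

def read_folder(f, parent=""):
    # Read one FolderEntry, then recurse into its sub-folders, then read its files.
    path = parent + read_string(f) + "\\"                    # X - Folder Name
    num_subfolders, num_files = struct.unpack("<II", f.read(8))
    entries = []
    for _ in range(num_subfolders):
        entries += read_folder(f, path)                      # recurse, exactly as in the pseudo-code
    for _ in range(num_files):
        offset, size = struct.unpack("<II", f.read(8))       # 4 - File Offset, 4 - File Size
        entries.append((path + read_string(f), offset, size))
    return entries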

Checking Your Results

Common Types Of Fields

Archives can literally contain fields for just about any purpose, however it helps to know some of the more common fields so you know what to expect.

The following fields are very common in archives, so there is a good probability that you will run into most of these.

  • File Size
  • File Offset
  • Number Of Files
  • Header Tag
  • Filename
  • Padding Multiple

The following fields occur in some archives, but with significantly lower probability than those listed above.

  • First File Offset
  • Archive Name
  • Filename Offset
  • Filename Directory Offset
  • Total File Data Size
  • Total Directory Size
  • Archive Size
  • Number Of Directories
  • Directory Offset
  • File Extension / Type
  • File ID
  • Archive Version
  • Filename Length
  • Decompressed File Size
  • Checksum
  • Timestamp

File Size

This field gives the length of the data for a particular file. Sometimes, if each file has its own File Header, this length is included in the file size, so you may need to do some minor subtractions to get the length of the file data only.

The File Header is any fields that appear just before the data for each file. Most archives do not have File Headers, but if they exist they will usually only be 1 or 2 fields long.

File Offset

This field tells you the position of the file data in the archive. Depending on the archive, this field will be either an absolute offset, or an offset relative to a certain position. If it is relative, you will need to add some value to the offset in order to obtain the correct value.

An absolute position is the exact offset from the beginning of the archive. In other words, if you go to this offset, you will be at the start of the file data.   A relative position means that the offset is relative to a certain position. If, for example, the offset is relative to the position 2048, then you would need to add 2048 to the offset to get the location of the file data. If this is the case, there will usually be a field that tells you the relative position.

Number Of Files

This field is one of the more important fields - without knowing how many files there are in the archive, you wouldn't know when to stop reading. Almost all archives have this field, usually in the Archive Header.

The Archive Header is the fields that occur at the very beginning of the archive, and contain general information such as the number of files, the header tag, and the length of the file data.

Some archives that do not have this field, instead have a Directory Length field, that tells you the length of the entire directory. If each directory entry is a fixed length, then you can find out the number of files by doing DirectoryLength / EntryLength. For example, if each directory entry is length 28 and the entire directory has length 280, then there are 280 / 28 = 10 files in the archive.

For External Directory archives where the directory entries are a fixed length, the number of files is the DirectoryFileLength / EntryLength. For example, if the file that contains the directory is 580 bytes long, and the file entries are 58 bytes long, then there are 580 / 58 = 10 files in the archive.

Header Tag

This field serves as a way to identify the particular archive format used by a file. It is typically read as a string, and is usually in upper-case characters, although it is occasionally a group of preset byte values. This field is almost always at the very start of the file, but rarely it will be the second or third field.

Although there is no guarantee that a file with a particular header tag is a specific format, it is usually a reliable assumption. Some common header tags are found in Appendix 5.

Filename

When a game loads a resource into memory, it needs a way to uniquely identify the file amongst all others. A common way to do this is to store the filenames - a filename is guaranteed to be unique, and it allows people reading the archive to know the file's purpose.

A filename is usually an ASCII string, but can occasionally be Unicode, particularly in big budget games that will be released internationally, or games that are developed in non-English speaking countries.

Filenames are obviously arbitrary in length, so they often prove to be an annoying thing to deal with. These are some of the common ways filenames are stored in an archive:

  • Fixed Length String

The game developer may have specified a fixed length for all filenames, in which case the filename can just be stored normally. Usually, if this is the case, all filenames will be exactly 12 characters long (which allows for 8 name characters, the dot, and 3 extension characters).

  • Fixed Length String with Padding

A fairly common technique is to specify a maximum size for a filename. If a filename is too short, the remaining bytes will be nulls.

An example: we have specified that the maximum filename length is 20 characters. Our filename is example1.dat, which is only 12 characters long, so the remaining 8 bytes are all nulls.

Very rarely, the filename will be terminated with 1 null byte, and the remaining space filled up with random "junk" bytes. This can make it difficult to analyse the archive, but doesn't make it any harder to read by a program.

An example: the maximum filename length is 20 bytes. Our filename is example1.dat, which is 12 characters long. We would write the filename, followed by 1 null byte to indicate the end of the filename. The remaining 7 bytes can be assigned randomly.

  • Null-terminated String

The filename is stored normally, with a single null byte at the end of the name. Therefore, to read the filename, just keep reading until you reach a null byte.

  • Null-terminated String with Padding

Like a normal null-terminated string, there is a null byte at the end of the filename. However, buffers are often used to read archives efficiently, so the filename may be padded. The padding multiple is usually 4 bytes, but can be any value. What this means is, if the FilenameLength + 1 for the null terminator byte is not a multiple of 4 bytes, additional null bytes are added to make it the correct multiple.

An example: if our filename is example1.dat, we can see that the length is 12. Add 1 for the null terminator, and our length is 13. 13 is not a multiple of 4, so we need to add 3 more null bytes to make it a length of 16 (which is a multiple of 4). Therefore, to store this filename, it would be example1.dat followed by 4 null bytes (1 for the null terminator, and 3 to pad it out to a multiple of 4 bytes).

  • Byte-terminated string

A variation to the Null-terminated String technique is where the filename is terminated using a byte other than null. One such common case is the use of byte 32 as the terminator (which is the space character). This is relatively rare.

  • Filename Length Field

Some archives make it nice by adding an extra field before the filename that tells us how long it is. The field will usually be either 1, 2, or 4 bytes long, and is almost always the field just before the filename.

Sometimes the filename field itself will still contain a null terminating byte, which may or may not be included in the filename length field. The filename length field may also be used in conjunction with any of the other filename techniques mentioned above.

  • Filename Offset Field

If the filenames are stored in a separate directory, a common technique is to store the offset to the filename along with the other fields like the file offset and file size. When you go to the filename offset, you would read the filename using one of the methods above, usually a null-terminated string.

The filename offset field is almost always a relative offset - relative to the start of the filename directory - so you would need to add the filename directory offset to this value in order to locate the actual filename.

Unicode strings are read the same as normal strings, using the techniques above. There are some things to watch out for though.

If there is a Filename Length field, the value will usually be the number of characters, not the number of bytes. Therefore, you need to multiply the value by 2 to get the actual number of bytes to read.

If the string is null terminated, it will usually be terminated by 2 null bytes, not 1. This is because Unicode strings are read 2 bytes at a time, so therefore you need the extra null byte.

An alternative to the storage of the filename, is to store a Hash. A hash is a unique value that is calculated from a series of input bytes. Most hashes are either generated from the filename of a file, or from the file data itself. The hash calculation generates a different value according to the bytes it receives, and the order of the bytes.

The benefit of a hash is that it is quick and efficient to read. A hash always has the same number of bytes regardless of the input to the hash function, so it always has the same length when stored in a directory. The hash is almost guaranteed to be unique, so there is very little risk of 2 different files having the same hash. Finally, a hash is a number rather than a string of characters, so a computer can store it and look it up more efficiently.

Hash fields are usually 4 bytes, but 8 or 16 bytes are also common in some archives. They are easily identifiable because they will appear as random bytes that don’t equate to any usable value. For example, when you determine the value of the hash field, it will be obvious that it isn’t a file length or file offset because it will usually be a very large value.

Hashes don’t usually contain null bytes either, so they stand out when stored in a directory with fields like file length or file offset (which have many null bytes in them).

The downside with hashes is that once you have a hash, there is no way to convert it back into a filename. Therefore, if the archive uses hashes, you won’t easily be able to tell the type or purpose of each file, making it hard to do anything with them.
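
To illustrate the idea (and only the idea - every game picks its own algorithm, and this one is made up purely for the example), here is a toy 32-bit filename hash in Python:

  def toy_filename_hash(filename):
      # A made-up multiplicative hash: every byte of the name affects the
      # result, and the output is always a 4-byte number, however long the
      # name is. Real games use their own algorithms (CRC32, FNV and so on).
      value = 0
      for byte in filename.lower().encode("ascii"):
          value = (value * 31 + byte) & 0xFFFFFFFF
      return value

  print(hex(toy_filename_hash("textures/wall01.bmp")))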

Padding Multiple

In order to read archives efficiently, padding can often be used. We have seen a form of padding surrounding filename fields, as discussed a little earlier, however the padding we are talking about here is file data padding. Padding basically says that all file data will be stored in an archive in blocks of a certain size.

Padding, when used in conjunction with buffering in the game, is both quick and efficient. Without buffering, every time you read a byte from a file, the hard drive needs to start up, locate the file, and read the byte. This is a very slow process, in comparison to most other computing - the hard drive is one of the slowest things in a computer.

When buffering is used, the game will still read data from the hard drive, but it will read it in large blocks rather than individual bytes. Therefore, the hard drive does not have to start up and find the file data as often, which increases the speed of the program.

Padding works in conjunction with buffers to provide this speed-up. If we have a buffer size of 2048 bytes, then we would build our archive so that each file is padded to a multiple of this value. Thus, when reading a file, we would continually read blocks of 2048 until we have the entire file.

Most files do not have a size that is a multiple of the padding size, therefore we add padding bytes to the end of the file data to increase it out to the correct padding multiple. The padding bytes are usually nulls.

In an archive, padding can usually be identified easily. One way is to look at the space between the end of one file, and the start of the next file. If they end and start at the same offset, there is no padding, but if there is a big section filled with (usually null) bytes then there is padding.

Padding can also be identified in the directory. If the file offset fields have a null byte for their first byte, or if the offsets are all even numbers, there is a high probability of padding. Also, if you add the first file offset and the first file size, but they do not equal the second file offset, then padding is probably used.

Padding is used significantly in XBOX games, and occasionally in some other game archives. The most common padding multiple is 2048, which is the one used in XBOX games. Some other padding sizes, although much rarer, include 512, 128, 64, or 32 bytes, however the padding could literally be any size that the archive developer chooses.

An example of padding. Let's assume that we have a padding multiple of 2048 bytes, and one of our files has a size of 5160. The size 5160 is not an even multiple of 2048, therefore we will need to use padding. Firstly, we store the file normally and completely. Then, we need to determine the size of the padding. The next multiple of the padding larger than 5160 is 6144 (2048 x 3), so we need 6144 - 5160 = 984 padding bytes. So, after the file data, we would write 984 null bytes, thus increasing the size of the file to a multiple of the padding size.
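
The same calculation in a short Python sketch (our own illustration), using the numbers from the example:

  def padded_size(file_size, multiple=2048):
      # Round the size up to the next multiple of the padding value.
      remainder = file_size % multiple
      if remainder == 0:
          return file_size
      return file_size + (multiple - remainder)

  print(padded_size(5160))          # 6144
  print(padded_size(5160) - 5160)   # 984 padding bytes, as above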

Validating Your Fields

When you think you know what a field means, it is important to validate your findings. At the very least, you can write a program for your format specifications, and see whether it works. However, there are some other methods to check your specifications as you are writing them.

One of the simplest to check are the file offset and file size fields. If you are presented with these two fields, simply add the first file offset to the first file size and see that it matches the second file offset. Repeat this a few times, and you can pretty much guarantee that those 2 fields are correct. If the archive uses padding, the first offset + first size will be slightly less than the second offset - if they are close but not quite equal then you have probably got it.

Another field that is easy to validate is the file offset field, if you choose a good archive. All you need to do is go to the offset for each file and see if it points you to a known file header. For example, if you open a sound archive, then a good header to look for is RIFF as it indicates a *.wav sound file. Similarly, if opening a texture archive, look for common image headers such as BM (*.bmp), GIF (*.gif) and JFIF (*.jpg). So if you pick the right archive, you can see whether the file offset field is correct.

If you think that an archive compresses its files, and you have found the file size field, try looking for a decompressed file size field for each entry – simply look for a field that is always a little larger than the size for the file.

Note that the best general compression techniques still aren’t particularly successful. Therefore, even though the decompressed file size field will be larger than the file size field, it shouldn’t be exceptionally large. There are many other fields like hashes that generate large numbers, but these fields will have values that are much larger than any compression algorithms can achieve.

If you locate a directory in the archive, try to find a constant file entry size if possible. For example, if each file entry contains a file size field and a file offset field, then each file entry has a size of 8 bytes. Once you know this, you can then determine the size of the directory by finding the offset to the end of the directory and subtracting the offset to the start of the directory. When you divide the directory size by the file entry size, you will be able to find out the number of files in the archive. This number may be stored in the archive somewhere, usually at the start of the archive, so look out for it. Fields for the start of directory offset, end of directory offset, file entry size, or directory size might also be included in the archive header.
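
As a rough illustration, here is a Python sketch of these two checks (our own sketch - the 8-byte entry of file offset plus file size is just the hypothetical layout from the paragraph above):

  def offsets_and_sizes_look_right(entries, max_gap=2048):
      # entries is a list of (offset, size) pairs read from a suspected directory.
      # If each offset + size lands exactly on the next offset, or only slightly
      # before it (padding), the field guess is probably correct.
      for (offset, size), (next_offset, _) in zip(entries, entries[1:]):
          gap = next_offset - (offset + size)
          if gap < 0 or gap > max_gap:
              return False
      return True

  def number_of_files(directory_offset, directory_end, entry_size=8):
      # With a fixed entry size, the file count is directory size / entry size.
      return (directory_end - directory_offset) // entry_size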

Encryption and Compression

When an archive is being developed, there are many different things that need to be considered. If an archive is very large, and you want it to be stored efficiently, it is common to compress the files, thus reducing their size on the hard disk. If the archive contains copyright material, or you don’t want people to access the archive contents, you can apply encryption to the archive or its files. Here we will present you with some of the ways to identify the techniques used by the archive, and how to work around them.

Please note that compression and encryption are very complex problems, and there are literally thousands of different ways to do this stuff. Here we try to give you a brief understanding, and some helpful pointers in the right direction, but this will be in no way complete or comprehensive. If you are interested in encryption or compression specifically, we suggest you locate a book or website devoted to this topic solely.

Bitwise Operations

Bitwise operations can be regarded as simple logical steps where a comparison is made between 2 bytes, resulting in a single new byte. You can think of this like any normal mathematical function, such as addition, where you are given 2 numbers and end up with 1 result. The primary operations are AND, OR, and XOR, however there are also some relatively uncommon operations such as NOT, SHL and SHR.

Bitwise operators are performed on bytes, but the comparison occurs at the bit level. Most, if not all, programming languages have pre-assigned methods for performing bitwise operations, which means that the programmer doesn’t need to do anything manually at the bit level.

The AND operation sets a bit to true only if both bits are true. The table of possible values is as follows:

0 AND 1 = 0

1 AND 0 = 0

0 AND 0 = 0

1 AND 1 = 1

Example: 12 AND 123

00001100 (12)

01111011 (123)

(AND)

00001000 (8)

The OR operation sets a bit to true if there is any true bit in the operation. If there are no true bits, the value is false. The table for this is shown below:

0 OR 0 = 0

1 OR 0 = 1

0 OR 1 = 1

1 OR 1 = 1

Example: 12 OR 123

00001100 (12)

01111011 (123)

01111111 (127)

The Exclusive OR (XOR) operation sets the resulting bit to true only if there is one true and one false bit in the operation. It can be thought of as being true if the bits are different, or false if they are the same. XOR is probably the most common bitwise operation, as it can be reversed by doing the same thing again.

Here is the table for this operation:

0 XOR 0 = 0

1 XOR 0 = 1

0 XOR 1 = 1

1 XOR 1 = 0

Example: 12 XOR 123

00001100 (12)

01111011 (123)

01110111 (119)

If we have the value 119, we can now XOR it with 123 again to get our original value 12, thus the reason for its popular use.

Example: 119 XOR 123

01110111 (119)

01111011 (123)

00001100 (12)

The NOT operation, and the operations following, apply to only one byte. The NOT operation gives the opposite of the bits value, as in the following table:

NOT 0 = 1

NOT 1 = 0

Example: NOT 12

00001100 (12)

11110011 (243)

Like XOR, this can also be performed again to retrieve the original value.

Example: NOT 243

11110011 (243)

00001100 (12)

The shift-left (SHL) operation moves all the bits to the left by a certain amount. The maximum shift for a byte is 7, because when you shift by 8 bits it doesn’t actually change.

Example: SHL 51 by 1

00110011 (51)

01100110 (102)

Note that the bits are cycled around. Effectively, for each shift-left, you just take the left-most bit and put it at the end. So, if you do a shift-left of 3, you cut the first 3 bits from the value and put them at the end. (Strictly speaking, this cyclic version is a rotate - a plain shift simply discards the bits that fall off the end, as noted later.)

Example: SHL 51 by 3

00110011 (51)

10011001 (153)

The shift-right (SHR) operation is the same as shift-left, except that the bits are moved right instead of left.

Example: SHR 51 by 1

00110011 (51)

10011001 (153)

Like before, the shift is cyclic, so the bits are cut from the end and placed at the beginning.

Example: SHR 51 by 3

00110011 (51)

01100110 (102)

Purpose and uses of Bitwise Operations

Bitwise operations, as will be discussed further, are one of the main underlying principles to encryption, however they have other purposes:

  • SHR can be used to quickly divide by the number 2, and SHL can multiply by 2, provided you discard the bits that fall off the end instead of moving them cyclic.
  • Images and graphics card hardware can store colors more efficiently by using SHR and SHL operations, with the sacrifice of a small amount of color detail. If we are using 32-bit color, we assign 1 byte to each of the RGBA colors (red, green, blue, alpha). However, if we are using 16-bit color, we need to store all RGBA colors in only 2 bytes. To do this, we can store red in the first half of byte 1, and green in the second half. Similarly with the blue and alpha colors. To access the specific color values, we would just do a SHL or SHR by 4.
  • AND, XOR, and OR are used extensively in most encryption techniques, and in some compression techniques, such as the lossy compression used by some image formats.

The XOR operation is easily the most common of the bitwise operations - it can effectively scramble file data so that it appears unreadable, and can be done by the underlying CPU almost instantly, so it is quick and effective. It is a very good bet that XOR will be used somehow by the encryptions you are likely to encounter in game archives, so this should always be your first port of call.
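
A quick Python sketch (purely illustrative) showing why XOR is so popular - the same operation both scrambles and unscrambles the data:

  def xor_with_key(data, key):
      # XOR every byte with the same key value. Running the function a second
      # time with the same key restores the original data.
      return bytes(b ^ key for b in data)

  original = b"example.dat"
  scrambled = xor_with_key(original, 109)
  print(scrambled)                     # looks like junk
  print(xor_with_key(scrambled, 109))  # b'example.dat' again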

Encryption

Firstly, let's answer a very simple question: what is encryption? Encryption is a collection of techniques that can be applied to any type of data, with the purpose of hiding the actual data from ordinary users. It can be thought of like a password - in order to read the data you need to first know the password. Once you know the password, the data is converted into normal readable information. Our "password" is a combination of things such as the actual encryption method, the values that need to be used, and other factors such as the archive structure.

So, in a more practical sense, encryption techniques are applied to the data of an archive or file, converting it into something that is unreadable. The game is able to reverse this process to retrieve the actual file data - this is what we want to do too.

Encryption is a very old process, and it isn’t restricted to computer data - in fact you probably would have been doing encryption yourself when you were a child. One example is using lemon juice to write invisible messages on a piece of paper. Another example is to write a message using a code, such as the following:

A=B, B=C, C=D, … X=Y, Y=Z, Z=A

In other words, instead of writing the word CAB, you would write DBC. In order to get the actual word back again, you just need to reverse the process:

A=Z, B=A, C=B, D=C, … Y=X, Z=Y

Which will give you the original CAB message back again.

This process is similar in most game archive encryption techniques, however it will be applied to all possible byte values rather than restricting it to the English alphabet.

How to tell when data is encrypted

When data is encrypted, the bytes will be changed so that they no longer resemble the original data. In other words, the data will appear as gibberish or junk. Therefore, it can be hard to locate encrypted data, as it will look like any normal file data.

The first thing to do is try and find a directory for the archive. If you cannot find the directory, it may be encrypted, compressed, or might not exist at all. If this is the case, you can’t really progress much further, other than to experiment with some different encryption techniques yourself in case you find something.

In many archives that use encryption, you will be able to find a directory, but the filenames are encrypted. You can read the file offset and size fields, but the filename makes no sense. Filenames are usually encrypted to stop people from working out what the files are used for, which makes the names effectively useless to us. Thankfully, most filename encryptions can be broken easily enough, partly by exploiting the properties of the filename itself, which will be discussed further on. Encrypted filenames are identifiable because you can clearly see the directory, but there is a block of bytes in each directory entry that doesn’t make any sense.

The other main technique is to encrypt the actual file data. This can sometimes be a problem, but it is made easier if you know what the file data is supposed to represent. For example, if there are filenames in the archive, and you know the file you are looking at is supposed to be a *.wav file, then you would expect to see the header tag RIFF at the start of the file data. If you don’t, maybe the file has been encrypted. If you do not know what the file data is supposed to be, you can’t easily tell whether encryption is used or not.

As you can probably see so far - half the information needed to break encryption is knowing what the data is supposed to look like in its unencrypted form.

Some games can help with this process - if the game keeps a cache file of the resources it uses (such as in the game Half-Life 2 which has files called *.cache), you may be able to find the raw data for some of the files. If you are able to match a file in the unencrypted cache to a file in the archive, then all you need to do is find a way to transform the encrypted data into the unencrypted data.

If you have the raw data for a file, but you don’t know what file in the archive it matches to, then you can find it relatively easily by searching through the archive directory for a file of the same size - there shouldn’t be too many.

There are many other possible sources of unencrypted files:

  • Game update files or patches may be unencrypted, especially for games by smaller companies
  • Demos of beta-versions of the game
  • If you can use a memory reader, you could run the game so that it loads the resources, then grab the unencrypted data from the computer's memory.
  • Many games allow you to put files outside the archive and it will use them in place of the same file in the archive, so if you can find a file in the game directory with the same name as a file in an archive, you may have found a match.

How to break basic encryption

If you have found data that is almost certainly encrypted, you naturally want to decrypt it so that it can be used. Here are some techniques to try, especially if you cannot obtain the unencrypted form of the data. These will be shown as if decrypting a filename, as filenames are the easiest and most common target of encryption, however a similar process can be used for decrypting anything - provided you know something about the file to begin with. As mentioned earlier, if you were trying to decrypt a file you suspected was a *.wav audio file, you could try these techniques on the first 4 bytes of the file data, which you know must be RIFF in order to be a *.wav audio file.

Encrypted filenames have properties that make them relatively easy to decrypt. For example, filenames almost always have a "." character in them, and it is usually 4 characters from the end of the name. Using this, you can assume that the 4th encrypted character from the right is going to be the ".", so you just need to find the way to change it over.

Single-value Encryption

If every 4th-last character is the same byte value, it is probably a simple XOR encryption with a set byte. For example, if the 4th-last character is always the byte 67, then you would find the encryption using the following:

01000011 (the encrypted character, byte 67)

00101110 (the "." character, byte 46)

(XOR)

01101101 (the XOR value, byte 109)

You can therefore XOR every character in the filename with byte 109, and the filename will be decrypted.
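
As a sketch of this process in Python (our own illustration, using the byte values from the example above):

  def find_single_value_key(encrypted_name):
      # The 4th-last character should decrypt to "." (byte 46), so XORing the
      # encrypted byte at that position with 46 reveals the key.
      return encrypted_name[-4] ^ 46

  def decrypt_single_value(encrypted_name, key):
      return bytes(b ^ key for b in encrypted_name)

  # Round trip with the key from the example (byte 109):
  encrypted = bytes(b ^ 109 for b in b"example.dat")
  key = find_single_value_key(encrypted)
  print(key)                                   # 109
  print(decrypt_single_value(encrypted, key))  # b'example.dat'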

If we know each filename has a "." for the 4th-last character, but the 4th-last character of different filenames does not have the same value, then there are usually 2 possibilities - either the encryption is offset-based, or it uses a repeating group of bytes.

Offset-based Encryption

Offset-based encryption is where each byte is XORed with its own offset in the file. So, for example, if the "." in the filename is at offset 167, then that byte will have been XORed with the value 167. This is relatively easy to discover, and is actually a specialised version of the following technique, but it is also easy to detect and decrypt.

Repeating Group Encryption

This is one of the harder encryption types, and occurs by basically having a set of bytes that are used over and over for the XOR calculation. For example, your repeating group might be the bytes 34, 156, 16, 234. In this case, the 1st, 5th, 9th, etc. bytes of file data will be XOR with byte 34. Similarly, the 2nd, 6th, 10th, etc. bytes of file data will be XOR with 156.

To determine the XOR pattern, you would basically use the "." character of the filename again. Assuming that every 4th-last character is a ".", you may find that the bytes at every 4th-last position are different. All you need to do is XOR these values with the byte value for the "." character.

An example: Let's say our original filenames are the following:

Sky.wav

File.dat

Image.gif

Ground.jpg

In our archive, the filenames are encrypted. We have determined that the 4th-last character of each encrypted filename is as follows: 99,145,70,99.

Note that the positions of the "." in our filenames are 4, 5, 6 and 7, and that our sampled byte values for the first and last filenames are both 99. This would suggest that the repeating pattern is 3 bytes long, because those two positions are 3 apart and they have the same XOR value.

Now, we need to XOR all the encrypted samples we took against the byte value for the "." character, which is 46. Doing that, we get the following:

99 XOR 46 = 77

145 XOR 46 = 191

70 XOR 46 = 104

Thus, our repeating pattern is 77,191,104. Now we can decrypt our filenames using this repeating pattern.
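
A Python sketch of the same idea (our own illustration), using the 3-byte pattern we just derived - note that because XOR is its own inverse, the same function both encrypts and decrypts:

  def xor_with_repeating_key(data, key_bytes):
      # XOR byte N of the data with key byte (N mod pattern length).
      return bytes(b ^ key_bytes[i % len(key_bytes)] for i, b in enumerate(data))

  key = [77, 191, 104]

  encrypted = xor_with_repeating_key(b"Sky.wav", key)
  print(encrypted[3])                           # 99, the value we sampled for the "."
  print(xor_with_repeating_key(encrypted, key)) # b'Sky.wav'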

How to break complex encryption

If you haven’t had any success thus far, it is possible that the encryption is more complicated. This means that it would probably be easier to find a reference to the encryption technique, rather than brute-forcing the encryption as we were above.

Looking for executable strings

Disassembler programs are tools that allow you to look at the contents of an executable application. In other words, they can show you what happens in an *.exe file. Even though it is often hard to use a disassembler, good ones will have a feature that extracts the strings from an executable. If you can obtain the strings, you can quickly browse through them for an indication of the encryption, or even things like unencrypted filenames.

If, for example, you found a filename called pistol.bmp, you could then know that there is at least 1 file in the archive called pistol.bmp, which could help you with some of the techniques in the previous section.

If you can find any strings such as copyright statements, websites, or names of people, they could all help you locate the actual encryption method. You could also find hints inside general game files such as the readme, or in the game credits. Once you have a piece of this information, it is just a matter of locating the technique, such as by doing a web search.

Reverse Engineering the encryption method

Disassemblers, although hard to use, can help with encryption that is very complex. If you can pinpoint the location of the encryption code within the executable, it will either tell you exactly how to do it, or at least help you understand how it works.

A worked example: Painkiller *.pak encryption

Note that this is a fairly complex encryption, but isn’t too hard to determine.

When the developers of the game Painkiller first released a demo, fans quickly discovered that the game used adapted PKZip files to store the resources. However, the coders obviously did not want fans to have access to the resources, because in the second demo and the first retail version, they used a more "difficult to hack" method of storing resources.

While they used straight-forward fields such as the file size and file offset variables, they encrypted the filenames of the resources. In addition, they changed the compression method of the resources to Zlib compression. Without too much trouble, people had managed to work out the whole format, and could extract the resources, but they couldn’t decrypt the filenames.

In the archive scripts.pak, the first few encrypted filename strings are as follows:

cLHLCB.P[\XIM.m,)

lOOO\A.[WVSJ/4j/& _

iJRRYD.IR\*&1/&'(a8=<

Ok, now we know that the strings represent some kind of encrypted text. We also assume these strings are in fact filenames. Filenames usually have the following structure:

Directory\filename.extension

The number of directories may vary, and the extension is usually 3 bytes in length.

Now take a good look at the strings. You will notice that they all start with a single byte of lower-case text, followed by several characters of upper-case text, then a symbol. This maps easily onto the filename structure shown above, i.e. a capital character, followed by lower-case characters, followed by a backslash.

In ASCII, symbols are mostly separated from the normal English characters, so we can assume that the symbols in our encryption will probably refer to a directory slash - either a \ or a /.

Close examination of the other characters in the strings reveals that the last three characters of each string are preceded by a byte value in the 96+ range. In our standard filename structure, the last three characters commonly represent the extension of the file, with the character before the extension always a ".", so it’s safe to assume the following structure for the encrypted strings:

<Capital, lower-case>\<lower-case>.<lower-case>

For the first encrypted string, it may be represented as follows:

<cLHLCB>\<P[\XIM>.<m,)>

Well, this is very nice, but we still don’t know what cLHLCB stands for. Let’s consider the usual XOR technique. We know that we have a \ or /, and a ".", in our strings. We can use this to find out the values used to XOR these bytes. For example, XORing the "." character with the equivalent character in our first filename gave us the byte value 67. The \ character gave us the byte value 64, and the / character gave us 51.

Now we don’t know whether the \ or the / is used for directories, so we should see what we can determine. The distance between the slash and the dot in our first filename is 8 characters. The difference between the XOR value for the "." and the XOR value for the \ is 3, and between the "." and the / it is 16. Note that 16 = 8 x 2, so maybe it is as simple as adding two to the XOR value for each subsequent byte?

We can test this by starting from the forward slash and XORing the string characters to the last character in the string. Thus, we will use byte 51 for the forward slash, then use 53, 55, 57, 59, 61, 63, 65, 67 (our .), 69, 71 and 73 respectively for the next characters:

.P[\XIM.m,) = /electro.ini

Aha! So, it is indeed like that. Now we can do this for the rest of the string:

cLHLCB.P[\XIM.m,) = Decals/electro.ini

So the first string is successfully decrypted! In this first string, we started by using byte 39 for our XOR, but using this same starting point the next filenames are not decrypted properly. Therefore, there must be a different starting point for each filename, and some way to determine what that starting point is.

We check the next 2 filenames using the method we used for the first, and find the following:

lOOO\A.[WVSJ/4j/& _ = Decals/molotov.ini

iJRRYD.IR\*&1/&'(a8=< = Decals/rockethole.ini

Using this, the starting byte for the second filename is 40, and 45 for the third. We assume that the game can somehow calculate the starting points based on variables in the archive. When you look at the directory of the *.pak file, you will see file offsets, file sizes, and the length of the filename, among other things. Now compare the lengths of the filenames: the first is 18, the second also 18, and the third is 21. Compare this to the starting points, which were 39, 40, and 45. Notice how the first and second filenames have equal length, and their starting points differ by only 1.

To start with, let’s propose that the way to determine the start point is somehow related to the length of the filename. Filename 1 and 2 both have the same length, so in theory they should end up with the same starting point, however the second filename has a starting point 1 higher than the first filename.

Well, perhaps the method takes into consideration the position of the resource in the file (the first one is file 1, the second file 2 and so forth). When the starting point is calculated, the final value could be incremented by the position in the file. This way the difference between the starting points of filenames 1 and 2, which have the same length, would indeed be 1. This then implies that the value used to calculate the starting point for the first string would actually be 39-1=38, the second 40-2=38 and the third 45-3=42.

Assuming the above is correct, then how can we obtain the value of 38 for the first string? This is not easy, but we just try a number of ideas. The length of the first filename is 18. If we do a SHL of this number we get 36. This is rather close to 38, isn’t it? How about the third filename? A SHL of this gives 42. This is exactly the value we need for the third filename.

Thus, if we calculate it this way for the first and second filenames, we get a value that is 2 lower than what it should be, and for the third the difference between our calculation and what it should be is 0. Perhaps we are mistaken, and the method is different? Well, we are trying to understand, and obviously we haven’t cracked it completely. We will stay on track though and keep the shift-left, as it is rather close to the actual starting point value.

More information is needed at times like this, and it is advisable to apply your proposed methods to many cases of whatever is encrypted. In our case, we must check more strings to map the potential differences from the SHL values. We will not show it here, but we’ll present the range of differences that you will find if you do so. We find that our SHL of the length variable differs from the starting point value within this range:

SHL(length) - starting point = {-2, -1, 0, 1, 2}

So the difference ranges from 2 less than the starting point to 2 more than the starting point.

Apparently, the encryption process has some way of telling when to add or subtract these values from the result of the SHL operation, as if looking them up from a table. What would this table look like, and how would the game know where to look in it? The only way to get a hint of this process is by examining multiple strings and comparing the starting points with the length variable, as we will assume that the length is also needed to look up the value from the unknown table (filename 1 and filename 2 only have the length in common, which suggests the encryption uses only this variable).

So, make a table from a size variable of 0 upwards. Look at the strings and find the starting point, write down the range value it used (-2 or -1 and so forth). Well, you won’t find filenames of length 0, but just fill in those that you do find.

If you do this, you will discover the following table:

Size  Code
0     -2
1     -1
2      0
3      1
4      2
5     -2
6     -1
7      0
8      1
9      2
10    -2
11    -1
12     0
13     1
14     2
15    -2
16    -1
17     0
18     1
19     2
20    -2
21    -1

and so forth.

So you can see a nice pattern here. Let's take the first string. It has a length of 18. That means the code will be 1. The starting point will therefore be calculated like this:

First file:

SHL(18) + 1 (from file 1) + 1 (Code) = 39

Second file:

SHL(18) + 2 (from file 2) + 1 (Code) = 40

Third file:

SHL(21) + 3 (from file 3) - 1 (Code) = 45

These are the XOR values that will be used on the first character of the filename. For each subsequent character this value will be incremented by 2.
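
A small Python sketch (our own illustration) of the per-character step - given a starting value, each character is XORed with a value that increases by 2. Working out the starting value for an entry follows the SHL / file position / code rule described above:

  def decrypt_painkiller_name(encrypted, start_value):
      # XOR the first byte with start_value, the next with start_value + 2,
      # then start_value + 4, and so on (masked to stay within a byte).
      out = bytearray()
      for i, b in enumerate(encrypted):
          out.append(b ^ ((start_value + 2 * i) & 0xFF))
      return out.decode("latin-1")

  # Round trip with the first filename and starting value from the example:
  name = "Decals/electro.ini"
  enc = bytes(ord(c) ^ ((39 + 2 * i) & 0xFF) for i, c in enumerate(name))
  print(decrypt_painkiller_name(enc, 39))   # Decals/electro.ini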

Encryption methods are many, but so are your brain cells. The Painkiller *.pak filename encryption is just one of many ways to encrypt, and there’s no universal tool or process to decrypt all of them. As the uncovering of this encryption method should show, a lot of guessing and second-guessing is needed to solve the puzzle, besides a logical mind. You should train yourself to recognise logical patterns; think along the lines of file structures, bytes and bits. With pen and paper you can write down notes, compare things more easily, write out binary values, and try different logical methods to get where you need to be.

Compression

The purpose of compression is to shrink the length of the file data in an archive, so as to store as much data as possible in a small amount of disk space. Compression can also be used to increase the speed of loading files, as you only need to read a small amount of data to retrieve a complete larger file. This comes at a cost though, as you need to have a fast CPU to do the decompression quickly.

Many game archives choose to use standard compression types, so these will be covered here. For those that use a custom compression algorithm, the process of decompressing these files can be extremely difficult, but we will try to provide some pointers.

How to tell when data is compressed

Compression, just like encryption, is not easy to detect - compressed data is random and often looks just like any typical file. The most reliable way to determine file compression is to look in the archive directory for a Decompressed Size field. If this field exists, it is a certainty that some form of compression is used. You may also be able to determine the possible compression method used by looking at the compression ratio.

Most other ways to detect compression are to look at the game executables themselves, such as using disassemblers, or to look through other game files - these techniques are discussed earlier in relation to encryption, but the process is the same.

ZLib Compression

The vast majority of compressed files in archives use ZLib. This compression method has many advantages: high compression rates, fast decompression, and it is free to use and open source.

More information about ZLib, including programs that can be used to read or write ZLib-compressed files, can be found at http://www.gzip.org/zlib/

Detecting ZLib compression is thankfully nice and easy, as it has a recognisable header. If your file data is compressed and starts with the character x (byte 0x78), it is almost guaranteed to be ZLib compression.
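
Python's standard zlib module reads this format directly, so a quick test (a minimal sketch) looks like this:

  import zlib

  def try_zlib_decompress(data):
      # ZLib streams usually begin with byte 0x78 (the character "x").
      if not data or data[0] != 0x78:
          return None
      try:
          return zlib.decompress(data)
      except zlib.error:
          return None

  sample = zlib.compress(b"RIFF....WAVEfmt ")   # make some test data
  print(try_zlib_decompress(sample))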

PKZip Compression

Many games nowadays, particularly games that encourage a degree of modding, use the standard PKZip compression to compress individual files, or just pack all their files into an actual *.zip file. However, these games will usually change the extension to something other than *.zip - a common example is *.pk3. A PKZip archive can be detected by the header tag PK, which occurs at the beginning of the archive, at the beginning of each file, and in the end directory if it exists.

Information about the PKZip format, including programs and specifications, can be found at http://www.pkware.com/.

Some games also use PKZip compression, due to it being a standard and successful compression format, but still don’t want people to access the files - a technique which is being used in a few games is to slightly modify a *.zip archive so that it isn’t openable by normal zip programs. One method is to encrypt the entire archive, usually using a single-value XOR (as discussed earlier), or to change the PK header into a different header like QL.

To determine whether an archive is PKZip compatible, it is usually quickest to change the extension to *.zip and try to open it using a standard zip program like WinZip.

The WinZip program is a very common zip reader, and is available from http://www.winzip.com.

Other Standard Compression Types

Although not very popular at the moment, some games try to increase compression rates by using other semi-standard compression types. Some of the ones that have been used include generic TAR archives, and archives using the WinRAR library. Expect to see some of these compression archives to appear in upcoming games.

WinRAR is similar to the zip compression methodology, and achieves similar compression ratios. Information on the RAR format, and the main program for reading these archives, can be found at http://www.win-rar.com/.

How to break unknown compression

If you have not been able to determine the compression type from one of the above techniques, chances are the compression has been developed in-house by the game developer. The Electronic Arts subsidiaries like to develop their own compression types, however many other developers choose to stay with standards.

Run-Length Encoding

If you are lucky, the custom compression may simply be a RLE variation, which is thankfully relatively easy to determine.

RLE stands for Run-Length Encoding, and is basically a way to compress a file by removing repetitions. This technique can sometimes be successful, particularly on plain-text files. This compression is also commonly used for image files, such as BMP images.

An RLE-based method is often found by looking at compressed text files. You will usually be able to read the first few lines of the text, but there will be some characters at random positions throughout the text that disrupt the regular flow. For example, you could have a piece of text like this:

^Click the OK but$ton to ex$it the ga(me.

It is pretty obvious what the string is supposed to say, and the symbols amongst the letters indicate that they are some kind of control character, telling the decompressor what to do next. In the example above, the ^ symbol would mean read the next 16 characters normally, the $ would mean read the next 9 characters, and the ( means read the next 3 characters.

A control character is a byte (usually a single byte) that tells the decompression function what to do next. Some common control character meanings are:

  • read the next X bytes normally
  • repeat the next X bytes N times
  • go back to a previous position, and copy X bytes

Obviously, the example above didn’t compress the file much, in fact it actually made it longer. An example which would be much better is that below:

I know why it is $&very hot today. Because *( summer!

This string is not immediately obvious in its meaning, especially at the end because the words because summer are not grammatically flowing. Therefore, the symbols *( would probably mean that words have to be inserted here from somewhere else.

What we need to do is determine what each of the control characters means. The easiest and most reliable method is to start from the beginning of the file, ensuring that the file is plain text. If you also know what the file is supposed to say in whole, such as by finding the string actually being used in the game, then it will be much easier. Also take note of some of the common control character meanings, as presented in the list above.

In our example, we would first notice that the control characters come in groups of 2, which probably mean that the first byte tells us what to do, and the second byte would say some kind of variable related to the function.

The first control character we come across is $&. In practice, the control characters can be anything, and any number of bytes, so don’t assume that this example is totally accurate - rather use it as a sample insight into the possibilities.

In our example, the character $ means to repeat something, and the & character would say 2 things: the first 4 bits say the next 5 characters are to be repeated, and the last 4 bits say to repeat 2 times. This brings up an important thing to note when dealing with this type of compression: often you need to look at the bit level rather than the byte level.

Still using our example, we move along to the next control character. The first 4 bits of the * character says that the function is to copy text from a previous offset, and the last 4 bits say to copy 6 characters. The ( character gives the number offset to the start of the data to copy, which is a relative offset going backwards from the current position. So, in this case, the ( means the value 34. So, *( says go back 34 bytes from the current offset, and copy the next 6 characters.

If we then decompress the entire string, we would end up with the following:

I know why it is very very very hot today. Because it is summer!

Now if we compare the length of the compressed string to the decompressed string, we can see that the compressed string is 14 bytes shorter, thus compression has been achieved.

The key to this technique, which is very successful for text data, is to identify all the possible control characters, and then determine what they mean.
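
To show the general shape of such a decompressor, here is a Python sketch of one common textbook RLE variant (not the made-up scheme from the example above, and not any particular game's format): a control byte below 128 means "copy that many literal bytes", and a control byte of 128 or more means "repeat the next byte several times".

  def rle_decode(data):
      out = bytearray()
      i = 0
      while i < len(data):
          control = data[i]
          i += 1
          if control < 128:
              # Literal run: copy the next 'control' bytes unchanged.
              out += data[i:i + control]
              i += control
          else:
              # Repeated run: repeat the next byte (control - 126) times.
              out += bytes([data[i]]) * (control - 126)
              i += 1
      return bytes(out)

  # Repeat "a" 7 times (control 133), then copy 1 literal "b".
  print(rle_decode(bytes([133, ord("a"), 1, ord("b")])))   # b'aaaaaaab'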

Looking for executable strings

As with encryption, searching an executable file for strings can be really helpful, even if the compression is custom. For example, Huffman tables are used in many compression techniques, and as such there are usually some function names with Huffman in them. If you find something like LZHUncompress, or a string that mentions an LZ variant, you should look around on the internet for some LZ decompressors to see if you can find an exact match.

You could also simply delete or rename an archive and then try to start the game - the program might display an error message telling you the compression type. Also check readme files and in-game credits, as they will often list the products that are used in the game, such as Bink Video Compression.

Reverse-engineering the compression method

If the compression is custom, then you have a real problem. Compression is much more complex than encryption, and thus not many people have much success trying to decompress these files. The main and most successful method is to reverse-engineer the game executables to try and locate the compression function (using a disassembler). If this can be located, most people will just cut and paste the function into their own programs, which will allow decompression - but unfortunately a matching compression function will rarely be found. Writing one would therefore require someone who can read the decompression function in assembly language, convert it into a usable specification, and develop a compression algorithm for it - by no means an easy feat.

Worked Examples

In this section we will go through the process of cracking a format step by step, from opening it for the first time in a hex editor, to the final format presentation. To do this, we will cover some relatively easy formats. Let’s start off immediately, because the sooner you get to grips with it, the sooner you can start cracking your own!

Quake *.PAK

Games using the Quake (1 or 2) engine save their resources in archives with the *.pak extension. How did we know these files are archives? Well, when we encounter a new game, we usually look for any large files. A large file would usually be anything over about 10MB in size, however some games choose to use many small archives rather than a few large ones. In the case of Quake, we really only find one large file, pak0.pak, and the title and extension clearly indicate it may be some kind of package. We recommend that you download the free Quake 2 demo, as the pak0.pak from the demo will be used in this tutorial.

[[File:Guide To Exploring File Formats - 011 - 03.png]]

Figure 8.1a: Start of the Quake 2 pak0.pak file in Hex Workshop.

Look for the basics

There are a few things that should be looked at when you first open an archive:

  • Is there a header tag?
  • Can you see any filenames?
  • Do you notice any common file data tags?

The first one is easy - just look at the first few bytes of the archive and see whether they say anything. Header tags are usually strings, and can be any number of bytes in length, but are often confined to 4 bytes.

The second one is very important - filenames are the easiest way to locate a directory, and lead us directly into the next phase of our investigation. If you cannot see any filenames near the beginning of an archive, go to the end of the archive and look there instead.

The last one is optional, but can help you identify the types of files in your archive, and some offsets to file data. If, for example, you noticed the word RIFF at offset 80, and also at offset 690, you would know that there are at least 2 *.wav audio files in the archive, and the first one has length 610. This information lets you quickly identify the fields in the directory that correspond to file offset and file size, thus making your job much easier.

Is there a header tag?

In this case, there is a header tag, which says PACK. This is how we know that Quake and Quake 2 both use the same archive format: they both have the same header tag. However, don’t assume that just because the header tags are identical, that they are the same format. For example, there are other games that have PACK as their header, and even use *.pak as their filename, but they are different to the PACK format here.

Can you see any filenames?

Looking at the start of the archive, we can’t see anything all that useful, especially filenames. Therefore, let’s move to the end of the archive - maybe we can find some filenames there.

[[File:Guide To Exploring File Formats - 011 - 04.png]]

Figure 8.1.1.2a: A snippet from the end of Quake 2’s pak0.pak file.

Success, we have located several filename strings. The text models/weapons/v_machn/skin.pcx is quite obviously the name of a file, in this case it is an image.

Forming the directory entries

We have found the directory, so now what do we do? Some things that should be done include:

  • Are the directory entries a fixed length?
  • How are the filenames stored?
  • What other fields can we see?
Are the directory entries a fixed length?

It is very important to determine the length of a directory entry, and whether the length is fixed. After we determine that, we can start to address other issues like determining the other fields, and discover how many files are in an archive.

To find the length of each entry, we count the number of bytes between the start of the first filename, and the start of the second filename.

[[File:Guide To Exploring File Formats - 011 - 05.png]]

Figure 8.1.2.1a: A single directory entry

We can see that there are 64 bytes. We repeat the process with the next file, and see that it is still 64 bytes, even though the filename is a different length. Thus, the directory entries are a fixed length.

How are the filenames stored?

Now that we know the length of each directory entry is fixed, we can also assume that there is a maximum filename length which is also fixed. You can read about the filename storage formats in an earlier chapter. In this case, we can see that we have a filename, followed by a group of null bytes. If we measure the length between the start of the filename and the last null byte, we see the length is 56. If we check this against the second directory entry, we still get the answer 56, so our filename will have a maximum length of 56, and it is padded to this length using null bytes.

What other fields can we see?

Now we know that each directory entry is 64 bytes, and that the filename will always use 56 bytes, so we are left with 8 other bytes that we need to examine. Most fields in an archive will be 4 bytes long, and we can see that this is the case - there are 4 bytes with nulls towards the end, and another 4 bytes with nulls towards the end. The key to distinguishing the fields and their lengths is to look for the changes between a null and an actual byte value - when this occurs you can be pretty certain they are 2 different fields, as it is unusual to have a null byte in the middle of a field value.

So we have 2 4-byte fields we need to assign a purpose to. In an earlier chapter, we learnt that some of the common fields are file offset and file size, as well as ways to determine whether our field is one of these. So it is probably safe for us to assume that one of our 4-byte fields will be the offset to the file data, and the other will be the size.

Finding the purpose of the other fields

Ok so we now want to work out which field is the file offset, and which is the file size. It may also be that we have made a wrong assumption, so we can check this at the same time.

Now, most archives store the file data in the same order as the entries in the directory. So, the first directory entry will probably have a very small file offset value, as its data starts close to the beginning of the archive. Because we know this, the easiest way to determine the file offset field is to find the first directory entry, and then locate the smallest field value.

Scroll upwards through the archive until you find the first filename env/unit1_rt.pcx - somewhere around here will be the start of the directory. If we look at the 2 4-byte fields after the first filename, we can see that the first one is equal to 12, and the second is equal to 23086. 12 is a very small number, and as such it is very probable that this is the file offset. 23086 is also a pretty small number, so we will assume that it is the file size. We can check these assumptions by adding the 2 numbers together, and comparing it to the file offset for the second file.

[[File:Guide To Exploring File Formats - 011 - 06.png]]

Figure 8.1.3a: The first filename in the directory, which should be close to the start of the directory.

So, if our assumptions are correct, the second file offset should be 23086 + 12 = 23098. Looking at the first 4-byte field of the second directory entry, we can see that it has the value 23098, thus we have correctly identified both the file offset and the file size fields at the same time.

Filling in the blanks

Now we are close to completion, but there are a few things we need to tidy up. We know what the structure of the directory is, but we need to find the start of the directory. We also know that the first file starts at offset 12, so the first 12 bytes of the archive must have some purpose. Lets go back to the start of the archive and see what we can find out.

[[File:Guide To Exploring File Formats - 011 - 07.png]]

Figure 8.1.4a: At the start of the archive. The cursor is at offset 4. The Data Inspector shows the interpreted values at this location.

The archive header has a length of 12: we know that because the first file starts at offset 12. Therefore, we can assume that the header probably has 3 4-byte fields in it, as 4-byte fields are the most common. We can check this easily: the first field is definitely 4 bytes, as it contains the header tag PACK. The other 2 4-byte fields will be determined through inspection.

If we look at the second 4-byte field, we are told it has the value 49880538, as shown in the screenshot above (Figure 8.1.4a). This value is large, and almost the same size as the archive itself. We know that the directory is at the end of the archive, so maybe this field value tells us the beginning of the directory. We check this by jumping to this offset, and we find that the cursor lands at the start of the first filename - we have thus located the field for the Directory Offset.

Now that we know the start of the directory, we can work out the directory length, and the number of files. Our directory length is the length of the archive minus the directory offset. In this case, it is 49951322 - 49880538 = 70784. We can take this one step further: we know that each directory entry has a length of 64, so the number of files in the archive is the directory length divided by the entry length, which is 70784 / 64 = 1106. Both the Number of Files and the Directory Length values are possible candidates for our remaining 4-byte field in the archive header.

So we go back to the header and look at the value of our remaining 4-byte field: it has the value 70784, which we have just worked out is the Directory Length.

The final result

When we put all this information together, these are the final specifications for this archive format. Now all that’s left is to write a program to read and/or write these archives.

The format is a Directory archive, which is one of the patterns identified in an earlier chapter. The values for the archive in our example are shown after each field. Also note the value 1106 - there are 1106 entries in the directory, and 1106 blocks of file data in the archive.

Archive Header

4 - Header Tag (String) PACK
4 - Directory Offset 49880538
4 - Directory Length 70784

File Data

File Data 1

X - File Data (the data for the file env/unit1_rt.pcx)

File Data 2

X - File Data

…

File Data 1106

X - File Data

Directory

File Entry 1

56 - Filename (null terminated) env/unit1_rt.pcx
4 - File Offset 12
4 - File Size 23086

File Entry 2

56 - Filename (null terminated)
4 - File Offset
4 - File Size

…

File Entry 1106

56 - Filename (null terminated)
4 - File Offset
4 - File Size

This format does not store the number of files, so you will always need to do a calculation of Directory Length / 64 if you need to know the number of files, such as to set up memory for your archive reader.
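
To finish off, here is a minimal Python sketch of a reader for this format, following the specification above (it assumes the fields are little-endian 4-byte integers, which is standard for PC games and matches the values we read in the hex editor):

  import struct

  def read_pak_directory(path):
      with open(path, "rb") as f:
          # Archive header: tag, directory offset, directory length.
          tag, dir_offset, dir_length = struct.unpack("<4sII", f.read(12))
          if tag != b"PACK":
              raise ValueError("not a Quake PACK archive")
          f.seek(dir_offset)
          entries = []
          for _ in range(dir_length // 64):          # 64 bytes per directory entry
              raw_name, offset, size = struct.unpack("<56sII", f.read(64))
              name = raw_name.split(b"\x00", 1)[0].decode("ascii")
              entries.append((name, offset, size))
      return entries

  for name, offset, size in read_pak_directory("pak0.pak")[:5]:
      print(name, offset, size)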

Appendix

Binary  Byte Number Table

Value

Bit 7 (27)

Bit 6 (26)

Bit 5 (25)

Bit 4 (24)

Bit 3 (23)

Bit 2 (22)

Bit 1 (21)

Bit 0 (20) 0

0 0 0 0 0 0 0 0 1

0 0 0 0 0 0 0 1 2

0 0 0 0 0 0 1 0 3

0 0 0 0 0 0 1 1 4

0 0 0 0 0 1 0 0 5

0 0 0 0 0 1 0 1 6

0 0 0 0 0 1 1 0 7

Every byte value can be read from the grid below: split the byte into its high four bits (the row) and its low four bits (the column), and the decimal value is at the intersection.

            Low bits
High bits   0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
0000           0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
0001          16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31
0010          32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47
0011          48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63
0100          64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79
0101          80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95
0110          96   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111
0111         112  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127
1000         128  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143
1001         144  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159
1010         160  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175
1011         176  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191
1100         192  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207
1101         208  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223
1110         224  225  226  227  228  229  230  231  232  233  234  235  236  237  238  239
1111         240  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255
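If you would rather not read values off the table by hand, a few lines of code will do the conversion for you. The snippet below is plain Python, used here only as an illustration - the guide itself is not tied to any particular programming language.

# Show a byte value in decimal, hexadecimal and binary form.
def describe_byte(value):
    if not 0 <= value <= 255:
        raise ValueError("a byte must be between 0 and 255")
    return "decimal %3d = hex %02X = binary %s" % (value, value, format(value, "08b"))

print(describe_byte(7))      # decimal   7 = hex 07 = binary 00000111
print(describe_byte(255))    # decimal 255 = hex FF = binary 11111111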

ASCII Code Table

Standard

Byte  Hex  Char                         Byte  Hex  Char     Byte  Hex  Char    Byte  Hex  Char
   0  00   Null                           32  20   Space      64  40   @         96  60   `
   1  01   Start of Header                33  21   !          65  41   A         97  61   a
   2  02   Start of Text                  34  22   "          66  42   B         98  62   b
   3  03   End of Text                    35  23   #          67  43   C         99  63   c
   4  04   End of Transmission            36  24   $          68  44   D        100  64   d
   5  05   Enquiry                        37  25   %          69  45   E        101  65   e
   6  06   Acknowledgement                38  26   &          70  46   F        102  66   f
   7  07   Bell                           39  27   '          71  47   G        103  67   g
   8  08   Backspace                      40  28   (          72  48   H        104  68   h
   9  09   Horizontal Tab                 41  29   )          73  49   I        105  69   i
  10  0A   Line Feed                      42  2A   *          74  4A   J        106  6A   j
  11  0B   Vertical Tab                   43  2B   +          75  4B   K        107  6B   k
  12  0C   Form Feed                      44  2C   ,          76  4C   L        108  6C   l
  13  0D   Carriage Return                45  2D   -          77  4D   M        109  6D   m
  14  0E   Shift Out                      46  2E   .          78  4E   N        110  6E   n
  15  0F   Shift In                       47  2F   /          79  4F   O        111  6F   o
  16  10   Data Link Escape               48  30   0          80  50   P        112  70   p
  17  11   Device Control 1               49  31   1          81  51   Q        113  71   q
  18  12   Device Control 2               50  32   2          82  52   R        114  72   r
  19  13   Device Control 3               51  33   3          83  53   S        115  73   s
  20  14   Device Control 4               52  34   4          84  54   T        116  74   t
  21  15   Negative Acknowledgement       53  35   5          85  55   U        117  75   u
  22  16   Synchronous Idle               54  36   6          86  56   V        118  76   v
  23  17   End of Transmission Block      55  37   7          87  57   W        119  77   w
  24  18   Cancel                         56  38   8          88  58   X        120  78   x
  25  19   End of Medium                  57  39   9          89  59   Y        121  79   y
  26  1A   Substitution                   58  3A   :          90  5A   Z        122  7A   z
  27  1B   Escape                         59  3B   ;          91  5B   [        123  7B   {
  28  1C   File Separator                 60  3C   <          92  5C   \        124  7C   |
  29  1D   Group Separator                61  3D   =          93  5D   ]        125  7D   }
  30  1E   Record Separator               62  3E   >          94  5E   ^        126  7E   ~
  31  1F   Unit Separator                 63  3F   ?          95  5F   _        127  7F   Delete

Extended (bytes 128 to 255 depend on the code page in use; the characters below correspond to OEM code page 850, so other machines or code pages may display them differently)

Byte  Hex  Char      Byte  Hex  Char     Byte  Hex  Char     Byte  Hex  Char
 128  80   Ç          160  A0   á         192  C0   └         224  E0   Ó
 129  81   ü          161  A1   í         193  C1   ┴         225  E1   ß
 130  82   é          162  A2   ó         194  C2   ┬         226  E2   Ô
 131  83   â          163  A3   ú         195  C3   ├         227  E3   Ò
 132  84   ä          164  A4   ñ         196  C4   ─         228  E4   õ
 133  85   à          165  A5   Ñ         197  C5   ┼         229  E5   Õ
 134  86   å          166  A6   ª         198  C6   ã         230  E6   µ
 135  87   ç          167  A7   º         199  C7   Ã         231  E7   þ
 136  88   ê          168  A8   ¿         200  C8   ╚         232  E8   Þ
 137  89   ë          169  A9   ®         201  C9   ╔         233  E9   Ú
 138  8A   è          170  AA   ¬         202  CA   ╩         234  EA   Û
 139  8B   ï          171  AB   ½         203  CB   ╦         235  EB   Ù
 140  8C   î          172  AC   ¼         204  CC   ╠         236  EC   ý
 141  8D   ì          173  AD   ¡         205  CD   ═         237  ED   Ý
 142  8E   Ä          174  AE   «         206  CE   ╬         238  EE   ¯
 143  8F   Å          175  AF   »         207  CF   ¤         239  EF   ´
 144  90   É          176  B0   ░         208  D0   ð         240  F0   (soft hyphen)
 145  91   æ          177  B1   ▒         209  D1   Ð         241  F1   ±
 146  92   Æ          178  B2   ▓         210  D2   Ê         242  F2   ‗
 147  93   ô          179  B3   │         211  D3   Ë         243  F3   ¾
 148  94   ö          180  B4   ┤         212  D4   È         244  F4   ¶
 149  95   ò          181  B5   Á         213  D5   ı         245  F5   §
 150  96   û          182  B6   Â         214  D6   Í         246  F6   ÷
 151  97   ù          183  B7   À         215  D7   Î         247  F7   ¸
 152  98   ÿ          184  B8   ©         216  D8   Ï         248  F8   °
 153  99   Ö          185  B9   ╣         217  D9   ┘         249  F9   ¨
 154  9A   Ü          186  BA   ║         218  DA   ┌         250  FA   ·
 155  9B   ø          187  BB   ╗         219  DB   █         251  FB   ¹
 156  9C   £          188  BC   ╝         220  DC   ▄         252  FC   ³
 157  9D   Ø          189  BD   ¢         221  DD   ¦         253  FD   ²
 158  9E   ×          190  BE   ¥         222  DE   Ì         254  FE   ■
 159  9F   ƒ          191  BF   ┐         223  DF   ▀         255  FF   (non-breaking space)
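When examining an unknown file you will constantly be matching raw byte values against the characters in this table. A small hex viewer does that for you; the sketch below is plain Python and is only an illustration - it is not part of Game Extractor, Multi-Ex Commander, or any other tool mentioned in this guide.

import sys

# Dump a file 16 bytes per line: offset, hexadecimal values, then the printable ASCII characters.
def hex_dump(path, limit=256):
    with open(path, "rb") as f:
        data = f.read(limit)
    for offset in range(0, len(data), 16):
        chunk = data[offset:offset + 16]
        hex_part = " ".join("%02X" % byte for byte in chunk)
        # Bytes 32-126 are the printable part of standard ASCII; anything else is shown as a dot.
        text_part = "".join(chr(byte) if 32 <= byte <= 126 else "." for byte in chunk)
        print("%08X  %-47s  %s" % (offset, hex_part, text_part))

if __name__ == "__main__":
    hex_dump(sys.argv[1])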

 

Useful References

A collaborative collection of format specifications for as many proprietary file formats as possible. Originally started by WATTO and Mike Zuurman (the authors of this book), it now covers literally thousands of formats that have already been analysed for you, making it much easier to write your own programs for them.

The website is a Wiki, so anyone is free to add their own information and make corrections. It is definitely the first place to visit when looking for specification documents for unknown file formats.

The website of co-author Mike Zuurman, and home to Multi-Ex Commander. Multi-Ex Commander is a Windows-based program that can open and manipulate many hundreds of game archives through the use of its own specialist scripting language. It is nice and easy to use, and is great for replacing files in existing archives.

The website of co-author WATTO, and home to Game Extractor. Game Extractor is a game archive viewer/editor that can be run on any platform and supports a host of different game formats. It includes image viewers and converters and a hex viewer, and it can open archives from a number of gaming consoles, including the Xbox and PS2.

A huge collection of file format specifications. The specifications are often taken from the company developing or maintaining the format, so the material is reliable. Contains specifications for all types of files including sounds, images, text, archives, and executables. Unfortunately this site hasn’t been maintained for the last few years, so you won’t find anything about relatively new formats, but you will find plenty about formats prior to 2003.

Some Common File Format Tags

Tag    Extension     Type        Description
BM     *.bmp *.dib   Image       Microsoft-standard bitmap image (http://www.microsoft.com)
CWS    *.swf         Animation   Macromedia Flash animation, compressed (http://www.macromedia.com/flash)
DDS    *.dds         Image       Microsoft-standard DirectX image (http://www.microsoft.com/directx)
FWS    *.swf         Animation   Macromedia Flash animation, uncompressed (http://www.macromedia.com/flash)
GIF    *.gif         Image       GIF image (http://www.compuserve.com)
MSCF   *.cab         Archive     Microsoft Cabinet archive (http://www.microsoft.com)
MZ     *.exe *.dll   Executable  Windows executable program or library (http://www.microsoft.com)
%PDF   *.pdf         Document    Standard Adobe PDF document (http://www.adobe.com)
PK     *.zip         Archive     Standard ZIP (PKZip) archive (http://www.pkzip.com)
Rar!   *.rar         Archive     RAR archive (http://www.rarsoft.com)
RIFF   *.wav         Audio       Microsoft-standard RIFF audio file (http://www.microsoft.com)
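Reading the first few bytes of a file and comparing them against tags like these is often the quickest way to identify a mystery file, regardless of its extension. The sketch below shows the idea in plain Python; the tag list is simply the table above and is far from complete.

# Map magic tags (as raw bytes) to a human-readable description.
# These are the tags from the table above - extend the list as you discover more.
MAGIC_TAGS = {
    b"BM":   "Bitmap image",
    b"CWS":  "Flash animation (compressed)",
    b"DDS":  "DirectX image",
    b"FWS":  "Flash animation (uncompressed)",
    b"GIF":  "GIF image",
    b"MSCF": "Microsoft Cabinet archive",
    b"MZ":   "Windows executable",
    b"%PDF": "PDF document",
    b"PK":   "ZIP archive",
    b"Rar!": "RAR archive",
    b"RIFF": "RIFF container (e.g. WAV audio)",
}

def identify(path):
    # Read enough bytes to cover the longest tag we know about.
    with open(path, "rb") as f:
        header = f.read(8)
    for tag, description in MAGIC_TAGS.items():
        if header.startswith(tag):
            return description
    return "Unknown format"

if __name__ == "__main__":
    import sys
    print(identify(sys.argv[1]))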

Legal Information

This book aims to help programmers gain an understanding and appreciation of the various file formats in use today. Using this knowledge will help programmers develop their own formats, and increase support for different file formats in their own programs.

This book supports the exploration of file formats through examination and experimentation. None of the information presented is intended to be used for illegal activities such as hacking or cracking games - that is not what this book is about, and it requires a totally different approach to exploring file formats. We encourage the exploration of any file, so long as the exploration is for your own benefit only and the results will not be used or distributed in an attempt to do anything illegal, including hacking files, bypassing legal or copyright protection measures built into a file, or acting against a company or individual. If your exploration is for your own benefit and use, we fully support you - there is nothing illegal about exploring the files on your own computer for your own use. If you do wish to use the information you have gained through exploration, make sure you check for any licensing issues, trademarks, or copyrights that may be associated with the format - otherwise you could end up facing major fines and criminal charges.

This book, and the authors, do not encourage or support the exploration of copyrighted or otherwise protected material for any purpose. Reading this material does not grant you permission to modify or distribute information contained in any file that is not of your own creation.

This book, and the authors, do not support and are not affiliated with any game, program, company, website, copyright, or trademark referenced within. All copyrights, trademarks, and similar rights are used for identification purposes only. All rights reserved.

This book must not be duplicated, altered, distributed, or used for anything other than personal use. You have permission to make one printed or electronic backup of this book for your own individual use only. Please contact the authors if you wish to distribute or otherwise use the book for purposes other than personal use - we will usually be quite happy to let you use it.

Thanks, and happy reading!

WATTO and Mike.

