
Typeless programming languages (BCPL, B), C evolution and decompilation

 3 years ago
source link: https://yurichev.com/blog/typeless/

DCC decompiler by Cristina Cifuentes

The early DCC decompiler by Cristina Cifuentes writes its output as C-like code to files with the .B extension.

Here is an example:

/*
 * Input file   : STRLEN.EXE
 * File type    : EXE
 */
#include "dcc.h"


void proc_1 (int arg0)
/* Takes 2 bytes of parameters.
 * High-level language prologue code.
 * C calling convention.
 */
{
int loc1;

    loc1 = 0;
    arg0 = (arg0 + 1);

    while ((*arg0 != 0)) {
        loc1 = (loc1 + 1);
        arg0 = (arg0 + 1);
    }   /* end of while */
}


void main ()
/* Takes no parameters.
 */
{
int loc1;
    loc1 = 404;
    proc_1 (loc1);
}

Perhaps she had the B programming language in mind?

B programming language

The B programming language was developed by Ken Thompson and Dennis Ritchie before they worked on C. In essence, B is a typeless C.

Here is a B code snippet:

strcopy(sl,s2)
{
	auto i;
	i = 0;
	while (lchar(sl,i,char(s2,i)) != '*e') i++;
}

This is very similar to C, but there are no types in the function definition. Local variables are declared with the auto keyword.

All arguments and variables have just one possible type: a CPU register, or a word in the environment of old computers, or int in C lingo.

As far as I know, B was used in UNIX v2.

Strings handling in B

String handling in B is tricky, since B has no notion of bytes. Characters are packed four to a register (or word). A "Hello world" program then looks like this (if I'm correct; I have no B compiler):

main()
{
 putchar('hell'); putchar('o, w'); putchar('orld'); putchar('!*n');
}

putchar() prints all 4 characters of its input word. If you need to print only 1, 2 or 3 characters, the rest of the word is padded with zero bytes.

Here is a function that fetches the character at a given index from a vector of words:

char(s, n)
{
	auto y,sh,cpos;
	y = s[n/4];        /* word containing n-th char */
	cpos = n%4;        /* position of char in word */
	sh = 27-9*cpos;    /* bit positions to shift */
	y =  (y>>sh)&0777; /* shift and mask all but 9 bits */
	return(y);         /* return character to caller */
}

The code snippet is taken from this article written by B. W. Kernighan. Since 9 bits are allocated per character and 4 characters are packed into a word, this code was intended for a 36-bit computer. As I understand it, the 36-bit Honeywell 6000 series is meant.

As Dennis Ritchie states, C was developed to overcome the limits of typeless variables: first to make string handling easier, and also to handle floating-point variables.
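To make the shift arithmetic concrete, here is a rough C sketch of the same lookup. It assumes a 36-bit word kept in the low bits of a uint64_t; pack_word() and get_char() are hypothetical names of my own, not part of B or of Kernighan's text.

```c
#include <stdint.h>

/* Hypothetical helper (not in the original article): pack four 9-bit
 * characters into the low 36 bits of a 64-bit integer, with the first
 * character in the highest position. */
uint64_t pack_word(unsigned c0, unsigned c1, unsigned c2, unsigned c3)
{
    return ((uint64_t)(c0 & 0777) << 27) |
           ((uint64_t)(c1 & 0777) << 18) |
           ((uint64_t)(c2 & 0777) << 9)  |
            (uint64_t)(c3 & 0777);
}

/* Same arithmetic as the B char(s, n) function. */
unsigned get_char(const uint64_t *s, unsigned n)
{
    uint64_t y = s[n / 4];        /* word containing the n-th char */
    unsigned cpos = n % 4;        /* position of the char in the word */
    unsigned sh = 27 - 9 * cpos;  /* bit positions to shift */
    return (unsigned)((y >> sh) & 0777); /* keep only 9 bits */
}
```

With two words built by pack_word('h','e','l','l') and pack_word('o',0,0,0), get_char() at index 4 yields 'o' and index 5 yields the zero padding.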

B’s heritage in C

Amusingly, the latest GCC can still compile B code. I tried this, and GCC compiled it, treating all types as int:

f(a, b, c)
{
	return a+b+c;
};

GCC compiles even this:

f1(a, b, c)
{
	auto tmp;
	tmp=a+b;
	return tmp+c;
};

This was an oddity for C learners in the past: no one could understand the auto keyword, and C textbooks omitted any explanation. But it seems it's just a heritage of B. With no type given, GCC treats an auto declaration as int.

Why does B have the auto keyword? Well, if you replace auto with static, the variable is placed in static storage, like a global variable, instead of on the stack. This is still true in C today (in C++, auto has meant type deduction since C++11). So auto means that these variables are to be placed on the stack.

Apparently, all this stuff was in C to ease porting from B source code? Or was C just typed B at the time, like C++ is C with classes?
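A small C sketch of the difference (count_auto() and count_static() are hypothetical names used only for illustration, and the explicit "auto int" spelling assumes the code is compiled as C, not C++):

```c
/* Automatic storage: n is created anew on every call. */
int count_auto(void)
{
    auto int n = 0;
    n++;
    return n;
}

/* Static storage: n is initialized once and keeps its value
 * between calls, like a global variable. */
int count_static(void)
{
    static int n = 0;
    n++;
    return n;
}
```

Calling count_auto() repeatedly always returns 1, while count_static() returns 1, 2, 3 and so on.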

And it's still possible in C/C++ to pack 4 (or fewer) characters into a word:

int a='test';

This looks like a feature unique to C/C++?
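One caveat: the value of a multi-character constant like 'test' is implementation-defined, and compilers usually warn about it. GCC documents that it builds the value one character at a time, shifting the previous value left. The sketch below does the same packing explicitly, so it does not depend on the compiler; pack_chars() is a hypothetical helper of my own.

```c
#include <stdint.h>

/* Pack four characters into a 32-bit word, first character in the
 * most significant byte. The casts through unsigned char avoid
 * sign extension on platforms where char is signed. */
uint32_t pack_chars(char c0, char c1, char c2, char c3)
{
    return ((uint32_t)(unsigned char)c0 << 24) |
           ((uint32_t)(unsigned char)c1 << 16) |
           ((uint32_t)(unsigned char)c2 << 8)  |
            (uint32_t)(unsigned char)c3;
}
```

Assuming ASCII, pack_chars('t', 'e', 's', 't') gives 0x74657374.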

K&R C syntax

Sometimes, in ancient C code, we can find function definitions where the argument types are listed after the first line:

f2(a, b, c)
char a;
char b;
{
	return a+b+c;
};

The latest GCC still compiles this: a and b are treated as arguments of char type, and c still has the default int type.

Perhaps the K&R C function definition syntax appeared when programmers ported B code to C and just supplied each function argument with the corresponding data type. It looks clumsy, so the later ANSI C standard allows much more familiar definitions:

f2(char a, char b, int c)
{
	return a+b+c;
};

Hungarian notation

When you write a lot of code in typeless languages (including assembly language), you need to keep track of which variable has which type. Apparently, Hungarian notation was heavily used here:

The Hungarian notation was developed to help programmers avoid inadvertent type errors.

( https://en.wikipedia.org/wiki/BCPL )

Well, maybe not in this precise form, but you get the idea: a variable (or function) name can also encode its data type.
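As a rough illustration, here is typeless-style code written in C, where every value is a machine word and only the name prefixes tell you what it holds. The word typedef, the count_char() function and the prefixes (psz for a pointer to a zero-terminated string, ch for a character, c for a count) are my own assumptions, chosen just for this sketch.

```c
#include <stdint.h>

/* The single "type" of a typeless language: one machine word. */
typedef intptr_t word;

/* Count occurrences of a character in a string.
 * pszText holds a pointer, chWanted holds a character code;
 * the compiler sees both as plain words. */
word count_char(word pszText, word chWanted)
{
    const char *psz = (const char *)pszText;
    word cFound = 0;

    while (*psz != 0) {
        if (*psz == (char)chWanted)
            cFound++;
        psz++;
    }
    return cFound;
}
```

Nothing stops the caller from swapping the two arguments, which is exactly the kind of mistake the prefixes are meant to make visible.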

Hungarian notation in typed languages (C/C++)

It was (or still is?) heavily used by Microsoft. Perhaps (I'm not sure) it was an attempt to improve C/C++ code readability?

Can B still be used today?

The B language is even simpler than C. Maybe it could still be used on cheap CPUs with no byte-level instructions?

Maybe it could be used for teaching: many writers of toy compilers start with typeless C-like languages.

Decompiler writers also start here, with typeless C-like languages.

Manual decompilation and typeless languages

When you decompile some piece of machine code manually, you can think of CPU registers as temporary typeless variables. Hungarian notation can also be used heavily in IDA, to keep track of the data type of a variable on the stack or anywhere else.
