Some H2D2 progress, whilst messing with Run Length Encoding

So progress on my programming language, still codenamed H2D2 continues on background threads :-) I've not given up, but I haven't made many posts on the blog about it recently. It was probably getting boring anyway, and recent work has been tidying things up and writing unit tests - which does not make for interesting reading.

One thing that I wanted to try, was to compress the H2D2 bytecode so that it doesn't take as much time to serialise a running program between machines. I believe that Java JAR files are zipped, so I thought that I'd do something similar with H2D2. But I also wanted something that I could write myself that didn't take an age. Since I'm doing this for fun, it's nice to start with a blank page and go from there. So anyway, my web wanderings brought me to here. That page talks about Run Length Encoding - something that I implemented once myself back in the 1980s (or maybe the early 90s) - not that I have the source code anymore. But I read about the RLE PackBits algorithm and thought it might be a good way to go. Not a massive processing overhead and reasonably simple to implement. I chose not to look at the example souce code provided ... and just to write it from scratch myself. It sounded like fun.

It took me about an hour whilst drinking my morning coffee to write the compression code, and an hour or so in the evening to decompress the data back to the original values (I also wrote some unit tests to check for any obvious problems I could see). It seems to work OK: when I compress H2D2 bytecode I'm getting about a 30% reduction. It's not as good as zip compression obviously, but it will do for now.

I'm sure my code won't win any prizes for optimisation or style, but at least it works. I guess it might also come in handy if you were trying to squash some data on a microprocessor without needing a massive library or lots of processing power. So here is an example C program:

// Example RLE compression with PackBits
// ...just an example, not production code :-)

#include <stdio.h>
#include <string.h> // for memset

void fgrow(char*, char*);
void fshrink(char*, char*);

int main(int argc, char *argv[])
{
  if (argc==4 && strlen(argv[1])==2 && argv[1][0]=='/')
  {
    switch (argv[1][1])
    {
      case 'c':
        printf("Compressing... ");
        fshrink(argv[2], argv[3]);
        printf("done!\r\n");
        return 0;

      case 'u':
        printf("Uncompressing... ");
        fgrow(argv[2], argv[3]);
        printf("done!\r\n");
        return 0;
    }
  }
  
  printf("Usage: RLE { /c | /u }  \r\n");
  printf("   use the /c switch to compress or the");
  printf(" /u switch to uncompress\r\n");
  return 1;
}

void fgrow(char* inpath, char* outpath)
{
  char ch=0, hdr=0;
  FILE* op = fopen(outpath, "wb");
  FILE* mp = fopen(inpath, "rb");

  while (!feof(mp))
  {
    hdr=fgetc(mp);
    if (feof(mp)) break;

    if (hdr>=0) // copy bytes verbatim
    {
      for (; hdr>=0; hdr--)
      {
        ch = fgetc(mp);
        fputc(ch, op);
      }
    }
    else // copy a run of bytes
    {
      ch=fgetc(mp);
      for (; hdr<=0; hdr++)
      {
        fputc(ch, op);
      }
    }
  }
  fclose(mp);
  fclose(op);
}

void fshrink(char* inpath, char* outpath)
{
  char buf[128];
  memset(buf, 0, 128);
  int n=0, hdr=0, runlength=0;
  FILE* op = fopen(outpath, "wb");
  FILE *mp = fopen(inpath, "rb");

  while (!feof(mp))
  {
    buf[n++] = fgetc(mp);
    if (feof(mp))
    {
      n--;
      break;
    }

    if (n==128)
    {
      // buffer is full
      hdr=127;
      fputc(hdr, op);
      fwrite(buf, 1, 128, op);
      memset(buf, 0, 128);
      n=0;
      continue;
    }

    if (n >= 3 && buf[n-1]==buf[n-2] && buf[n-2]==buf[n-3])
    {
      // last three bytes are the same
      hdr = n-4;
      char symbol = buf[n-1];
      char tmp = symbol;
      if (n > 3)
      {
        fputc(hdr, op);
        fwrite(buf, 1, n-3, op);
      }
      runlength=3;

      while(!feof(mp) && tmp==symbol)
      {
        tmp = fgetc(mp);
        n++;
        runlength++;
      }
      hdr = 2-runlength;
      fputc(hdr, op);
      fputc(symbol, op);

      // put the different byte back for next read
      if (!feof(mp)) ungetc(tmp, mp);
      memset(buf, 0, 128); 
      n=0;
    }
  }

  // flush the buffer if not empty...
  if (n > 0)
  {
    hdr = n-1;
    fputc(hdr, op);
    fwrite(buf, 1, n, op);
  }

  fclose(mp);
  fclose(op);
}

Obviously this compression will only shrink files which have runs of the same repeated byte value in them. Otherwise it may increase the file size. You can try it from the command line and pass /c as the first argument to compress a file and then use /u to decompress the file back. The next two arguments need to be the names of the input and output files respectively. This example doesn't have any error trapping, so if the file fails to open, or something like that it will go bang. But that would not be hard to add.

Of course, in my H2D2 code, I'm using the same algorithm, but I've adapted the code to work on blocks of memory rather than on files. But in H2D2 I have made a VFILE struct to work alongside the C FILE pointers and I've added some functions that work the same as the C IO functions but taking my VFILE pointer instead. This means that I can easily swap code to operate either on disk or in memory...