Learn more about Israeli genocide in Gaza, funded by the USA, Germany, the UK and others.

Run-length encoding in C

Question 1.5 of Cracking the Coding Interview:

Implement a method to perform basic string compression using the counts of repeated characters. For example, the string aabcccccaaa would become a2b1c5a3. If the “compressed” string would not become smaller than the original string, your method should return the original string. You can assume the string has only upper and lower case letters (a-z).

The question doesn’t name it, but this is run-length encoding. (The exception for longer strings is not traditional run-length encoding, however. It’s also a rather ugly API.)

Here’s a solution in C. An interesting property of the C implementation is that it needs to calculate the length of the returned string before it actually generates that string, in order to malloc a string of the appropriate size. The algorithm consists of two passes which look very similar, but where the second pass actually adds characters to the string, the first pass just increments a string length counter.

A handy API for this is snprintf(NULL, 0, ...), which returns the length of the string that sprintf would have printed if given a real string to print to.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

int run_length(char* s) {
  char c = s[0];
  if (c == '\0') return 0;
  int length = 1;
  for(; s[length] == c; length++);
  return length;
}

int strlen_i(int i) {
  return snprintf(NULL, 0, "%i", i);
}

int rle_len(char* s) {
  int len = 0;
  int rl = run_length(s);
  while (rl > 0) {
    len += 1 + strlen_i(rl);
    s += rl;
    rl = run_length(s);
  }
  return len;
}

char* rle(char* s) {
  int old_len = strlen(s);
  int new_len = rle_len(s);

  if (old_len <= new_len) {
    return strdup(s); // provides free()able copy
  }

  char* out = malloc((new_len+1) * sizeof(char));

  int out_next = 0;
  int rl = run_length(s);
  while (rl > 0) {
    out_next += sprintf(&out[out_next], "%c%i", s[0], rl);
    s += rl;
    rl = run_length(s);
  }
  out[out_next] = '\0';

  return out;
}

void test_rle_len(char* input, int expected) {
  assert(rle_len(input) == expected);
}

void test_rle(char* input, char* expected) {
  char* output = rle(input);
  assert(strcmp(output, expected) == 0);
  free(output);
}

int main(int argc, char** argv) {
  test_rle_len("aabcccccaaa", 8);
  test_rle_len("aaaaaaaaaabbbbbbbbbbbbbbbbbbbb", 6);
  test_rle_len("", 0);
  test_rle("aabcccccaaa", "a2b1c5a3");
  test_rle("aaaaaaaaaabbbbbbbbbbbbbbbbbbbb", "a10b20");
  test_rle("", "");
  test_rle("abcd", "abcd");
  return 0;
}

This implementation is constant-time in the length of the input, which is optimal.

Tagged #ctci, #programming, #c.

Similar posts

More by Jim

Want to build a fantastic product using LLMs? I work at Granola where we're building the future IDE for knowledge work. Come and work with us! Read more or get in touch!

This page copyright James Fisher 2020. Content is not associated with my employer. Found an error? Edit this page.