CodeCast - Code in Videos

A few months ago I had this wild idea: “how hard would it be to code videos of code?” Basically: instead of screen-casting your IDE, build a dedicated tool designed to render code as videos, allowing for animated highlighting and all the sorts of things that people typically add with After Effects and similar video editing software. A quick search uncovered the jCodec project, and I figured that Kotlin is an almost perfect language to define a nice “video DSL” in, so over a weekend I set to work writing a tool that I nicknamed “CodeCast”. This evening, I’m publishing the code to GitHub so that it’s not just living on my machine anymore.

Now, full disclosure: CodeCast is a bodge job. It’s not pretty on the inside, its DSL is incomplete, and it’s very limited in what it can do. That out of the way: it works.

There are possibly better toolkits to use, but a pure JVM project makes it easy to find libraries (like jCodec) to do the video encoding and everything else that comes with this sort of project. Kotlin isn’t just great at building DSLs, it can also be used as a scripting language (via the javax.script API), and while I’m not using that feature yet (the scenes are hard-coded) it’s not a hard feature to add.

If you like the idea of an extensible tool designed for programming videos about programming, take a quick look and maybe try writing a script of your own. If I have time over the next few months, I’ll see about adding some more features.

Suspend extensions for Vert.x Database

I was recently working with Vert.x building a project (for fun) which involved storing and retrieving data from PostgreSQL. Vert.x is an absolutely amazing project and a wonderful way to bring the best of async programming to the JVM, but you can quickly end up with something approaching the “callback hell” of early Node.js code. Using the Future class helps a lot, but since I was programming in Kotlin I decided to add my own thin database layer using coroutines.

Coroutines (suspend functions in Kotlin) allow you to build amazingly complex state-machines transparently. In the simplest terms: they allow you to write code that appears to be “blocking”, but is actually a series of callbacks and state-machines. In Kotlin this is especially powerful since, unlike in most languages, the scheduling of the coroutines is decoupled from the language. This means that things like generator functions (function* in JavaScript) are implemented in the library API rather than in the compiler or runtime environment.

To make this simpler I used the excellent Vert.x / Kotlin module: https://github.com/vert-x3/vertx-lang-kotlin which has some simple “coroutine friendly” utilities for Vert.x. I also decided to tie the coroutine interface to database transactions, since any database actions that rely on ordering are normally executed within a transaction anyway. At its simplest the class is just a list of suspend functions that delegate to Vert.x’s “normal” SQLConnection functions:

import io.vertx.core.json.JsonArray
import io.vertx.ext.sql.ResultSet
import io.vertx.ext.sql.SQLConnection
import io.vertx.ext.sql.UpdateResult
import io.vertx.kotlin.coroutines.awaitResult

class AsyncSQLConnection(val sqlConnection: SQLConnection) {
    // runs the block with auto-commit off: commits on success, rolls back on any error
    suspend inline fun <R> transaction(block: AsyncSQLConnection.() -> R): R {
        setAutoCommit(false)
        try {
            val result = block()
            commit()
            return result
        } catch (err: Throwable) {
            rollback()
            throw err
        } finally {
            setAutoCommit(true)
        }
    }

    suspend fun setAutoCommit(b: Boolean) = awaitResult<Void> { sqlConnection.setAutoCommit(b, it) }
    suspend fun execute(sql: String) = awaitResult<Void> { sqlConnection.execute(sql, it) }

    suspend fun query(sql: String) = awaitResult<ResultSet> { sqlConnection.query(sql, it) }
    suspend fun query(sql: String, params: JsonArray) = awaitResult<ResultSet> { sqlConnection.queryWithParams(sql, params, it) }

    suspend fun querySingle(sql: String) = awaitResult<JsonArray?> { sqlConnection.querySingle(sql, it) }
    suspend fun querySingle(sql: String, params: JsonArray) = awaitResult<JsonArray?> { sqlConnection.querySingleWithParams(sql, params, it) }

    suspend fun update(sql: String) = awaitResult<UpdateResult> { sqlConnection.update(sql, it) }
    suspend fun update(sql: String, params: JsonArray) = awaitResult<UpdateResult> { sqlConnection.updateWithParams(sql, params, it) }

    suspend fun commit() = awaitResult<Void> { sqlConnection.commit(it) }
    suspend fun rollback() = awaitResult<Void> { sqlConnection.rollback(it) }
}

This class is made simpler to use by adding an extension function to the Vert.x SQLClient:

// the block takes AsyncSQLConnection as its receiver, so callers can write querySingle(...) directly
suspend inline fun <R> SQLClient.transaction(block: AsyncSQLConnection.() -> R): R {
    val connection = awaitResult<SQLConnection> { getConnection(it) }
    try {
        return AsyncSQLConnection(connection).transaction(block)
    } finally {
        connection.close()
    }
}

Now to use it you can simply launch a coroutine on the Vert.x dispatcher, and query in a similar way to classic JDBC, with the results returned directly:

launch(vertx.dispatcher()) {
    sqlClient.transaction {
        val task = querySingle("SELECT * FROM tasks WHERE id = ?", JsonArray(listOf(id)))
    }
}

The great part is that this code isn’t using runBlocking but also completely avoids callbacks, while the Kotlin compiler produces fantastically clear stack-traces (allowing for much easier debugging). You can further enhance the class by adding vararg overloads of the functions, as done here:

suspend fun query(sql: String, vararg params: Any?) = query(sql, JsonArray(listOf(*params)))
suspend fun querySingle(sql: String, vararg params: Any?) = querySingle(sql, JsonArray(listOf(*params)))
suspend fun update(sql: String, vararg params: Any?) = update(sql, JsonArray(listOf(*params)))

Matching Similar Strings

I recently needed some code to compare the descriptions of bank transactions, and look for “matches”. On the surface this sounds pretty simple, but on closer inspection there are many variations of any bank transaction description. Many transactions include a date in their description which will always be different. So: strip the numbers, right? Except many transactions are just a date and an account number. Okay, so just remove the dates? But then: what about transactions marked PENDING, or transactions that are truncated, or transactions that change some random bit of their description with a transaction ID, or…

Well, we had a lot of rules and things still weren’t ideal. I spent days thinking about it, and didn’t really get anywhere except creating more rules and exceptions. Until on the way home one night, I had a thought: “Character Ngrams”. If you don’t know it already, an ngram is just a “group of something”, but the way they’re generated is special: each group overlaps the one before it, starting one position further along. For example, the string "Character Ngrams" tokenized into “bigrams” (two characters each) looks like this:

"Ch", "ha", "ar", "ra", "ac", "ct", "te", "er", "r ", " N", "Ng", "gr", "ra", "am", "ms"

What is great about these is that, because neighbouring ngrams overlap, they carry enough ordering information to help you reconstruct the original string; add more characters to each ngram and the string becomes easier to reconstruct. You’ll see why this is useful later on. So the first thing I wrote was a tokenize function that returns the 3 character ngrams of a string, but as a Set, since this code is about comparing the differences (and therefore similarity) of two strings:

fun tokenize(description: String, tokenSize: Int = 3) =
    (0..description.length - tokenSize)
        .asSequence() // no need for the intermediary List object
        .map { idx -> description.substring(idx, idx + tokenSize) }
        .toSet()

We also needed a simple function to run some basic transformation on each string. Here I’ll use a regex, but it can be done faster using more manual processes (I’m going for readability here, not speed):

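fun sanitize(description: String) =
    description
        .lowercase() // toLowerCase() on Kotlin versions before 1.5
        // String.replace(String, String) is a literal replace, so wrap the patterns in Regex;
        // keep letters, digits and whitespace, drop everything else
        .replace(Regex("[^a-z0-9\\s]"), "")
        // collapse runs of whitespace into a single space
        .replace(Regex("\\s+"), " ")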

Finally, here is the actual isMatch implementation. I’ve marked out where you might want to tweak the values for your own uses:

fun isMatch(s1: String, s2: String): Boolean {
    val sanitized1 = sanitize(s1)
    val sanitized2 = sanitize(s2)

    // do a quick prefix / equality check to cover simple cases
    if ((sanitized1.length > sanitized2.length && sanitized1.startsWith(sanitized2))
        || (sanitized2.length > sanitized1.length && sanitized2.startsWith(sanitized1))
        || sanitized1 == sanitized2)
        return true

    val tokens1 = tokenize(sanitized1)
    val tokens2 = tokenize(sanitized2)

    // subtract the one set from the other, only the differences are left
    val difference = tokens2 - tokens1

    // how few of the second string's tokens are missing from the first?
    // 0.4 - 0.6 are good values, tweak it to suit your needs
    return difference.size < tokens2.size * 0.4
}

This code should work on all variations of Kotlin (JVM, JS or Native) and is very easy to port to other languages. I suggest writing a truth list of strings you want to match, and others you don’t want to match, and then tuning things until you’re happy.

The advantage of using ngrams instead of just comparing the characters is that it makes the algorithm order-aware while still handling spelling mistakes and similar issues. An implementation that only compares which characters each string contains would consider “abc123” an exact match to “321cba” (or any other combination of those 6 characters).
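
Since the approach ports easily, here is a small Java sketch of that claim. The class and method names (NgramOrderDemo, ngrams, characters) are made up for this illustration and are not part of the code above; it only compares the two example strings both ways.

```java
import java.util.HashSet;
import java.util.Set;

public class NgramOrderDemo {
    // all overlapping substrings of the given size, as a set
    public static Set<String> ngrams(String text, int size) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + size <= text.length(); i++) {
            out.add(text.substring(i, i + size));
        }
        return out;
    }

    // the set of individual characters, which ignores order entirely
    public static Set<Character> characters(String text) {
        Set<Character> out = new HashSet<>();
        for (char c : text.toCharArray()) {
            out.add(c);
        }
        return out;
    }

    public static void main(String[] args) {
        // same characters, so a character-only comparison calls this a match
        System.out.println(characters("abc123").equals(characters("321cba"))); // true

        // trigrams keep local order, so the reversed string shares none of them
        Set<String> shared = new HashSet<>(ngrams("abc123", 3));
        shared.retainAll(ngrams("321cba", 3));
        System.out.println(shared.size()); // 0
    }
}
```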

Finally: you can make this a String extension function very easily, but I wanted to keep the example simple.

Fun with String.intern()

The intern() method in the String class is one of the lesser-known gems of the Java world. It’s quite subtle, brilliantly powerful, and potentially very dangerous. That having been said, it has long been one of my favourite bits of core Java API, mostly because it’s incredibly flexible in what it allows you to do.

What is intern()?

What does String.intern() actually do? It makes strings that look the same (ie: have the same content), into the same String object. For a more in-depth understanding, it’s worth going back to one of the most basic Java lessons: “Don’t use == on a String, use .equals”.

Console cmd = System.console();
String input = cmd.readLine("Enter Your Name: ");

if (input == "Henry") {
    // This will never ever be true
}

if (input.equals("Joe")) {
    // Well, this might be true
}

if (input.intern() == "Jeff") {
    // Surprise, this might also be true... just like .equals
}

Okay, so what’s going on here? The answer is that the == operator compares values directly. For primitives (int, long, boolean, double and friends) that is exactly what you want, but it doesn’t appear to work for objects at first. This is because in Java all objects are handled through references, so what you are testing is “is this object the same object as another object” rather than “does this object have the same data and fields as another object”. Put another way: “are these two references pointing to the same memory location?”

Referencing Basics

The String.intern() method allows you to make use of the string constant-pool in the JVM. Every time you intern a String, it’s checked against the VM’s pool of Strings and you get a reference back that will always be the same for any given bit of contents. “Hello” will always == “Hello”, and so on. This internal pool of Strings is what causes this behaviour:

String string1 = "Hello";
String string2 = "Hello";
String string3 = new String("Hello");
String string4 = string3.intern();

assert string1 == string2; // same pooled literal
assert string2 != string3; // new String(...) is always a separate object
assert string1 == string4; // intern() returns the pooled instance

When the Java compiler and VM encounter string1 and string2 they see that they have exactly the same content, and since String is an immutable type they’re turned into the same object under the hood. string3 is different: we used the new operator, effectively forcing the VM to create a new object for us. string4 however asks the VM to use its string-pool by invoking the intern() method.
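
To make the pooling rules a little more concrete, here is a minimal sketch (the class and method names are invented for this example). It relies on two things the Java language specification guarantees: compile-time constant expressions such as "Hel" + "lo" are pooled just like literals, while a concatenation performed at run time produces a newly created String.

```java
public class InternDemo {
    // concatenating non-constant values happens at run time and yields a new object
    public static String runtimeConcat(String a, String b) {
        return a + b;
    }

    public static void main(String[] args) {
        String literal = "Hello";
        String folded = "Hel" + "lo";              // folded into "Hello" by the compiler
        String built = runtimeConcat("Hel", "lo"); // built while the program runs

        System.out.println(literal == folded);         // true
        System.out.println(literal == built);          // false
        System.out.println(literal == built.intern()); // true
        System.out.println(literal.equals(built));     // true
    }
}
```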

Getting Clever

You can use interned String references to massively speed-up configuration key lookup. If you declare the possible configuration properties as constant String objects in an interface, you can intern() the keys when they are loaded externally. This allows you to use an IdentityHashMap instead of a traditional HashMap.

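import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Properties;

public class Config {
    public static final String REMOTE_SERVER_ADDRESS = "remoteServer.address";

    // IdentityHashMap compares keys with == rather than equals()
    private final Map<String, String> config = new IdentityHashMap<>();

    public void load(Properties properties) {
        // intern each loaded key so it is the same object as the matching constant
        properties.stringPropertyNames()
                .forEach(k -> config.put(k.intern(), properties.getProperty(k)));
    }

    public String get(final String key) {
        return config.get(key);
    }
}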

The above snippet will only ever work when the Config.get method is called with constants (static final) or other Strings that are interned.
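
Here is a small sketch of that failure mode, assuming a single hard-coded entry in place of a real properties file (the class name, the buildConfig helper and the "10.0.0.1" value are all invented for the example):

```java
import java.util.IdentityHashMap;
import java.util.Map;

public class IdentityLookupDemo {
    public static final String KEY = "remoteServer.address";

    // stands in for loading a file: the key arrives as a fresh String and is interned
    public static Map<String, String> buildConfig() {
        Map<String, String> config = new IdentityHashMap<>();
        String loadedKey = new String("remoteServer.address");
        config.put(loadedKey.intern(), "10.0.0.1");
        return config;
    }

    public static void main(String[] args) {
        Map<String, String> config = buildConfig();
        String copy = new String(KEY); // equal content, different object

        System.out.println(config.get(KEY));           // 10.0.0.1
        System.out.println(config.get(copy));          // null
        System.out.println(config.get(copy.intern())); // 10.0.0.1
    }
}
```

The middle lookup is the one that bites: the key is equal by content, but an IdentityHashMap never asks about content.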

Even more fun

Every field, method and class in Java has a name. Its name (or identifier) is also stored in the same string-pool that the intern() method accesses, meaning that some very strange things start to happen with reflection:

public class Main {
    public static void main(String[] args) throws Exception {
        // this is true!
        System.out.println(Main.class.getMethod("main", String[].class).getName() == "main");
    }
}

Warnings

  • Using String.intern might cause unexpected behaviour (ie: it can violate the principle of least astonishment), and make your code more confusing to your team-mates. Most of us don’t expect to see identity-equality checks with Strings. If you are tempted to use it you should check with your team before doing so.
  • Invoking intern is expensive in its own right, making it something to use with care. It’s fine when loading a configuration file on startup; it’s not something you want to do for every equality check.
  • The string-pool is a precious resource, and strings within it might not be garbage collected. For this reason you should never use intern on untrusted input.
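
If you want shared instances for data you don’t control, one option is a private, size-capped pool of your own instead of the VM-wide one. The following is only a sketch of that idea, not something from the code above: BoundedStringPool and canonical are hypothetical names, and the cap is approximate when several threads insert at once.

```java
import java.util.concurrent.ConcurrentHashMap;

public class BoundedStringPool {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();
    private final int maxSize;

    public BoundedStringPool(int maxSize) {
        this.maxSize = maxSize;
    }

    // returns a shared instance when one is pooled, otherwise pools s if there is room
    public String canonical(String s) {
        String existing = pool.get(s);
        if (existing != null) {
            return existing;
        }
        if (pool.size() >= maxSize) {
            return s; // pool is full: hand back the caller's own instance
        }
        String raced = pool.putIfAbsent(s, s);
        return raced != null ? raced : s;
    }

    public static void main(String[] args) {
        BoundedStringPool pool = new BoundedStringPool(1);
        String a = pool.canonical(new String("alpha"));
        String b = pool.canonical(new String("alpha"));
        System.out.println(a == b); // true

        String c = pool.canonical(new String("beta"));
        String d = pool.canonical(new String("beta"));
        System.out.println(c == d);      // false
        System.out.println(c.equals(d)); // true
    }
}
```

Because strings that miss the pool are simply returned as-is, callers of a pool like this still need equals() for correctness; the pool only saves memory for the values that fit.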