The Most Amazing PostgreSQL Database

For me, PostgreSQL is the most amazing (open source) database around. Even though there is much interest in stripped-down NoSQL databases like key-value stores or “data structure servers”, PostgreSQL continues to innovate at the SQL frontier.

In this post, I’ll show a few of the newer, lesser-known features of PostgreSQL - far beyond standard SQL.

hstore

hstore is a key-value store for simple data types. Using hstore, we can embed a key-value store within a column of an ordinary table.

To enable the hstore extension, run create extension hstore in the PostgreSQL prompt. After that, the hstore data type is available for our table definitions.

Let’s create a simple table with a hstore column:

create table hstoretest ( id serial primary key, data hstore );

To insert a few rows, we use a special syntax:

insert into hstoretest (data) values ('key1 => 123, key2 => "text"'::hstore);

Query the table as usual:

select * from hstoretest;

 id |                           data
----+-----------------------------------------------------------
  1 | "key1"=>"123", "key2"=>"text"

The hstore extension provides a lot of operators and functions to work with hstore columns, for example, selecting all key2 values:

select data -> 'key2' as key2 from hstoretest;

 key2
------
 text
(1 row)
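
A few more of these operators, sketched against the same table (? tests for the existence of a key, || merges key-value pairs, and delete removes a key):

select data ? 'key1' as has_key1 from hstoretest;

update hstoretest set data = data || 'key3 => "more"'::hstore where id = 1;

update hstoretest set data = delete(data, 'key2') where id = 1;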

Some more examples can be found here.

JSON

A JSON data type was introduced in release 9.2. Currently this is nothing more than a validating data type: it checks whether the string we put into that column is a valid JSON object.

Let’s create a new table to play around with this type:

create table jsontest ( id serial primary key, data json );

Now let’s insert an invalid row:

insert into jsontest (data) values ('{"title":wrong}');

ERROR:  invalid input syntax for type json
LINE 1: insert into jsontest (data) values ('{"title":wrong}');
                                            ^
DETAIL:  Token "wrong" is invalid.
CONTEXT:  JSON data, line 1: {"title":wrong...

And now with the correct JSON syntax:

insert into jsontest (data) values ('{"title":"right"}');

There isn’t really much more to the JSON data type besides the ability to return rows of non-JSON tables as JSON:

select row_to_json(hstoretest) from hstoretest;

                                       row_to_json
-----------------------------------------------------------------------------------------
 {"id":1,"data":"\"key1\"=>\"123\", \"key2\"=>\"text\""}
(1 row)

Nice if you’re used to working with JSON objects (in web applications, for example).

PLv8

Working directly with JSON and JavaScript has been all the rage in many of the NoSQL databases. Using the PLv8 extension, we can use JavaScript (executed by Google’s awesome V8 engine) directly in PostgreSQL. Together with the JSON data type, this offers amazing new possibilities.

Currently PLv8 isn’t included in the standard distribution of PostgreSQL (9.2), but installing it isn’t very hard; the only dependencies are PostgreSQL and the V8 engine. Some distributions (Arch Linux, for example) already have V8 in their repositories.

Compiling and installing the extension is straightforward as soon as the dependencies are in place:

make && sudo make install

Now we can enable the plv8 extension within our database (as we did with hstore):

create extension plv8;

A particularly nice example of using JSON and PLv8 comes from Andrew Dunstan:

create or replace function jmember (j json, key text )
 RETURNS text
 LANGUAGE plv8
 IMMUTABLE
AS $function$
  var ej = JSON.parse(j);
  if (typeof ej != 'object')
        return null;
  return JSON.stringify(ej[key]);
$function$;

The jmember function parses the JSON string and returns the member identified by key:

select jmember(data, 'title') from jsontest;

     jmember
-----------------
 "right"
(1 row)

Andrew also shows how to build an index to speed up access times in his post.
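
Since jmember is declared IMMUTABLE, it can be used in a functional index; a minimal sketch along those lines (the index name is made up):

create index jsontest_title_idx on jsontest (jmember(data, 'title'));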

k-Nearest Neighbors

In PostgreSQL 9.1, nearest-neighbor indexing was introduced. This allows us to perform orderings etc. by a distance metric.

For example, I’ve downloaded the ispell spelling dictionaries, and loaded them into a table words like this:

create table words (word varchar(50) primary key);
copy words from 'english.0';

This inserts roughly 50000 words into the table.

Since we’re working with text data, let’s introduce another extension, pg_trgm, which builds tri-grams of strings (sequences of three characters). Using these tri-grams, we can compute a distance metric. Enable the extension like this:

create extension pg_trgm;

The tri-grams of hello look like this:

select show_trgm('hello');
            show_trgm
---------------------------------
 {"  h"," he",ell,hel,llo,"lo "}
(1 row)

The distance metric is very simple: the more of these tri-grams two strings share, the closer they are.
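
pg_trgm also exposes this metric directly via the similarity function; the <-> distance operator used below is simply 1 - similarity:

select similarity('hello', 'hellos');

 similarity
------------
      0.625
(1 row)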

To take advantage of the nearest neighbor index, we have to build it:

create index word_trgm_idx on words using gist (word gist_trgm_ops);

Using the index, we can query our table for a word and get back a list of the most similar terms as well:

select word, word <-> 'hello' as distance from words order by word <-> 'hello' asc limit 10;
  word  | distance
--------+----------
 hello  |        0
 hellos |    0.375
 hell   | 0.428571
 hells  |      0.5
 heller | 0.555556
 hell's | 0.555556
 help   |    0.625
 helm   |    0.625
 held   |    0.625
 helps  | 0.666667
(10 rows)

The <-> operator comes from the pg_trgm extension. Of course we could use simpler distances, like numerical difference or geometric distance, but working with textual data is often perceived as particularly difficult (not so with PostgreSQL).
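
As an aside, the same operator symbol already works for PostgreSQL’s geometric types out of the box, where it denotes the Euclidean distance:

select point '(0,0)' <-> point '(3,4)' as distance;

 distance
----------
        5
(1 row)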

Now Postgres is more or less the first choice of people with experience running production databases. It’s more powerful, more reliable, and has a better set of features than any other open source data storage layer out there today.
— Peter van Hardenberg, Tech Lead, Heroku Postgres

A REST API in Clojure

Clojure is one of the most interesting new languages targeting the JVM. Initially it targeted only the JVM; in the meantime it is also available for JavaScript. Essentially, you can write Clojure and execute it either as a Java program or as a JavaScript program; of course, each flavor has its unique features as well.

Clojure is a Lisp, so the syntax may look foreign, but it is really, really easy to pick up since there are very few syntactic variations. Lisp as a language is very lean and usually easily learned.

In this post, we’re going to create a complete REST application from scratch. There are already some (very) good tutorials available, but some are not quite up to date (see Heroku’s Devcenter or Mark McGranaghan for good ones). Clojure itself is still a young language; Lisp, of course, has a lot of history.

Our application should allow creating, listing, fetching, updating, and deleting of documents.

A document looks like this (JSON encoded):

    {
      "id" : "some id"
      , "title" : "some title"
      , "text" : "some text"
    }
  • A GET call to /documents should return a list of these documents.
  • A POST call to /documents with a document as body should create a new document, assigning a new id (ignoring the posted one).
  • A GET to /documents/[ID] should return the document with the given id, or 404 if the document does not exist.
  • A PUT to /documents/[ID] should update the document with the given id and replace title and text with those from the document in the uploaded body.
  • A DELETE to /documents/[ID] should delete the document with the given id and return 204 (NO CONTENT) in any case.

Creating the project scaffolding

We’re going to use Leiningen, the de-facto build system and dependency manager for Clojure projects. Download and install it, then execute:

    lein new compojure clojure-rest

We’re creating a new Compojure project called clojure-rest. Compojure is the library that maps URLs to functions in our program. Compojure (and thus our project) builds on Ring, the basic server API. To start the new project, run:

    lein ring server

This starts the server on localhost:3000 and automatically restarts the server if any of the project files change. Thus, you can leave it running while we develop our application.

The new command generates two very important files for us:

project.clj is the project configuration. It states dependencies, the entry point, etc. (read the whole documentation on Leiningen.org). src/clojure_rest/handler.clj contains a starting point for our application.

Project configuration (project.clj)

    (defproject clojure-rest "0.1.0-SNAPSHOT"
      :description "FIXME: write description"
      :url "http://example.com/FIXME"
      :dependencies [[org.clojure/clojure "1.4.0"]
                     [compojure "1.1.1"]]
      :plugins [[lein-ring "0.7.3"]]
      :ring {:handler clojure-rest.handler/app}
      :profiles
      {:dev {:dependencies [[ring-mock "0.1.3"]]}})

Update the file to look like this:

    (defproject clojure-rest "0.1.0-SNAPSHOT"
      :description "REST service for documents"
      :url "http://blog.interlinked.org"
      :dependencies [[org.clojure/clojure "1.4.0"]
                     [compojure "1.1.1"]
                     [ring/ring-json "0.1.2"]
                     [c3p0/c3p0 "0.9.1.2"]
                     [org.clojure/java.jdbc "0.2.3"]
                     [com.h2database/h2 "1.3.168"]
                     [cheshire "4.0.3"]]
      :plugins [[lein-ring "0.7.3"]]
      :ring {:handler clojure-rest.handler/app}
      :profiles
      {:dev {:dependencies [[ring-mock "0.1.3"]]}})

Besides the JSON parsing library Cheshire, we added the C3P0 Connection Pool, the H2 Database JDBC driver and Clojure’s java.jdbc contrib-library.

I also updated the :url and :description fields.

The request handler (handler.clj)

Next let’s have a look at the generated request handler src/clojure_rest/handler.clj:

    (ns clojure-rest.handler
      (:use compojure.core)
      (:require [compojure.handler :as handler]
                [compojure.route :as route]))

    (defroutes app-routes
      (GET "/" [] "Hello World")
      (route/not-found "Not Found"))

    (def app
      (handler/site app-routes))

The route GET "/" [] "Hello World" is responsible for the result we saw in the browser. It maps all GET requests to / without parameters to "Hello World". The (def app (handler/site app-routes)) part configures our application (registering the routes).

Our first step is to update the configuration. We’re going to work with JSON, so let’s include some Ring middlewares to set up response headers (wrap-json-response) and parse request bodies (wrap-json-body) for us. A middleware is just a wrapper around a handler, so it can pre- and post-process the whole request/response cycle.

    (def app
      (-> (handler/api app-routes)
        (middleware/wrap-json-body)
        (middleware/wrap-json-response)))

We also switched from the handler/site template to handler/api, which is more appropriate for REST APIs (documentation).
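
To make the “wrapper” notion concrete, here is a minimal sketch of a hand-rolled middleware (wrap-logging is a made-up name, not part of Ring):

    (defn wrap-logging [handler]
      (fn [request]                              ; a middleware returns a new handler
        (println "Request to:" (:uri request))   ; pre-process the request
        (let [response (handler request)]        ; call the wrapped handler
          (println "Status:" (:status response)) ; post-process the response
          response)))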

Next let’s define the routes for our application:

    (defroutes app-routes
      (context "/documents" [] (defroutes documents-routes
        (GET  "/" [] (get-all-documents))
        (POST "/" {body :body} (create-new-document body))
        (context "/:id" [id] (defroutes document-routes
          (GET    "/" [] (get-document id))
          (PUT    "/" {body :body} (update-document id body))
          (DELETE "/" [] (delete-document id))))))
      (route/not-found "Not Found"))

We define GET and POST for the context "/documents", and GET, PUT, and DELETE for the context "/:id" on top of that. :id is a placeholder and can then be injected into our parameter vector. The POST and PUT requests have a special parameter body for the parsed body (this parameter is provided by the wrap-json-body middleware). For more on routes, take a look at Compojure’s documentation.

Before we define the functions to carry out the requests, let’s fix the imports and open a pool of database connections to work with.

The namespace declaration defines which namespaces Clojure should make available to our code.

    (ns clojure-rest.handler
      (:import com.mchange.v2.c3p0.ComboPooledDataSource)
      (:use compojure.core)
      (:use cheshire.core)
      (:use ring.util.response)
      (:require [compojure.handler :as handler]
                [ring.middleware.json :as middleware]
                [clojure.java.jdbc :as sql]
                [compojure.route :as route]))

We import C3P0’s ComboPooledDataSource, a plain Java class. Next, we pull the functions defined in compojure.core, cheshire.core, and ring.util.response into our namespace, so they can be used without qualification. Finally, we require some more libraries, this time with a qualifier to prevent name clashes or to support nicer separation. I’m not sure when to make the cut between :use and :require yet, so the cut is arbitrary.

    (def db-config
      {:classname "org.h2.Driver"
       :subprotocol "h2"
       :subname "mem:documents"
       :user ""
       :password ""})

Note that we use an in-memory database. If you’d like to keep your database between restarts, you could use :subname "/tmp/documents", for example.

Next we open a pool of connections. C3P0 has no Clojure wrapper, so we deal with Java classes and objects directly (hence a bit more code).

    (defn pool
      [config]
      (let [cpds (doto (ComboPooledDataSource.)
                   (.setDriverClass (:classname config))
                   (.setJdbcUrl (str "jdbc:" (:subprotocol config) ":" (:subname config)))
                   (.setUser (:user config))
                   (.setPassword (:password config))
                   (.setMaxPoolSize 6)
                   (.setMinPoolSize 1)
                   (.setInitialPoolSize 1))]
        {:datasource cpds}))

    (def pooled-db (delay (pool db-config)))

    (defn db-connection [] @pooled-db)

Since we deal with an in-memory database, we need to create our table now.

    (sql/with-connection (db-connection)
      (sql/create-table :documents [:id "varchar(256)" "primary key"]
                                   [:title "varchar(1024)"]
                                   [:text :varchar]))

The intent should be easy to understand; for the details, take a look at the java.jdbc documentation. We create a table documents with :id, :title, and :text columns. Note that the database column is called id, not :id.

The only thing missing are the functions to actually perform the actions requested by our clients.

To return a single document with a given id, we could come up with this:

    (defn get-document [id]
      (sql/with-connection (db-connection)
        (sql/with-query-results results
          ["select * from documents where id = ?" id]
          (cond
            (empty? results) {:status 404}
            :else (response (first results))))))

It reads like this: when called with an id, open a database connection and perform select * from documents where id = ? with the given id as parameter. If the result is empty, return 404; otherwise return the first (and only) document as response.

The response call will convert the document into JSON; this functionality is provided by wrap-json-response, which also sets the correct Content-Type etc.

Another nice one is the creation of new documents:

    (defn uuid [] (str (java.util.UUID/randomUUID)))

    (defn create-new-document [doc]
      (let [id (uuid)]
        (sql/with-connection (db-connection)
          (let [document (assoc doc "id" id)]
            (sql/insert-record :documents document)))
        (get-document id)))

Here we use Java’s UUID generator (without import, hence the full package name) to generate a new id for each document created. The second let statement is responsible for replacing the user-provided id (if any) with our generated one. Remember that Clojure’s data structures are immutable, so we need to use the document variable thereafter, instead of doc, which still contains the old (or no) id.
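
A quick REPL sketch of that immutability (the values are illustrative):

    (def doc {"title" "some title", "id" "client-sent"})
    (def document (assoc doc "id" "generated"))
    doc      ; still {"title" "some title", "id" "client-sent"}
    document ; {"title" "some title", "id" "generated"}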

Returning the document is delegated to the get-document function.

The complete handler.clj

To round the post up, here is the whole program:

    (ns clojure-rest.handler
      (:import com.mchange.v2.c3p0.ComboPooledDataSource)
      (:use compojure.core)
      (:use cheshire.core)
      (:use ring.util.response)
      (:require [compojure.handler :as handler]
                [ring.middleware.json :as middleware]
                [clojure.java.jdbc :as sql]
                [compojure.route :as route]))

    (def db-config
      {:classname "org.h2.Driver"
       :subprotocol "h2"
       :subname "mem:documents"
       :user ""
       :password ""})

    (defn pool
      [config]
      (let [cpds (doto (ComboPooledDataSource.)
                   (.setDriverClass (:classname config))
                   (.setJdbcUrl (str "jdbc:" (:subprotocol config) ":" (:subname config)))
                   (.setUser (:user config))
                   (.setPassword (:password config))
                   (.setMaxPoolSize 1)
                   (.setMinPoolSize 1)
                   (.setInitialPoolSize 1))]
        {:datasource cpds}))

    (def pooled-db (delay (pool db-config)))

    (defn db-connection [] @pooled-db)

    (sql/with-connection (db-connection)
    ;  (sql/drop-table :documents) ; no need to do that for in-memory databases
      (sql/create-table :documents [:id "varchar(256)" "primary key"]
                                   [:title "varchar(1024)"]
                                   [:text :varchar]))

    (defn uuid [] (str (java.util.UUID/randomUUID)))

    (defn get-all-documents []
      (response
        (sql/with-connection (db-connection)
          (sql/with-query-results results
            ["select * from documents"]
            (into [] results)))))

    (defn get-document [id]
      (sql/with-connection (db-connection)
        (sql/with-query-results results
          ["select * from documents where id = ?" id]
          (cond
            (empty? results) {:status 404}
            :else (response (first results))))))

    (defn create-new-document [doc]
      (let [id (uuid)]
        (sql/with-connection (db-connection)
          (let [document (assoc doc "id" id)]
            (sql/insert-record :documents document)))
        (get-document id)))

    (defn update-document [id doc]
        (sql/with-connection (db-connection)
          (let [document (assoc doc "id" id)]
            (sql/update-values :documents ["id=?" id] document)))
        (get-document id))

    (defn delete-document [id]
      (sql/with-connection (db-connection)
        (sql/delete-rows :documents ["id=?" id]))
      {:status 204})

    (defroutes app-routes
      (context "/documents" [] (defroutes documents-routes
        (GET  "/" [] (get-all-documents))
        (POST "/" {body :body} (create-new-document body))
        (context "/:id" [id] (defroutes document-routes
          (GET    "/" [] (get-document id))
          (PUT    "/" {body :body} (update-document id body))
          (DELETE "/" [] (delete-document id))))))
      (route/not-found "Not Found"))

    (def app
        (-> (handler/api app-routes)
            (middleware/wrap-json-body)
            (middleware/wrap-json-response)))

Yeah, the whole program, with connection pooling and JSON encoding/decoding, in roughly 90 lines of (admittedly dense) code.
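
To exercise the API (assuming the server from lein ring server is still listening on localhost:3000), something along these lines should work; the id in the last call is whatever the POST returned:

    curl -X POST -H "Content-Type: application/json" \
         -d '{"title":"some title","text":"some text"}' http://localhost:3000/documents
    curl http://localhost:3000/documents
    curl -X DELETE http://localhost:3000/documents/<id-from-the-post>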

To sum it up: Clojure is fun, concise, and very powerful. Together with the excellent Java integration it ranks very high on my “languages I adore” list.

Programming is not about typing, it’s about thinking.
— Rich Hickey

REST Framework Survey (Java, Haskell, Go, Node.js)

Over the last few days I’ve experimented with various REST frameworks. The initial goal was to find The Language and framework to use for all future projects… Of course, there is no clear winner.

These were the contestants:

  • a Java server (using Jersey),
  • a Node.js server,
  • a Haskell server (using Scotty).

The client was written in Go, because I wanted it to be fast, wanted to make concurrent requests easily, and wanted to try Go.

The server spec is to provide a way to insert, query, update, list, and delete documents.

  • A POST to /documents should create a new document, generate a new UUID (v4) as ID and return the whole document.
  • A GET to /documents should return a list of all documents.
  • A GET to /documents/[ID] should return the document with the given ID, or 404 if it is not found.
  • A PUT to /documents/[ID] should update the document with the given ID and return it.
  • A DELETE to /documents/[ID] should delete the document with the given ID.

The documents are encoded in JSON:

{
    "id": "some id",
    "title": "some title",
    "text": "some text"
}

The client may or may not send the id field, the server ignores it and either generates a new one, or uses the one from the URL.

The client’s steps:

  1. Insert a document
  2. Update that document
  3. Delete that document

I used goroutines to make that concurrent and checked the consistency via a final GET /documents asserting an empty list.
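
The fan-out looked roughly like this - a simplified sketch, where numDocuments, insertDocument, and deleteDocument are assumed helpers analogous to the updateDocument function shown further down:

func runClient(numDocuments int) {
    var wg sync.WaitGroup // requires the sync package
    for i := 0; i < numDocuments; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            id := insertDocument() // step 1: create a document, returning its generated ID
            updateDocument(id)     // step 2: update it
            deleteDocument(id)     // step 3: delete it
        }()
    }
    wg.Wait() // afterwards, a GET /documents should return an empty list
}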

Step 1

As database backend, I used an SQLite3 in-memory database, connected via each platform’s SQLite driver (JDBC for Java, HDBC for Haskell).

The implementation was easy with Java and Node.js. I had my fair share of trouble with Haskell, because there is no clear REST framework to use; at least I couldn’t find the obvious choice. I went with Scotty because it’s simple and did what I needed, and my Haskell-fu has seen better days.

Here’s an inconclusive performance report1 (how long it took, in seconds, so longer is worse) for processing 10000 documents with the same client against each of the backends.

Node.js is a bit slower, Java and Haskell are pretty much on par.

Code

Here’s the code used to update (PUT) the new version of a document on the server. I’ve chosen the update case because it shows how to deal with path-parameters, as well as how to decode and encode JSON.

Node.js Code to wire PUT requests to the DB.

server.put('/documents/:docId', update); // define which function to call for that URL+method
...
function update(req, res, next) {
  var doc = req.body; // directly available as object through the BodyParser plugin
  doc.id = req.params.docId; // taken from the URL
  db.run("update documents set title = ?, text = ? where id = ?", doc.title, doc.text, doc.id);
  return read(req, res, next); // read the document or return 404
}

The Jersey annotations used to declare updateDocument as the handler for PUT requests. JSON en/decoding is fully transparent (because of the @Consumes/@Produces annotations).

@PUT // handle PUT requests
@Path("/documents/{id}") // on that URL
@Consumes(MediaType.APPLICATION_JSON) // accepts only JSON
@Produces(MediaType.APPLICATION_JSON) // writes out JSON
// params can be referenced and injected; the body is automatically decoded into the correct object
public Document updateDocument(@PathParam("id") String id, Document document) throws SQLException {
    // perform the database stuff (in a data access object, as is customary in Java)
    Document updated = documentDao.update(id, document);
    if (updated == null) {
        throw new NotFoundException(); // produces the 404 status
    }
    // the returned object is also automatically encoded
    // and all headers are set correctly
    return updated;
}

The Haskell version is very concise, but with the various Monad-layerings a bit opaque (especially the DB code):

put "/documents/:id" $ do -- the URL this function is defined for
  id <- param "id" -- extract the parameter from the URL
  inputDocument <- jsonData -- parse the JSON body
  doc <- liftIO $ updateDocument conn id inputDocument -- write the stuff into the DB
  resultOr404 doc -- return the document or 404

-- for reference on how to deal with Maybe
resultOr404 :: Maybe Document -> ActionM ()
resultOr404 Nothing  = status status404 -- return 404 without a body
resultOr404 (Just a) = json a -- return JSON (also setting the content type)

Actually, the only server checking for the existence of the document is the Haskell variant. Simply because the type system enforces it!

Finally, here is the corresponding Go client code:

func updateDocument(id string) {
    doc := Document{id, "New Title", "New Text"} // set the new content
    jsonReader := getJsonReader(doc) // encode into JSON
    req, _ := http.NewRequest("PUT", base + "/" + id, jsonReader) // prepare the call
    req.Header.Add("Content-Type", jsonType) // set the correct content-type
    res, _ := client.Do(req) // execute the call
    if res != nil { // check for a response
        res.Body.Close() // close the response stream
    }
    // yeah, I ignore errors...
}

Step 2

After trying out the services with the in-memory database, I got curious and wanted to see how they’d perform using a PostgreSQL database. So I switched the database layer of each server to PostgreSQL.

The switch was pretty easy for Java and Haskell (a matter of exchanging the database driver and connect-string). For Node.js I had to rewrite more or less the whole app since there seems to be no standard interface for DB access.

The client does not need to change.

The performance of the REST services with Postgres instead of SQLite is shown below. This time the numbers show the time needed to process 1000 documents. I wasn’t patient enough to wait for 10000 documents to finish – I took the average of three runs.

This time, Node blew Java away; Haskell was also significantly slower than Node. My guess is that the JDBC and HDBC abstractions take their fair share of overhead, but Java’s extreme case might have another cause (I haven’t investigated here).

Conclusion

Writing v1

In general, writing the server in Java was quite easy (I’m used to that), and Node.js was easy too (chaining was hard). Haskell took me more time than the other two servers and the Go client together. I’d love to add Clojure and a Go server to the mix and see how they perform.

All three contestants work well and are reasonably fast. I never noticed Java’s problem with Postgres as shown here, but I rarely use JDBC directly these days.

Update to v2

The switch from SQLite to Postgres was pretty painful in Node.js. For Java and Haskell the switch was easy and fast.

Deployment

To deploy a Java webapp, you need a servlet container (I used Jetty) and deploy it within that. The WAR file usually includes all necessary dependencies. Every Node.js app starts its own server on its own port, so there isn’t much to it: install Node.js, install the required libraries with npm, profit! Haskell produces a single binary which can be copied to the destination. Starting the binary also starts the webserver, so it can be used directly or via a proxy. There are still some dynamically linked libraries which have to be present on the target machine.

The nice thing about Java is that all necessary dependencies are bundled with the app itself. Haskell and its type system also check whether newer versions are still compatible; for Node.js it seems that our only option is to perform adequate testing.

Summary

This sums up my endeavour: I had the most fun writing the Go client, but the Haskell service has the fewest possibilities for error. Haskell is really, really cool for writing REST services, and performs very well. Node.js is very young and provides very good productivity and performance for getting something up and running, but (to me) maintaining Node.js code seems like something I wouldn’t want to do. Java is a compromise; nobody ever got fired for using Java (except, maybe, someone on the Android team).

  1. What else to do with three similar REST services?

SQL, Lisp, and Haskell are the only programming languages that I’ve seen where one spends more time thinking than typing.
— Philip Greenspun

Programming Languages in Joy and Sorrow

Most of us (programmers) know, and need to know, many programming languages. Some aren’t even perceived as programming languages any more (shell scripts), some make us a living (Java, C#, etc.), some are hard to replace (JavaScript), and some are just fun to play with (make your choice).

What makes programming languages differ is not syntax; syntax is nothing more than a mechanical translation, much like a cipher: back and forth without gaining or losing information.

General-purpose languages don’t even differ in what they’re able to express. All of them are “Turing Complete”1 and hence are equally potent.

What makes a difference is what programming in these languages feels like. Even though Assembler and JavaScript are equally potent (in theory), they are two very different beasts.

What appeals to me is just a handful of properties of a language:

  • simplicity,
  • expressiveness,
  • performance,
  • productivity.

Simplicity

If a programming language is complicated by itself, programs become brittle, programmers are distracted, there is no uniform structure, etc. C++ is a wonderful and very expressive language, but it has so many ways of doing things, and so many features to learn (and be distracted by), that I would not care to use it again.

Simplicity was one of C’s big appeals, the language is very lean and has a small set of features. The language features were carefully chosen to be general enough to not be too restrictive.

Expressiveness

How much code do I need to write to get a job done? How good are the means of abstraction (how often do I need to repeat myself)?

Assembler doesn’t have much abstraction capabilities (mostly jumps), whereas functional languages offer powerful ways of separating, combining, and reusing code blocks.

A simple and expressive language is easy to understand, and has few, but powerful, means for abstraction.

Performance

Computers used to get faster every year; currently we’ve reached a plateau, and instead of scaling up, we’re scaling out by adding cores and machines.

Our programs don’t get faster by waiting for a better machine anymore. We need to actively take additional cores into consideration. The future (and probably the present) is distributed!

I think performance is still a major merit of today’s software. If your service can’t scale to “internet scale”2, you’ll lose. If your competitor offers the same set of features, but twice as fast, you’ll lose.

So a modern programming language does not need to be the fastest one on a single machine and a single core, but it should be reasonably fast and scale easily to threads and processes.

Productivity

Computers have become insanely fast, but programmer productivity stayed the same. Sure, we can’t call Intel for a brain-upgrade, but we can choose and provide the right set of abstractions to support us.

Besides the simplicity and expressiveness of a language, productivity involves the available libraries, the community, and the culture of the community. Java, for example, has a very active open source culture, quite untypical for a business-related language.

“Academic” (or, let’s call them “non-mainstream”) languages are often very beautiful and expressive, but their lack of practical libraries makes it hard to get up and running quickly. We’re in an age of quick-fixes and easy gains – what’s the point of choosing the newest language du jour if we can’t deliver faster (in the long run)? What’s the point of “engineering” if, in the end, we need even more time than by hacking3?

Productivity, for me, heavily affects the fun in programming and mostly subsumes simplicity and expressiveness.

Language assessment

In a business setting you rarely have a free choice of weapons, but for personal or pet projects you do. Most people I know stick to the language from work because it is well known. Some are increasingly unsatisfied with their day-job language and set out in search of their “own” language. I can’t tell you what the best language is. Everyone is different, as Yukihiro Matsumoto said here:

No language can be perfect for everyone. I tried to make Ruby perfect for me, but maybe it’s not perfect for you. The perfect language for Guido van Rossum is probably Python.

You are different from everyone else. Embrace the difference. Use your brain and make your own choice.

Of course, given the sheer amount of programming languages, making an educated guess is crucial. Taking only the 100 most popular languages into account is surely not too far off.

Go Philosophical

Languages are divided into a few categories concerning programming paradigm (object oriented, functional, procedural…), typing (strong, weak, static, dynamic) etc.

My personal preference is with statically typed functional languages because they provide very good abstractions and safety through the compiler.

Look for a language with good interop functionality.

We’ve invested a lot of time in building libraries, helpers, utilities etc. Of course starting from scratch is fun, but it’s rarely economical.

If you used to program in Java, start looking for JVM languages for example.

Read some code before diving in

Typing.io is a nice starting point for some languages, and of course, there is always github.

Do you feel comfortable with the layout, the language, does it feel natural? Do you even understand some of it?

Enough with the subjective assessment, show me something objective!

Here is an unbiased and objective survey of a few programming languages I’m interested in. It’s totally foolproof and you certainly should start a multimillion dollar company on it!

This chart shows the ratio of the number of search results for

"c programming" "sucks"

and

"c programming" "rocks"

as returned by Yahoo!.

That is, "c programming" "sucks" returns 20800 results, "c programming" "rocks" returns 222000 results, hence a ratio of 10.6731 in the chart. Thus, everything below 1 means, there are more “sucks” results than “rocks” results.

Since Haskell, Clojure, and Scala completely dominate the chart, here is a version with the obvious winners removed:

The numbers were computed using Google Docs - Spreadsheets and its awesome ImportXML function:

=ImportXML("http://search.yahoo.com/search?p="&C4,"//span[@id='resultCount']")

Cell C4, referenced in the above code, contained the URL-encoded search string, for example %22c programming%22 %22rocks%22. Try it with prolog ;-).

  1. Even some configuration files are considered Turing Complete, like sendmail’s.

  2. Whatever internet scale is in your domain.

  3. You’ve got to measure the whole lifetime of a project, it doesn’t help to get something working very quickly but with huge amounts of bugs, hard to change etc. I guess that’s a topic for a whole book on its own.

Measuring programming progress by lines of code is like measuring aircraft building progress by weight.
— Bill Gates

In the beginning was the Test...

JUnit offers many features besides the standard assertTrue/assertEquals methods most programmers use. Let’s browse through the newer and more exotic features. They might come in handy at some time.

JUnit 4.9 Feature Roundup

Assuming you know something about unit testing and JUnit in particular, I won’t start at the very bottom, but talk a little about the features introduced during the last few versions:

  • Matchers
  • Assumptions
  • Categories
  • Theories
  • Rules

I hope there is something new in here for you. JUnit’s javadoc documentation is very good, but there is no single place describing these features. It’s not my goal to give a thorough treatment of them here, but it might be a good starting point.

Matchers

By including Hamcrest (core) in the default JUnit distribution, JUnit now allows the use of assertThat, leading to tests that are much easier to read and to better error messages:

@Test
public void testUsingAssertThat() {
  assertThat(42, is(greaterThan(43))); // note, this will fail
}

JUnit includes only the Hamcrest core matchers; if you want or need more matchers, include hamcrest-all 1.1. The included matchers are documented here for Hamcrest and here for the JUnit additions.

Output:

java.lang.AssertionError: 
Expected: is a value greater than <43>
     got: <42>

Assumptions

Assumptions allow tests to be ignored if the assumed condition isn’t met (instead of failing).

This test will be ignored if it is run on a Windows OS (for example):

@Test
public void testUsingAssumeThat() {
  assumeThat(File.separator, is("/"));
  ...
}

It is also possible to use assumptions in @Before or @BeforeClass methods.

Output (for example):

Test 'org.interlinked.junit.assumption.BasicTest.testUsingAssumeThat' ignored
org.junit.internal.AssumptionViolatedException: got: "\", expected: is "/"
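
A minimal sketch of an assumption in @Before (databaseIsReachable is a hypothetical helper; if it returns false, every test in the class is ignored):

@Before
public void checkPreconditions() {
  assumeTrue(databaseIsReachable());
}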

Categories

Using categories it is possible to run only a subset of the tests. For example slow tests, integration tests etc.

Here, TestCategoryA and TestCategoryB are empty interfaces used to mark the tests:

@Test
@Category(TestCategoryA.class)
public void testCatA() {
  System.out.println("Category A test");
}

@Test
@Category(TestCategoryB.class)
public void testCatB() {
  System.out.println("Category B test");
}

@Test
@Category({ TestCategoryA.class, TestCategoryB.class })
public void testCatAB() {
  System.out.println("Category A and B test");
}

Using the Categories suite, we can now execute only those tests that are in “Category A”, but not in “Category B”:

@RunWith(Categories.class)
@Categories.IncludeCategory(TestCategoryA.class) // this would run tests CatA and CatAB
@Categories.ExcludeCategory(TestCategoryB.class) // now test CatAB is excluded too
@Suite.SuiteClasses(BasicTest.class)
public class CategoryASuite { }

Output:

Category A test

Theories

With theories we can write parameterized tests. We define a few theories and some datapoints. JUnit will match the types of the datapoints and the theories.

Again, we have to use a special suite class Theories:

@RunWith(Theories.class)
public class TheoryTest {
  @DataPoint public static final String POINT1 = "POINT1";
  @DataPoint public static final String POINT2 = "POINT2";

  // mind the plural!
  // uses only the items of the array, never the whole array!
  @DataPoints public static final String[] POINTS = new String[] {"abc", "cde", "efg", "ghi"};

  @DataPoint public static final String[] POINTS_ARRAY = POINTS;

  @Theory
  public void testTheory(String param) {
      System.out.println("Got: " + param);
  }

  @Theory
  public void testTheoryWithTwoParams(String param1, String param2) {
      System.out.println("Got " + param1 + " and " + param2);
  }

  @Theory
  public void testArray(String[] array) { // gets called with POINTS_ARRAY, nothing else
      System.out.println("Got called...");
      assertThat(array.length, is(equalTo(POINTS_ARRAY.length)));
  }
}

Output:

Got POINT1 and POINT1
Got POINT1 and POINT2
Got POINT1 and abc
Got POINT1 and cde
Got POINT1 and efg
Got POINT1 and ghi
Got POINT2 and POINT1
Got POINT2 and POINT2
Got POINT2 and abc
Got POINT2 and cde
Got POINT2 and efg
Got POINT2 and ghi
Got abc and POINT1
Got abc and POINT2
Got abc and abc
Got abc and cde
Got abc and efg
Got abc and ghi
Got cde and POINT1
Got cde and POINT2
Got cde and abc
Got cde and cde
Got cde and efg
Got cde and ghi
Got efg and POINT1
Got efg and POINT2
Got efg and abc
Got efg and cde
Got efg and efg
Got efg and ghi
Got ghi and POINT1
Got ghi and POINT2
Got ghi and abc
Got ghi and cde
Got ghi and efg
Got ghi and ghi
Got called...

Rules

Finally, rules allow us to add behaviour to tests. They can be thought of as a kind of AOP for JUnit. Using rules, we can often omit class hierarchies and still reuse functionality via delegation.

JUnit includes some rules to start with, but it is very easy to write our own rules.

public class RuleTest {
  @Rule public TemporaryFolder temporaryFolder = new TemporaryFolder();
  @Rule public TestName testName = new TestName();
  private static boolean fileCreated = false;
  @Rule public LoggingRule loggingRule = new LoggingRule();

  @Before
  public void printTestName() {
    System.out.println(testName.getMethodName());
  }

  @Test
  public void testCreatingAFile() throws IOException {
    File newFile = temporaryFolder.newFile("test1");
    assertThat(newFile.isFile(), is(true));
    fileCreated = true;
  }

  @Test
  public  void testCheckIfItExists() { // depends on testCreatingAFile...
    assumeTrue(fileCreated); // just to be sure ;)
    File file = new File(temporaryFolder.getRoot().getAbsolutePath() + "/test1");
    // the file should not exist (unless we use ClassRule for the TemporaryFolder, for example)
    assertThat(file.isFile(), is(false));
  }

  @Test(expected = NullPointerException.class)
  public void testThrowException() {
    throw new NullPointerException();
  }
}

Output:

Starting: testCreatingAFile
testCreatingAFile
Finished: testCreatingAFile
Starting: testCheckIfItExists
testCheckIfItExists
Finished: testCheckIfItExists
Starting: testThrowException
testThrowException
Finished: testThrowException

The TemporaryFolder and TestName rules are included in JUnit, the LoggingRule is a simple example:

public class LoggingRule extends TestWatcher {
  @Override
  protected void starting(Description description) {
    System.out.println("Starting: " + description.getMethodName());
  }

  @Override
  protected void finished(Description description) {
    System.out.println("Finished: " + description.getMethodName());
  }
}

Other rules included (see JUnit’s javadoc):

  • ErrorCollector: collect multiple errors in one test method
  • ExpectedException: make flexible assertions about thrown exceptions
  • ExternalResource: start and stop a server, for example
  • TemporaryFolder: create fresh files, and delete after test
  • TestName: remember the test name for use during the method
  • TestWatcher: add logic at events during method execution
  • Timeout: cause test to fail after a set time
  • Verifier: fail test if object state ends up incorrect
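
As a small taste, the ExpectedException rule allows more precise checks than @Test(expected = ...); a minimal sketch:

public class ExpectedExceptionTest {
  @Rule public ExpectedException thrown = ExpectedException.none();

  @Test
  public void testThrowsWithMessage() {
    thrown.expect(NullPointerException.class);
    thrown.expectMessage("name must not be null"); // the message is checked too
    throw new NullPointerException("name must not be null");
  }
}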

Unfortunately, rules seem to be local to the defining class, so you can’t put them into the suite class like @Before and @BeforeClass (which would be really nice for opening external resources once for all tests).

Misc additions

Infinitest

For each change you make, Infinitest runs all the dependent tests. It’s continuous testing for Eclipse and IDEA - free and open source (written by Improving Works)!

ClasspathSuite

Most IDEs have their own ways for finding test classes to run, but usually I like to be IDE independent. Using the ClasspathSuite it is possible to have JUnit detect all test classes (or a subset of them) within the classpath (written by Johannes Link). There are efforts to include it into the standard distribution of JUnit.

Every programmer knows they should write tests for their code. Few do. The universal response to “Why not?” is “I’m in too much of a hurry.” This quickly becomes a vicious cycle: the more pressure you feel, the fewer tests you write. The fewer tests you write, the less productive you are and the less stable your code becomes. The less productive and accurate you are, the more pressure you feel.
— Kent Beck/Erich Gamma – JUnit Test Infected

Notebooks (dead wood)

Moleskines were all the rage about six years ago. Whole blogs were, and still are, dedicated to them. Today, many people still carry these little black books.

I too loved these little books, but in the last few years I’ve changed some parameter of my journal every time I needed a new one. I changed the brand and the size, and I used soft and hard covers and plain, squared, or ruled paper (they were always black, though).

It was a fun experiment, but now I’ve come to a conclusion about what suits my needs best: Leuchtturm 1917, either dotted or plain.

The Master is large enough for keeping full A4 pages. I use it for planning, sketches, technical stuff, etc. Most of the time it sits on my desk and waits to be filled - usually one to two pages per day.

The Medium is just the right size for a journal. It fits easily in almost every bag, is large enough for notes, quotes, thoughts and so on. The smallest format, A6, is also quite nice. But for me, the Medium is simply the best choice.

The following sounds like an advertisement, but since the tagline of the Leuchtturm 1917 company is “Details machen den Unterschied” (“Details make all the difference”), I have to include some of their noteworthy details:

  • Ink-proof paper - earlier editions of the notebooks used a thinner paper (much like the paper used in Moleskine notebooks), but recent editions use an 80 g/m² paper - perfectly suited for a fountain pen.
  • The “Dotted” variant is a nice compromise between plain and square, it’s unobtrusive and still provides enough guidance.
  • Numbers and Index - the pages are numbered and a small index is printed on the first few pages.
  • Detachable sheets - some companies offer notebooks with detachable sheets, sometimes you can rip out half of the notebook. Leuchtturm only has 8, which is more than I ever ripped out of a notebook, so I consider this a plus.
  • Labelling stickers - quite nice for organisation-fanatics, during usage we have the slick plain look of the black notebook we all love, but for archiving purposes there are a few labelling stickers in each notebook package.

Even though we have access to email, note-taking applications, the Web, etc. almost all the time, I still love the classy old feeling of a notebook and (in my case) a fountain pen.

I’ll end this post with a few links for your reading/tinkering pleasure.

So often is the virgin sheet of paper more real than what one has to say, and so often one regrets having marred it.
— Harold Acton

How to Hide Google's "Google Plus Notification Count"

To hide the Google Plus notifications on the upper right of Google’s services, install Adblock (Adblock Plus for Firefox or Adblock for Chrome) and add this rule to your ruleset:

##A#gbg1

The notification count is a link (A) with id gbg1, this rule hides that element. Let me know if it works for you, or not.

Happiness can only be found if you can free yourself of all other distractions.
— Saul Bellow

JavaScript

Few languages are so clearly worth learning as JavaScript.
– It’s an interesting language that doesn’t restrict how you use it.
– It’s a language you get paid for developing in.
– It’s still cool (who cares about Java anymore?).
– It’s here to stay (for some time, though).

So, to sum it up, I believe investing in JavaScript pays off. Even if you’re an “enterprise developer” or something. JavaScript will get you, sooner or later, so get it first!

Continue to full post...

How to dive into Legacy Code

Diving into legacy code written some time ago can be a daunting task. It doesn’t even matter much whether we wrote it ourselves or somebody else did; code rots faster than we’d like to admit.

Currently faced with such a task, I tried to do it in a systematic and repeatable manner.

My steps

First, try to find the modules and their dependencies. I used IntelliJ IDEA for my current Java project; since it also uses Maven, finding the dependencies was easy.

Create a graph of module interdependencies. Which modules depend on which? Find the “edges” of the system (modules that do not depend on other modules). I found them to be the best starting point for a more detailed analysis.

Find out what the purpose of each module is. Is it a layer in the system (like a DAO-module)? Is it a cross-cutting concern (model classes)?

The next step is to analyse each module by itself. For this step I recommend using doxygen. It can generate a very good documentation of the software at hand, even if no doxygen (or any other type of markup) comments were used, by analysing the dependencies, class hierarchies, and call graphs of the program. doxygen supports many languages, so chances are high yours will be covered too.

To get the most out of doxygen, I’ve used the following configuration file which enables many of the advanced analysis features (like call graphs etc): doxygen.config.

You have to edit the file and provide - at least - the input and output directories! After that, it’s simply a matter of running doxygen doxygen.config.

To generate this kind of documentation easily from Maven, here is a similar doxygen-maven-plugin configuration:

    <build>
        <plugins>
            <plugin>
                <groupId>com.soebes.maven.plugins.dmg</groupId>
                <artifactId>doxygen-maven-plugin</artifactId>
                <configuration>
                    <projectName>${project.artifactId}</projectName>
                    <projectNumber>${project.version}</projectNumber>
                    <optimizeOutputJava>true</optimizeOutputJava>
                    <extractAll>true</extractAll>
                    <extractStatic>true</extractStatic>
                    <recursive>true</recursive>
                    <exclude>.git</exclude>
                    <excludePatterns>*/test/*</excludePatterns>
                    <inlineSources>true</inlineSources>
                    <referencedByRelation>true</referencedByRelation>
                    <referencesRelation>true</referencesRelation>
                    <hideUndocRelations>false</hideUndocRelations>
                    <umlLook>true</umlLook>
                    <callGraph>true</callGraph>
                    <callerGraph>true</callerGraph>
                    <generateLatex>true</generateLatex>
                </configuration>
            </plugin>
        </plugins>
    </build>

The generateLatex option is nice if you wish to produce PDF files (for viewing on a Kindle for example).

With this plugin configured in your pom.xml, mvn doxygen:report is your workhorse.

If you’re unsure if the generated documentation is worth it, take a look at the doxygen documentation of JUnit 4.8.2 (zip file, 7.6MB).

Everybody writes legacy code.
— Eric Ries

Haskell Books and Tutorials

Haskell is really a language very much worth knowing. It does many things very differently from most other languages, which I really enjoy.

I’d like to use this post to mention a few really good resources for learning Haskell:

Learn you a Haskell

Learn you a Haskell is a very fun and entertaining tutorial for Haskell, very much in the spirit of Why’s poignant guide to Ruby.

It isn’t finished yet, but it’s really good to start with. (Via Wadler’s Blog)

Real World Haskell

Real World Haskell is an upcoming book I really look forward to. It is freely available on its website, so check it out.

Wikibook

There is also a good collection of Haskell topics on Wikibooks – Programming/Haskell

The wikibook is more interesting if you already know a bit of Haskell and would like to understand it more in depth.

Others

I liked Yet another Haskell Tutorial very much; it aims at people with a bit of background, though.

Another more recent book on Haskell is Programming in Haskell which is nice, but also quite basic IMO.

Of course, there is always the Gentle introduction to Haskell.

Enjoy.

Haskell is doomed to succeed.
— Sir Tony Hoare

Git Overview

My version control journey started with CVS; after that I looked at SVN, but never really used it. The shortcomings of centralized repositories were too obvious, and with my increasing interest in Haskell I jumped on the distributed version control train with Darcs. I really, really liked it, but it had some nasty things too. After a while I was looking for something different and stumbled over Mercurial; again I was really happy with it, but somehow my journey wasn’t over yet.

Continue to full post...

Mercurial Version Control Status in the ZSH Command Line Prompt

I sometimes forget to push or pull changes to or from a remote repository. To remedy the problem I wrote myself a little script to show me the status on the prompt.

Continue to full post...

Free and Open Source Math Programs

In this post I’ll go through some of the most prominent math programs available with source code.

This is by no means “original” work; I just collected the headlines and links to various mathematical software projects out there. This started with an offer from my university (Mathematica for 13 Euro), but I don’t want to invest time in a tool I won’t have (free, or almost free) access to for the rest of my life.

Continue to full post...

Emacs Basics

It’s been a while since I wrote my Vim Introduction and Tutorial (exactly one year). A lot happened between now and then; for example, I chose to get a better feeling for Emacs.

The reasons aren’t easily explained; the most prominent one is the awesome AucTex mode, since I’m working heavily with LaTeX lately.

Anyways, learning Vim and Emacs is better than learning only one of them :-).

Continue to full post...

Addendum to "Time Machine for every Unix out there"

My article about using rsync to mimic the behavior of Apple’s Time Machine generated a lot of traffic and, more importantly, a lot of feedback.

In this article I’ll summarize and try to clarify a few things.

Continue to full post...