Thursday, August 30, 2012

nSuneido DLR Spike

After getting a C# version of the lexer working I decided to take a stab at the parser.  Suneido uses a hand-written recursive descent parser so it's pretty simple code to port.

I've been doing some reading on the dynamic language runtime (DLR) so before implementing the complete parser I decided I'd push a spike through to actual code generation.

This turned out to be remarkably easy. Rather than tie the parser directly to the DLR methods, I created a ParseOutput interface. For testing I wrote a simple AstNode class and a corresponding AstOutput. Once the parsing was working I added a DyExprOutput class that creates a DLR Expression tree, which you can then compile via Expression.Lambda (see the tests at the bottom of DyExprOutput).
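The shape of it is roughly like this - a simplified sketch of the idea, not the actual nSuneido code (the real classes handle far more than binary integer arithmetic, and these signatures are my assumptions):

using System;
using System.Linq.Expressions;

// The parser emits through this interface, so it isn't tied to the DLR.
// An AstOutput implementation would build AstNodes instead.
public interface ParseOutput
{
    object Literal(int n);
    object Binary(string op, object left, object right);
}

// Builds a DLR expression tree rather than an AST.
public class DyExprOutput : ParseOutput
{
    public object Literal(int n)
    {
        return Expression.Constant(n);
    }
    public object Binary(string op, object left, object right)
    {
        var x = (Expression)left;
        var y = (Expression)right;
        switch (op)
        {
        case "+": return Expression.Add(x, y);
        case "-": return Expression.Subtract(x, y);
        case "*": return Expression.Multiply(x, y);
        default: throw new NotSupportedException(op);
        }
    }
}

public static class Demo
{
    public static void Main()
    {
        // what the parser would produce for "2 + 3 * 4"
        var output = new DyExprOutput();
        var expr = (Expression)output.Binary("+", output.Literal(2),
            output.Binary("*", output.Literal(3), output.Literal(4)));
        // compile the tree to a callable delegate
        var fn = Expression.Lambda<Func<int>>(expr).Compile();
        Console.WriteLine(fn()); // 14
    }
}

The nice part is that swapping output implementations switches between building an AST for testing and generating runnable code, without touching the parser.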

source code at GitHub (the link is to this specific version; you can navigate to the latest)

This only handles very simple expressions - math on integer literals. (It doesn't even handle as much as the partial parser does.) But getting that far in Java took me way more work.

I found one book on this - Pro DLR in .Net 4 - which has helped. There's also quite a lot of material on-line.

Another nice feature is that the DLR is just a library - it doesn't involve any .Net internals. And even better, it's open source on CodePlex. It's troubling that it hasn't been worked on since 2010, but since the "dynamic" keyword in C# uses the DLR, I assume it won't be going away.

So far, I'm still quite positive about C# and the possibility of nSuneido. My primary motivation is that cSuneido is falling behind jSuneido, but I'm really not keen on doing a lot of work on the C++ code. And even if I did, it wouldn't solve the garbage collection issues. And we can't switch to jSuneido for the client because of the current Windows specific GUI.

Sunday, August 26, 2012

nSuneido ?

It's been a long time since I've looked very closely at C# and .Net so when I saw a special on the C# 5.0 Pocket Reference I figured I should catch up. I read it cover to cover, and surprisingly (to me) I was quite impressed. C# has come a long way since version 1. It's certainly moved much faster than Java. In my recent System Programming Languages post I discounted C#, but I'm starting to wonder if that was a mistake.

* "nSuneido" comes from the convention of using 'n' for .Net projects e.g. NUnit, NHibernate, etc.

I also picked up C# in Depth by Jon Skeet (of StackOverflow fame). It was perfect for me because it covered the evolution of the language, not just how to use it.

One concern with C# and .Net is portability. But Mono seems to be in good shape and in active development. So far all my playing with C# has been on the Mac with Mono and MonoDevelop. Of course, Mono lags somewhat behind - it's on .Net 4.0, whereas the latest is .Net 4.5. And its performance isn't quite as good, but it's not bad.

There are a lot of things I like about C# and .Net (in no particular order):
  • It meets my mandatory entry level requirements of safety and garbage collection
  • Unlike Java, it has value types (in addition to reference types)
  • Like Java, .Net has a VM and IL (byte code) that Suneido can target
    (whereas with native languages like C++ Suneido has to implement its own VM)
  • Stable, robust, performant platform
  • Rich ecosystem of books, tools, libraries, developers, etc.
  • Easier to find programmers who know C#
  • Extension methods 
  • Class members and fields default to private 
  • Decimal floating type (one less thing to implement)
  • Lambdas :-)
  • Delegates (method handles) 
  • Named and optional arguments
  • Type inference (var) although only for local variables
  • Dynamic language runtime (DLR) 
  • Good access to Windows API's (PInvoke)
  • Annotations
  • Anonymous types and tuples
  • Operator overloading
  • Nullable (a bit like Scala Option)
  • Unsafe code if needed
  • Good IDE support
  • Generics on unboxed primitives (unlike Java; no need for Trove collections)
  • Reified generics (unlike Java)
  • LINQ
There are, of course, some things I don't like. I'd really prefer optional semicolons (like Scala and Go), but I can live with them. Capitalizing public methods and fields takes a little getting used to, even though it's similar to Suneido. I miss Java's static imports - there's no way (that I've found) to have something you can call as simply Foo(); it always has to be Something.Foo().
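For example (hypothetical names, just to illustrate):

// C# (as of 5.0) makes you qualify the call at every call site.
static class Util
{
    public static int Foo() { return 42; }
}

class Caller
{
    int x = Util.Foo();  // fine
    // int y = Foo();    // no equivalent of Java's "import static Util.*;"
}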

With Java I could get almost all the tools and libraries I needed open source. With C# the ecosystem seems a lot more commercial - Visual Studio and popular add-ons like ReSharper and Reflector are all commercial products.

I'm maybe just not up to speed on the C#/.Net ecosystem, but there don't seem to be as many good libraries available. I would miss Guava. I need to learn more about what is and isn't available in the standard .Net libraries.

Once more, I started by implementing the Suneido lexer. This is turning into a coding kata for me, but I still find ways to improve or simplify each time I re-implement it. I took advantage of a number of C# features like extension methods and even lambdas. The code is on GitHub along with the D version. By the time I finished this exercise in D I had lost much of my enthusiasm and was somewhat disillusioned. In contrast, after the C# implementation I was still quite keen. In the long run that probably doesn't mean much, but it's a good sign. As much as we might like these decisions to be objective, it often comes down to subjective factors.
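To give the flavor of it, here are hypothetical helpers in the spirit of that code, not the actual version on GitHub:

using System;

static class CharExtensions
{
    // extension method - reads as c.IsWordChar() at the call site
    public static bool IsWordChar(this char c)
    {
        return char.IsLetterOrDigit(c) || c == '_';
    }
}

class Lexer
{
    readonly string src;
    int pos;

    public Lexer(string src) { this.src = src; }

    // a lambda parameterizes the scanning loop
    string MatchWhile(Func<char, bool> test)
    {
        int start = pos;
        while (pos < src.Length && test(src[pos]))
            ++pos;
        return src.Substring(start, pos - start);
    }

    public string Identifier()
    {
        return MatchWhile(c => c.IsWordChar());
    }
}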

Ideally I would have just a single implementation of Suneido. (There are sometimes advantages to multiple implementations, but it's a lot more work.) Although the Java version is great for larger servers, it's not well suited to the current Windows GUI. And it's not great for smaller servers (e.g. smaller EC2 instances). An intriguing possibility is using Scala on both the JVM and .Net. I really like Scala, and the idea of writing one implementation that could be compiled to run on either the JVM or .Net is very attractive. The .Net version of Scala isn't production ready yet, but it's something to keep an eye on.

LastPass + Safari + Mountain Lion problem solved

Since I upgraded to Mountain Lion, every time I opened Safari I got: "Safari 6.0 has not been tested with the plugin LastPass 1.73.0". Except I wasn't using LastPass in Safari, and 1.73 is a really old version. Safari seemed to work fine, but it was annoying, even though I don't use Safari as my primary browser (I mostly use Chrome, and Firefox for LogMeIn). In case anyone else out there has the same problem, I'm posting how I solved it.


I tried to track it down several times with no success. Safari > Preferences > Extensions didn't show anything. I couldn't find anything useful searching on the web. Searching for "lastpass" files didn't uncover anything. I tried deleting extensions.plist and the Safari cache.

Finally I saw a suggestion (for an unrelated problem) to download LastPass and run the uninstaller. The closest to 1.73 I could find was LastPass 1.75 on the LastPass old downloads page. I opened the dmg and ran the uninstaller, and that solved the problem. Now Safari opens with no error messages. I still don't know where the extension was lurking though! (I even checked the Trash after running the uninstaller, but couldn't see anything.)

Wednesday, August 22, 2012

D-etractions

I love the potential of the D language, but I have to admit I'm becoming a little disillusioned with the current reality. This is, of course, extremely subjective. I don't want to get in a fight with D supporters. I agree it's a great language. To me, most of the issues revolve around a lack of maturity and a small user base. If I was just playing, that wouldn't be a big deal. But for an implementation language for Suneido, I'm looking for something rock solid. Java may be old and boring and "weak", but it's solid.

One of the big issues for me is the lack of tools, especially good IDE support including refactoring. You know this is unlikely to be a priority when the main developers are old school text editor users. (In my mind I can hear them saying "we don't need no stinking IDE".) Scala had the same problem until recently, also due to the main developers not using IDEs. But lately they've made IDE support more of a priority, perhaps not coincidentally around the time a commercial entity (Typesafe) was formed to promote Scala. I've done my share of old school editor + command line programming, but to me, a large part of the benefit of statically typed languages is that they allow powerful IDE manipulations.

A similar issue is the meager ecosystem e.g. books, libraries, tools, etc. I look at the libraries and tools I use with Java and few, if any, are available for D.

One thing that makes me a little nervous is D's fairly simple (from what I've seen) garbage collector. When I think about the effort that has gone into the JVM garbage collectors, I wonder how D will compare. D's garbage collector is also conservative in at least some situations (I'm not sure of the details). Based on my experience with conservative garbage collection in cSuneido, this increases my nervousness.

D's templates, compile time function evaluation, and mixins also worry me. They are very powerful and very cool, and they might be way better than C++ templates, but they're still complex. Of course, it's the usual story: you judge a new language as a newbie, so you don't necessarily have the basis to form a good opinion. But we seldom have the time to become expert before making a decision. I do have a fair amount of experience with C++ templates to judge by. Templates, as in D and C++, are undeniably powerful - much more so than the generics in Java or C#. But I wonder if they are on the wrong side of the power / complexity balance. And these advanced features appear to be absorbing much of the energy of development, to the neglect or even detriment of the overall maturity.

I love the potential of explicit "pure" and "immutable". I wish other languages like Java and C# had more of this. But from the meager exposure I've had, the reality is not as nice. I'm sure some of that is simply learning curve. And some of it may be improved by future language improvements. But when you can't put immutable values in a container without wrapping them in a "Rebindable" template, that again makes me nervous.

Of course, you could use a subset of D and try to ignore templates etc., but this is hard psychologically (I got sucked into mixins just to implement lexer tokens!), and also because the libraries make heavy use of templates. Then you'd primarily be consuming templates, not writing them, but you still need some understanding even to use them.

One of my concerns with Suneido implementation is making it easier for other programmers to get involved in the implementation. (i.e. improving the bus factor) In that respect, I think the jSuneido Java code is much better than the cSuneido C++ code (with templates!). And it's probably easier to find Java programmers than C++ programmers, let alone D programmers. (Of course, the challenge of finding good programmers remains.)

I plan to keep an eye on D and hopefully continue to play with it a bit. I wish the project good luck. But for now, I am going to put on hold any ideas of using it to replace cSuneido.

See also: D-tours and System Programming Languages

Saturday, August 18, 2012

Btree Range Estimation

Suneido's database query optimization needs to select which indexes to use. Part of this is estimating what portion of a table is spanned by a given range of index keys.

My assumption in the past was that a crude approximation was sufficient, since usually the best index is obvious. So the estimation would only look at one or two of the top levels of the btree.

But recently, as a result of some performance work on our applications, we've found a number of cases where this isn't true, where you need a more accurate estimate in order to choose the best index. This is usually with large tables, e.g. millions of records. With big tables like this, only looking at the top one or two levels of the tree is not sufficient to differentiate e.g. an index range of a few records versus an index range of thousands of records. In this situation, the "wrong" choice can make quite a difference to performance.
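To put rough numbers on it: with a fanout of 100, a table of a million records is only about three levels deep. The root level alone resolves a key's position to within one slot - about 1/100 of the table, or 10,000 records - so at that granularity a range of 50 records is indistinguishable from a range of 5,000.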

One of the sources of inaccuracy is from estimating the size of a Btree level. So my first idea to make it more accurate was to look at all the nodes on each of the top levels in order to get an exact count. I implemented this, and it was an improvement, but the downside was that it could end up reading a lot of nodes because Btrees have a large fan-out.

So I went back to my old approach of doing searches for the beginning and ending values of the range, and estimated the level sizes and therefore the portion of the tree between them. I made a better estimate of the level size by basing it on the size of the nodes passed through on the search. Since this approach only does two searches down the tree (not reading whole levels) it's reasonable to go further down the tree to get a more accurate estimate. In the end, I found some cases required searching all the way down the tree to the leaves. This is ok, since Btrees are shallow due to their large fanout. (Even going all the way to the leaves is still not 100% accurate because you're still only approximating the level sizes on the way down.)

This was better, but there was still an annoying flaw - because node sizes vary, if a search happened to pass through smaller nodes, it would estimate different level sizes than a search that happened to pass through larger nodes. Different level size estimates would result in poorer estimates. Differences in level size estimates near the top of the tree were especially bad since their effect was multiplied. For small ranges (e.g. "close" values) this wasn't a problem because the two searches would pass through the same nodes at the top levels of the tree and therefore estimate the same level sizes. But larger ranges could be less accurate.

The improvement I came up with was to use the average of the search's estimates of level sizes. Not only does this give a better estimate, but it also means that the two searches use consistent level size estimates, which is just as important as the estimates being accurate. It was a little more awkward for coding since the averaging had to be done in the middle of the searches. I originally wrote the code as recursive, but ended up with an iterative approach since it was shorter and simpler. (Not because it might be faster.)
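Here's a rough sketch of the idea - the node layout and search details are assumptions for illustration, not the actual jSuneido code (linked below):

using System;
using System.Collections.Generic;

class Node
{
    public List<string> Keys = new List<string>();
    public List<Node> Children = new List<Node>();  // empty in leaf nodes

    // index of the slot a search for key would descend into
    public int SlotFor(string key)
    {
        int i = 0;
        while (i < Keys.Count && string.CompareOrdinal(key, Keys[i]) >= 0)
            ++i;
        return i;
    }
}

class Search
{
    public List<int> Sizes = new List<int>();  // node size seen at each level
    public List<int> Slots = new List<int>();  // slot taken at each level
}

class Btree
{
    Node root;

    Search SearchDown(string key)
    {
        var s = new Search();
        Node node = root;
        while (node != null)
        {
            int slot = node.SlotFor(key);
            s.Sizes.Add(node.Keys.Count + 1);
            s.Slots.Add(slot);
            node = node.Children.Count == 0 ? null : node.Children[slot];
        }
        return s;
    }

    // estimated fraction of the tree spanned by the range [from, to)
    public double RangeFrac(string from, string to)
    {
        Search lo = SearchDown(from);
        Search hi = SearchDown(to);
        double fracLo = 0, fracHi = 0, width = 1;
        int levels = Math.Min(lo.Sizes.Count, hi.Sizes.Count);
        for (int i = 0; i < levels; ++i)
        {
            // average the two searches' level size estimates so both
            // positions are measured against the same yardstick
            double size = (lo.Sizes[i] + hi.Sizes[i]) / 2.0;
            width /= size;
            fracLo += lo.Slots[i] * width;
            fracHi += hi.Slots[i] * width;
        }
        return Math.Max(fracHi - fracLo, width);  // at least one slot wide
    }
}

The essential point is the single averaged size per level: without it, the two searches can disagree about level sizes near the top of the tree, and the final subtraction magnifies that disagreement.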

This may be a known technique, but I haven't encountered it so I thought I'd share it. If anyone is aware of better approaches, let me know.

jSuneido btree source (scroll to the end for the rangefrac methods)

Monday, August 06, 2012

Security

A friend recently had their Gmail account hacked. They had a strong password, but they admitted they had used the same password for other sites. Luckily, they discovered it quite quickly and it doesn't appear that any harm was done.

It's a good idea to use a separate, strong password for your email account because your email account is the "master key" to all your other online accounts. (Most password recovery mechanisms use email.) I do sometimes reuse the same password, but only for unimportant things like forums or bug trackers which don't have credit card numbers or any other critical information.

This incident gave me the push to get around to setting up 2 factor verification on my Gmail account, something I've been meaning to do for a while.

It wasn't too painful. I installed the Google Authenticator app on my iPhone so I don't have to use SMS. I also set up my LastPass account to use Google Authenticator.

Most people can't even manage to use decent passwords, let alone deal with 2 factor authentication. But if you can put up with the small extra hassle, it's probably worth it.

Also on the security front, I recently decided to install anti-virus software on our Macs. There's still debate on whether this is necessary, but with Apple's growing popularity, they're becoming an increasingly attractive target. I picked the free version of Sophos to start. It's always hard to tell how good anti-virus software really is, but this was easy to install, hasn't had any noticeable effect on performance, and is completely unobtrusive. Of course, it's yet one more piece of software running on your machine, and it can't help but slow down startup.

Sunday, August 05, 2012

D-tours

I've been continuing to spend a little time here and there playing with the D language.  For a sample project I decided I would try implementing Suneido's lexical scanner. It's pretty simple, just a little string and character manipulation so I thought it would be easy.

I based the D version on the Java version since it's newer and cleaner than the C++ version.

The first thing I ran into was that the lexer returns "tokens", which in the Java code are not just integer enums, but actual enum classes with fields and methods. Java went from no enums in early versions, to quite sophisticated enum capabilities.

In typical new user fashion, I tried to reproduce the Java style token enum in D. (It's hard to avoid "cutting against the grain" when learning a new language.) D does allow you to use structs for enums so it seemed this would work. But if you use structs then you lose the automatic assignment of consecutive integer values, and D structs are value types so they aren't necessarily unique.

Then I went off on a tangent looking into D's "mixin" ability. D can run D code at compile time to generate a string - this is CTFE (compile time function evaluation) - and a mixin inserts that string into your source code. For example, you can write:
mixin(myfn(arguments...));
where myfn returns a string containing source code to be compiled into your code in place of the mixin "call". This is easier than trying to use C++ templates to do compile time calculations, but not as seamless as Lisp macros because you have to work at the level of strings.

An example of the power of this is the Pegged library for D. This allows you to include a grammar specification in your source file, have a parser generated from it, use it to parse DSL code also included in your source file, and then use the resulting parse tree to generate D code - all of this at compile time, without requiring any external tools.

It was fun looking at what you could do with mixin, but it was overkill for what I was trying to do. I realized another feature of D would let me keep simple integer enum tokens, but still work with them as if they were objects with properties.
enum Token { IF, ELSE, AND, OR, ... }
bool isKeyword(Token token) { ... }
D allows you to call functions like isKeyword as if they were methods on their first argument, so you can say token.isKeyword() even though token is an int, not an object.

This also led me to discover D doesn't have a "set" data structure, which surprised me a little. It does have hash maps built into the language, which can be used to emulate sets, and it has a red-black-tree implementation which could also be used. But it doesn't have an actual "Set" class. It was easy enough to throw together my own Set so I could make a set of keywords. (Which I didn't end up using, because what I really needed was a map of keyword strings to tokens.)

What I initially wrote was:
static auto keywords = new Set!Token(IF, ELSE, ...);
But that won't work because static initializers can only be simple literals. However, you can write:
static Set!Token keywords;
static this() { keywords = new Set!Token(IF, ELSE, ...); }
I'm not sure why the compiler can't do this same translation as simple syntactic sugar. It's extra annoying because you can't use the "auto" type inference.

You could do it with a mixin like:
mixin(stat("keywords", "new Set!Token(...);"));
but that's not exactly pretty.

Another minor annoyance was the lack of an equivalent to Java's static import. If I want to refer to the tokens as just IF or ELSE I could make them "anonymous". But then they don't have a specific type. Or I can give the enum a named type, but then I always have to reference them as Token.IF. In Java I could say import static Token.* and then just use the bare names.

I ended up going back to a mixin approach with each Token a static class. I used a class rather than a struct because I wanted to pass tokens around by reference, not by value. See token.d

I also ended up writing multiple versions of the lexer. As with most rewrites, I found a number of improvements I could make. Most of them aren't specific to D, I could apply them to the other versions as well.

After getting it working, I decided to go back and write it in a functional (rather than object oriented) style. In this version the lexer is a function that takes a string and returns a struct with the token, its string value, and the remainder of the source string. This version should work with Unicode UTF-8 strings, not just ASCII, because I only access the source string with front and popFront.

Along the way I also started a simple version of Hamcrest style asserts. I almost expected something like this to be available, but I couldn't find anything.

One of the features I was excited about with D was immutability. But when I tried to make Token immutable, I ran into all kinds of problems (as others have also). One of them being that you can't store immutable objects in a map! Hopefully this is something they'll figure out. I eventually gave up on immutable and managed to get const to work, although even that was a struggle.

I set up a dsuneido GitHub repository for my experimenting. I haven't used GitHub before, but I keep hearing how great it is so I thought I better give it a try. I used their Mac GUI app, which made it quite painless. However, I'm not sure it's the smartest idea to be using three different version control systems - Subversion for cSuneido, Mercurial for jSuneido, and Git for dSuneido. I chose Mercurial for jSuneido because they had better Windows support than Git at that time, but now that Eclipse comes with built-in Git support, I wonder if I made the right choice.

I'm not sure where I'm going with this; so far I'm just playing and learning D. I like it, but it certainly has some rough edges.