Singpolyma

Archive for the "Haskell" Category

On Language Extensions (in Haskell and Elsewhere)

This was originally an email to the Haskell web-devel list and was also posted on Reddit.

Somebody claiming to be yi huang wrote:

Why prefer code generation over template Haskell? Isn’t them essentially the same thing, and template haskell is performed automatically.

Also, from Reddit (nicolast):

I never understood why someone would want to avoid using language extensions which have been in GHC for at least some time. The only reason I can think of is: compatibility with other compilers. But is anyone ever going to compile/run a yesod-routes based application using something other than GHC?!

First off, yes, Template Haskell is very similar to code generation. There are a few reasons I would like to avoid it.

It’s a language extension. I try to avoid those in general, in every standardized language I code in (C89, R5RS, Haskell98), for several reasons.

As nicolast said, compatibility with other compilers is a big one. When I get a piece of code from someone who assumed that “what MSVC does” or “what Racket does” is the same as “anyone can run this”, it becomes quite difficult to use my favourite implementations of those languages. I don’t want to make assumptions about other people’s environments, or about what will be useful in the future. Maybe someone will write a Haskell interpreter that makes using the code much nicer in some context I haven’t even imagined. Who knows.

Additionally, any other static analysis or code-processing tools (like, say, hlint) *also* need to support whatever syntax extensions you’re using (semantics extensions may or may not apply here, depending on the nature of the tool). Requiring that every tool author support all my favourite extensions limits my tool options, and makes life harder for tool authors (since they can no longer just look in one place for the spec and write to that, but must track each compiler’s extensions as well).

Will anyone ever compile/run/analyze a yesod-routes based application using something other than GHC/hlint? (Actually, does hlint support TH? It might.) What specifically about yesod-routes makes this less likely? What drew me to Yesod.Routes.Dispatch was its relative purity in terms of extensions/dependencies, etc.

Additionally, I find Template Haskell specifically (and some other language extensions, like Overlapping Instances) can make code harder to read (for me) and possibly harder to reason about. A code generator makes a file that I can read for comprehension, edit if I want to, etc.
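
To make that concrete, here is a rough sketch of the kind of file a route generator might emit (the module, routes, and handler names are all made up); the point is that it is plain Haskell 98 that can be read, understood, and edited directly:

module Routes (dispatch) where

-- Map a request path (already split into pieces) to the name of a handler.
dispatch :: [String] -> Maybe String
dispatch [] = Just "homeHandler"
dispatch ["users"] = Just "usersIndexHandler"
dispatch ["users", _userId] = Just "userShowHandler"
dispatch _ = Nothing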

OK, that’s a bit of a long answer to a short question, but it sort of sums up my motivation vis-à-vis extensions in general and TH in particular.

Let’s Talk About Strings

A string is a sequence of letters and numbers. It is text. Or is it just any array? The term ‘string’ has been used and abused over time to mean many different things. This has led to confusion, bugs, and a plethora of conflicting opinions.

The Two Main String Types

There are two things people mainly mean when they say ‘string’. On the one hand, they may mean a data structure representing human-language text. On the other hand, they may mean whatever data structure or type their programming environment often uses for representing human-language text. In many cases, this data structure is also used for other sorts of data, such as raw binary data.

This has led to much confusion.

In some programming environments, like C, everything is low-level enough that there cannot be expected to be One True Way to handle a high-level concept such as “human-language text”. Other environments, however, should know better.

Encodings

An encoding, in the context of text, is the scheme whereby text is represented in actual bytes. I will talk about three different kinds of encodings: internal, input, and output encodings.

Internal Encoding

In low-level languages (and, sadly, many high-level languages) there is no native support for the high-level concept of ‘text’. The programmer is given a way to represent an array of bytes (also called a byte string) and must decide how to encode text in RAM for their own application. This is true of C’s char*, Ruby 1.8’s strings, and many others. Historically, ASCII was the de facto internal encoding, but this is not generally acceptable if your application will be used for anything except a subset of English. Different internal encodings have different trade-offs, which I’m not really interested in covering here. As with so many other things, you probably should not be making this decision yourself. Find a library or a programming environment that handles text and let it deal with how to actually store the data in RAM.

Input Encoding

Input encoding is the only encoding an application programmer should ever have to deal with. Unfortunately, for historical reasons, there are a plethora of encodings out there. You will need a way to know how the file, network socket, or other input you are reading is encoded. Many protocols and file formats have simple markers that will let you know. You then have to feed this information to whatever calls you use to get your high-level text representation from the input.
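
As a sketch of what this can look like in Haskell (assuming the text and bytestring packages; the decodeInput name and the choice of supported encodings here are mine), the encoding you detected is what selects the decoding call:

import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', decodeLatin1)

-- Turn raw input bytes into text, given the encoding the format declared.
decodeInput :: String -> B.ByteString -> Either String T.Text
decodeInput "UTF-8" bytes = either (Left . show) Right (decodeUtf8' bytes)
decodeInput "ISO-8859-1" bytes = Right (decodeLatin1 bytes)
decodeInput enc _ = Left ("unsupported encoding: " ++ enc)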

Output Encoding

This is the encoding you use in files you write, output you print to stdout, bytes you send over network sockets, etc. If you are writing some existing file format or protocol, you need to see how that format handles specifying what encoding you are using and mark it correctly. This, however, should be the extent of what you need to do to handle output encodings. I’m going to say something some people consider controversial, but I’ve given it a lot of thought, and after working with this stuff in many contexts I’ve come to a conclusion.

There is only one acceptable output encoding, and it is UTF-8.

Always.

There are some people around who will bad-mouth Unicode for some of the problems it had historically (for a while it used 16-bit-max-width encodings and could not properly handle some Asian languages), but these have been fixed for some time now. The standard is not set in stone, so if bugs are found they can be fixed.

The other thing people complain about is that UTF-8 encodes English text more compactly than text in other languages. It is true that language-specific encodings will take less space than UTF-8, but the complexity and potential for bugs that come from using a plethora of encoding mechanisms as a hobo compression mechanism are not worth it. If you’re concerned about space, then compress your content.

UTF-8 is backwards-compatible with ASCII, such that ASCII-only implementations will continue to work in a UTF-8 environment (at least as much as they ever worked). This also means that your application can easily handle old ASCII data even if all you implement is UTF-8.
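
In Haskell terms (again assuming the text and bytestring packages; writeTextFile is just a name I am using for illustration), the whole output-encoding story can be this small:

import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)

-- Whatever the internal representation, the bytes that leave the program are UTF-8.
writeTextFile :: FilePath -> T.Text -> IO ()
writeTextFile path = B.writeFile path . encodeUtf8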

Haskell and Ruby

As a way to illustrate this topic, I will take as examples Haskell and Ruby 1.9.

In Ruby 1.9, strings have an associated encoding. They are objects with a byte string and an encoding property. When you write them out, they are written in whatever encoding you specify. This is a huge improvement over 1.8 (where all encoding decisions were manual), but still exposes more complexity than is necessary to the programmer.

Furthermore, all I/O operations in Ruby 1.9 are done using String. How is binary data read, then? String has an associated encoding called ‘binary’. This is just shameful. The programmer still has to keep track of which Strings are text and which are byte strings.

In Haskell, the default String type is actually an alias for [Char] (a list of Char), and Char is defined to be a 32-bit Unicode code point. It is unequivocally for text. Binary data can be represented as [Word8] (a list of Word8, that is, a list of bytes).

Because linked lists are not always the most efficient representation, there are also convenient packed-array libraries for both of these types, intuitively named Text and ByteString.
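
A small sketch of how these types relate (the names below are mine); note that going from text to bytes always means choosing an encoding explicitly:

import qualified Data.Text as T
import qualified Data.ByteString as B
import Data.Text.Encoding (encodeUtf8)

greeting :: String -- an alias for [Char], a list of Unicode code points
greeting = "hello"

greetingText :: T.Text -- packed text
greetingText = T.pack greeting

rawBytes :: B.ByteString -- packed bytes, with no implied meaning
rawBytes = B.pack [104, 101, 108, 108, 111]

greetingBytes :: B.ByteString -- text turned into bytes via an explicit encoding
greetingBytes = encodeUtf8 greetingText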

Unfortunately, because of the confusion that often comes from the shameful state of so much tooling, many Haskellers use ByteString.Char8 to store what should be Text, and handle the internal encoding themselves (often poorly).

I/O in Haskell can easily be done directly with any of these types.
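
For example (the file names here are made up), reading raw bytes and reading decoded text are separate operations on separate types:

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main = do
	bytes <- B.readFile "photo.jpg" -- raw bytes, no decoding at all
	text <- TIO.readFile "notes.txt" -- Text, decoded using the locale encoding
	print (B.length bytes) -- number of bytes
	print (T.length text) -- number of characters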

Summary

  1. Use a text library for text, use a raw byte string type for binary data. Do not confuse the two.
  2. If you’re reading an existing format, you’ll have to correctly detect the input encoding and transform it to the data structure used by your text library.
  3. If you’re writing an existing format, you’ll have to correctly identify what output encoding you’re using.
  4. You should only use UTF-8 as your output encoding.

Haskell for Rubyists

Posted on

In the last year I’ve been playing with a new language: Haskell. I have found it to be a very suitable second high-level language for me as a Rubyist, and in this post I will explain (with examples!) some of why I, as a Rubyist, love Haskell, and how you can use it to do awesome things.

Why Another Language?

One thing I wasn’t sure about for a long time was whether I even needed another language. Obviously, being able to work in any environment that gets thrown at you is an essential job skill, but I mean a language I chose for myself. I am a Rubyist; do I need to be anything else?

Well, while I love Ruby, there are a few reasons I eventually chose to go in search of a second language to call my own:

  1. Performance

    Ruby implementations are getting faster all the time, but we all know there are faster things out there. It’s the reason some things (like the JSON gem) have versions written in C. I felt it would be nice for some tasks to get the performance boost that comes from an optimising compiler, without having to drop all the way to C.

  2. Portability

    Yes, Ruby is super-portable… to systems that have a Ruby implementation. It’s somewhat complex to just email arbitrary users a Ruby script and hope they can use it without setup help.

  3. Linking with native code

    Ruby extensions and FFIs exist so that we can call C code from Ruby. What about calling Ruby code from C or another language? It can be done, but only if most of MRI is linked into the target and the Ruby is more-or-less “eval’d”.

In case you haven’t guessed, basically I wanted a nice high-level environment like Ruby, but with a native code output.

Isn’t Haskell Hard?

No. Or at least, not harder than any other language. It is true that the Haskell community has a higher-than-average concentration of Ivory Tower Dwellers. Yes, some of them have been in the Tower for so long that they have forgotten how to write anything but symbols from higher-order logics. Yes, the documentation for some really nice Haskell libraries and features consists of dense academic papers. Don’t let them scare you off. There are humans in the community as well, and #haskell on freenode IRC has many of them.

Type Inference

One of the nice features of Ruby is the type system. If you’re used to un-inferred static typing (read: C) then the ability to write code like this:

def fun(a, b); (a + b) * 3; end

is liberating. Haskell has a static type system, which means that you’ll never have a program crash in production because you passed in different data than you thought, in some case your tests didn’t catch. Unlike C, however, Haskell’s system is strong (which means that data is not magically cast for you, so you get stronger guarantees, just like how in Ruby we must write 1.to_s + "hello" and not 1 + "hello"), but more importantly it is inferred, so the equivalent of the above in Haskell is:

fun a b = (a + b) * 3

You can add type annotations (like in C) if you want to, which sometimes helps for clarity, but you don’t need to.
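
For example, the type GHC infers for fun above can be written out explicitly if you like:

fun :: Num a => a -> a -> a -- optional; the compiler infers this on its own
fun a b = (a + b) * 3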

The only limitation here is that data structures mostly hold values of a single type; for example, in Ruby:

a = [1, "hello"]

is perfectly fine. This is sometimes a good thing, and sometimes causes strange bugs. In Haskell, this would be an error, so we need to define unions explicitly:

data StuffInMyList = I Integer | S String
a = [I 1, S "hello"]

A small pain, but I feel it’s a fine trade-off.
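
Getting the values back out is done by pattern matching on each case, for instance (describe is just an illustrative name):

describe :: StuffInMyList -> String
describe (I n) = "an integer: " ++ show n
describe (S s) = "a string: " ++ s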

Mixins

The mixin module is one of the defining characteristics of Ruby. Haskell has something similar, called Typeclasses, which form the foundation of polymorphism in the language. In Ruby:

module Equality
  def equals?(b); self == b; end
end

class Thing
  include Equality
end

In Haskell:

class (Eq a) => Equality a where
	isEqual :: a -> a -> Bool
	isEqual x y = x == y

data Thing = Thing deriving (Eq)

instance Equality Thing

This looks a bit different. You’ll note I had to give a type signature for the isEqual function. This is one of the few places where you have to, and it has to do with making the polymorphism we get with mixins a bit safer. My Equality mixin has to be restricted to types in the Eq typeclass (because I use == on them), which is also true in Ruby, except that in Ruby every single class has == defined.
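
With those definitions in scope, the default implementation is what gets used; a tiny made-up usage:

main = print (isEqual Thing Thing) -- prints True, via the default isEqual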

Significant Whitespace

Haskell has significant whitespace. If you’re a Rubyist on the run from Python this may scare you, but there are two reasons it does not bother me. First, Haskell’s whitespace rules are much nicer than Python’s, and the way code gets written in Haskell you rarely have the “where does this huge block end?” problem. Second, the whitespace in Haskell is optional! Here’s that typeclass again, but without using whitespace:

class (Eq a) => Equality a where { isEqual :: a -> a -> Bool; isEqual x y = x == y; }

Great!

Let’s see a real example!

You may have heard that Haskell I/O is weird, and that Haskell has no access to mutation. While Ruby code is often non-destructive in nature, access to mutation is sometimes handy. Understanding why Haskell I/O is safe does take learning a new concept (called Monads, with roots among those academics, but there are good, simple explanations out there without too much math, like in Learn You a Haskell (for Great Good), which I recommend), but doing simple I/O is actually not complicated.

main = do {
text <- readFile "somefile.txt";
print $ length $ lines text;
}

This is the Haskell code to read a text file, split it into lines, count the number of lines, and print out that number. Pretty simple!

What about mutation? Well, it is true that there are no mutable global variables in Haskell, but really, who uses globals? If you really need mutation for something, the simplest way to make a mutable reference is:

import Data.IORef

main = do {
someRef <- newIORef 1;
val <- readIORef someRef;
print val;
writeIORef someRef 12;
val <- readIORef someRef;
print val;
}

Of course, if you want you could make this a bit less verbose:

import Data.IORef

x =: y = writeIORef x y
new x = newIORef x
get x = readIORef x

main = do {
someRef <- new 1;
val <- get someRef;
print val;
someRef =: 12;
val <- get someRef;
print val;
}

Many Libraries

Haskell has a very active community that has produced many libraries covering all sorts of cases. The main place to look for these is Hackage.

REPL

Another thing that drew me to Ruby initially was irb. The ability to just fire up a shell-like environment, enter expressions, and load in my code and play with it live is a very nice thing. There are several such environments for Haskell; the one that I prefer is GHCi, which also has commands to set breakpoints and such (which I have never needed) and to find out what the type of some expression is (very handy).
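
A short made-up session gives the flavour; :t is the command for asking the type of an expression:

$ ghci
Prelude> let fun a b = (a + b) * 3
Prelude> fun 1 2
9
Prelude> :t fun
fun :: Num a => a -> a -> a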

Other Useful Bits

There is a very useful tool for Haskell called hlint, which analyses your code and makes (sometimes surprisingly insightful) suggestions. I don’t always agree with it, but it is very nice.

Debug.Trace is a very useful library for printing out arbitrary values from anywhere in your code without otherwise affecting the behaviour of the code. Very useful for debugging.
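
For example (a hypothetical use), trace takes a message and a value, prints the message when the value is forced, and returns the value unchanged:

import Debug.Trace (trace)

fun a b = trace ("fun called with " ++ show (a, b)) ((a + b) * 3)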

If you want to learn more, I highly recommend Learn You a Haskell for Great Good.