Quick look at upcoming Parallel Collections in Scala 2.9

Scala 2.9 should be released "in weeks", so I took recent snapshot for quick spin, to see how it work with my open-source project. Upgrade was easy, nothing compared to 2.7. Just replacing scalac and scala-lib jar files. I encountered three minor problems:

1) Case classes are generated different way and autogenerated serialVersionUID is different. So I could not read my old data generated with Scala 2.8. Fixed after forcing old id with '@SerialVersionUID(XXX)'

2) There was change in Application trait, now it behaves much better. However Intellij Idea does not recognize it as launch-able class. I had to create launcher profile by hand.

3) Library jar file has grown yet again, now from 6 MB to 8 MB. Parallel collections are biggest new part. And it is not so easy to strip them out with ProGuard for Desktop/Android apps.

So lets checkout new parallel collections. First little bit of theory. You may be familiar with fork join' framework for Java. Key is 'work stealing' which distributes tasks between processors and reduces concurrency overhead. Scala parallel collections are implemented on top of this framework. More about internals can be found in this presentation, in this paper or at nightly scaladoc.

Parallel collections extends old traits and has same method names (foreach, filter, fold...). There are tons of new parallel traits and classes. Most old traits ('Seq','Map') has its parallel counterparts ('ParSeq', 'ParMap'...). Old collections has new methods to switch to parallel mode ('par', 'toParIterable', 'toParSeq' etc..). There are also mutable and immutable versions.

Not all collection types are parallizable. Collection is divided in half and each part processes separately. This works well for intervals, arrays and trees, but does not work for linked lists and other traversable once collections. There are new tree new essential parallel collection types:

Parallel range of integers. To construct it use 'par' method on regular range. Little demonstration:

val interval2 = (1 to 10000).par.map{i=> 
    //this is executed in parallel
    i*i
}

However there are a few catches. For example 'foreach' is executed only in single thread:

(1 to 100).par.foreach{i=>
    println(Thread.currentThread)
}
//but using 'workaround' works
(1 to 100).par.map{i=>
    println(Thread.currentThread)
    Unit
}   

there is other problem: it seems to works only with integers. With Long, Double or other number type 'par' method is not defined.

Other basic type is parallel array:

Array(1,2,3,4,5).par.map{i=>
    //this is executed in parallel
    i * i
}

However 'foreach' method is still executed in single thread, I guess there is some deeper design rule behind this. Other problem is that ParArray is not optimized for primitive types. So each Integer allocates new object instance. ParArray works well and I did not found other unexpected behaviour.

Third type is parallel hash map. It divides tree of hash buckets into separate tasks. Implementing parallel hash tree is very impressive and truly innovating. But I had no time to test it more deeply.

Now tricky stuff. First collections are parallel, but not 'concurrent'. If your iterations over collection leaves side effect, you need to make sure it is thread safe. There is no 'compiler magic check' to test side effects for you. It is just thin layer on top of 'fork/join' framework.

Secondly you need to have all data in memory, before you start parallel processing. Take this example which filters lines from file:

val lines = io.Source.fromFile("bigFile.txt.gz").getLines.toParIterable
val someNames = lines.filter(_=="Some Name")
someNames.foreach(println)

Most efficient way is to iterate over file and filter lines as you go. But 'toParIterable' method actually loads entire file into memory and constructs 'ParArray' which is then filtered. So there is major different between methods 'par' (constructs parallel collection very efficiently) and 'toPar...' (just makes copy slow way).

Scala 2.9 is not just about collections, for example there is Future which wraps an functions, executes it in parallel and returns its result.

I did not fully integrate parallel collections into my application yet. But I found migration easy. It makes some parts of my application much faster. But it is necessary to have good performance testing on various machines (quad workstation down to netbook), otherwise your app may actually become slower. Also test with huge data and watch memory usage.

Overall I am very impressed. Scala developers integrated 'fork/join' framework into Scala very nicely. Things like ParHashMap and Combiner/Splitter are just 'wow'. I would also recommend to wait for final Scala 2.9 release.

I am worried there may be hype created around this new feature. And lot of people may get disappointed as this is not 'nosql map/reduce running on amazing quantum cluster in other dimension'.




Last modification: April 23 2012

blog comments powered by Disqus