Thursday, 26 November 2009

Pragmatic Scala: Parsing columns from plain text

A few days ago I was preparing for a meeting and I thought I'd print out a list of all the attendees from the Eventbrite event details.

When I looked at the list I saw that it was in no particular order and didn't print very well, so I thought, "I wonder if I could copy and paste this into a spreadsheet and sort it by surname". Unfortunately each entry ran to one or two lines, held a variable amount of information, and was delimited by a mixture of spaces and commas.

Then I thought, hmm, I could process this quickly with a bit of Scala and pattern matching.

There's nothing clever here, but for me it's quite significant that writing this code took so little effort that I could use Scala to solve the problem there and then, and it worked seamlessly. I've posted it here because there are too few simple Scala examples lying around the web for people to get started with.

So here's the code if you're interested…

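#!/bin/sh
exec scala "$0" "$@"
!#

import java.io.File
import scala.io.Source

val source = Source.fromFile(new File("Attendees.txt"))
val lines = source.getLines().map(_.trim)
lines.foreach( line => {
  // Split off the optional presentation title after the tab
  val (details, presentation) = line.split("\t").toList match {
    case List(d) => (d,"")
    case List(d,p) => (d,p)
  }
  // Split the details into name, position and organisation
  val (name, position, organisation) = details.split(", ").toList match {
    case List(n) => (n,"","")
    case List(n,c) => (n,"",c)
    case List(n,p,c) => (n,p,c)
    case List(n,p,d,c) => (n,p+ ", "+d,c)
  }
  // Split the name into first, middle and last names
  val (first, middle, last) = name.split(" ").toList match {
    case List(f) => (f,"","")
    case List(f,l) => (f,"",l)
    case List(f,m,l) => (f,m,l)
  }
  // Assumed output step: tab-separated columns, surname first, ready to paste into a spreadsheet
  println(List(last, first, middle, position, organisation, presentation).mkString("\t"))
})
source.close()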


It was parsing lines of text which looked roughly like this:

Firstname [Middlename] Surname[, [Position, [Department, ]]Company][\tPresentation]
(i.e. the bits in square brackets were optional).
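
To make the optional parts concrete, here is a minimal sketch of how each variant of the format falls into one of the patterns. The object and method names (FormatDemo, splitName, splitDetails) and the sample entries are made up for illustration; only the patterns themselves come from the script.

```scala
// Hypothetical helper names; the patterns mirror the ones used in the script.
object FormatDemo {
  // Firstname [Middlename] Surname
  def splitName(name: String): (String, String, String) =
    name.split(" ").toList match {
      case List(f)       => (f, "", "")
      case List(f, l)    => (f, "", l)
      case List(f, m, l) => (f, m, l)
    }

  // Name[, [Position, [Department, ]]Company]
  def splitDetails(details: String): (String, String, String) =
    details.split(", ").toList match {
      case List(n)          => (n, "", "")
      case List(n, c)       => (n, "", c)
      case List(n, p, c)    => (n, p, c)
      case List(n, p, d, c) => (n, p + ", " + d, c) // department is folded into the position
    }

  def main(args: Array[String]): Unit = {
    // Made-up sample entries, one per optional variant
    val samples = List(
      "Jane Smith",
      "Jane Smith, Example Ltd",
      "Jane Smith, Developer, Example Ltd",
      "Jane Ann Smith, Developer, Platform Team, Example Ltd")
    for (s <- samples) {
      val (name, position, organisation) = splitDetails(s)
      println((splitName(name), position, organisation))
    }
  }
}
```

For the last sample the position comes out as "Developer, Platform Team", because the four-entry pattern glues the position and department back together.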

For those unfamiliar with Scala, the first three lines (the shell preamble around exec scala) just allow the code to be run as a script from the command line, like any shell script. Then the lines.foreach( line => code just loops through each line of the file, setting line equal to that line. The nice bit is the matching parts, which assign values to multiple variables at once, so:

val (details, presentation) = line.split("\t").toList match {
  case List(d) => (d,"")
  case List(d,p) => (d,p)
}
effectively says:
  • Assign the value of the following expression to both details and presentation at once.
  • Take the line and convert it to an array of Strings by separating it at every tab character, then convert the array to a List.
  • Match the List against one of two patterns:
      • If it is a List with one entry, bind that entry to d and return the thing on the right-hand side of the =>. In this case it is a tuple (a pair of values): d and an empty String.
      • If it is a List with two entries, bind the first to d and the second to p, then return them both as a tuple.

So, at the end we have details set to everything before the tab and presentation set to everything after the tab or an empty string if there was no tab.
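
If you want to try that one step on its own, something like the following should do. This is just a sketch: the wrapper name splitPresentation and the sample strings are my own inventions for the demo, not part of the script.

```scala
object TabSplitDemo {
  // Hypothetical wrapper around the first match from the script
  def splitPresentation(line: String): (String, String) =
    line.split("\t").toList match {
      case List(d)    => (d, "")
      case List(d, p) => (d, p)
    }

  def main(args: Array[String]): Unit = {
    // An entry with a tab: both halves are bound in one go
    val (details, presentation) =
      splitPresentation("Jane Smith, Example Ltd\tIntro to Scala")
    println(details)      // Jane Smith, Example Ltd
    println(presentation) // Intro to Scala

    // An entry without a tab: presentation falls back to the empty String
    val (_, noPresentation) = splitPresentation("Jane Smith, Example Ltd")
    println(noPresentation.isEmpty) // true
  }
}
```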

However, if there are two or more tabs then none of the patterns match, so the program throws a runtime error (a scala.MatchError). That's great, because the program either does what it was meant to or fails loudly, so we always know which.
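
Here is a small sketch of that failure case, wrapped in scala.util.Try so the demo itself doesn't crash. The names MatchErrorDemo and splitPresentation are hypothetical, and the input "a\tb\tc" is a made-up line with two tabs.

```scala
import scala.util.Try

object MatchErrorDemo {
  // Same two patterns as the script; anything else is left unmatched on purpose
  def splitPresentation(line: String): (String, String) =
    line.split("\t").toList match {
      case List(d)    => (d, "")
      case List(d, p) => (d, p)
    }

  def main(args: Array[String]): Unit = {
    // Two tabs give a three-entry List, which neither pattern accepts
    val result = Try(splitPresentation("a\tb\tc"))
    println(result.isFailure) // true
    result.failed.foreach(e => println(e.getClass.getName)) // scala.MatchError
  }
}
```

In the script itself there is no Try, so the error simply stops the run at the offending line, which is the behaviour I wanted.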

