Semantic Werks

Thoughts on people, machines and systems.

Posts Tagged ‘climate’

Climate models and computing talk with Balaji

Yesterday the U. Toronto Atmospheric Physics group hosted a talk by Balaji [1], head of the modeling systems group at the Geophysical Fluid Dynamics Lab (GFDL), a NOAA lab at Princeton University in New Jersey. Balaji’s interests include High Performance Computing (HPC) and computer modeling. His Ph.D. is in Physics, but I get the impression he is ‘computationally-oriented’. [2] I don’t have his slides, but it looks as though he posts talks on his website.

The gist of his talk was that climate modeling, while of critical importance to the world, is facing several challenges:

  • Scale: doubling the resolution of the model in the X and Y dimensions means a factor of 4 increase in problem size, and often we want to double the temporal resolution as well, for a factor of 8 increase overall.
  • Climate models need to integrate new science components. For example, modeling hurricane formation means integrating sea-temperature models with atmospheric models. These integrations pose several challenges.
  • Models need to be reproducible, according to current scientific dogma, but this is difficult when a single run can take many days and is subject to hundreds if not thousands of parameter choices. There are research efforts underway to understand what ‘reproducibility’ ought to mean at this scale: is probabilistic reproducibility enough? Even understanding the results of a model run can be hard.
  • More and more, the products of climate modeling are being sought as input into other models or decision-making. For instance, policy makers need drought predictions for the next 50 years in order to do land-use planning. The problem is that a) there are more of these requests than GFDL, for instance, can handle; b) the models are not suited to these predictive tasks and require expert interpretation; c) selecting a single model is not desirable when the average of all models gives better results. Balaji mentioned a proposal to create a ‘climate service’, akin to the weather service, for doing this sort of thing.
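
The arithmetic behind the scale point can be written out as a small sketch. It assumes that "increasing the resolution" means doubling it, and that problem size grows in proportion to the number of grid cells times the number of time steps; the function name `cost_factor` is made up for illustration.

```python
def cost_factor(refinement: int, refined_dims: int) -> int:
    """Growth in problem size when `refined_dims` dimensions are each
    refined by `refinement` (assumes size is proportional to cell count)."""
    return refinement ** refined_dims


# Doubling the X and Y resolution only.
print(cost_factor(2, 2))
# Doubling X, Y and the temporal resolution.
print(cost_factor(2, 3))
```

Under the same assumption, a model that also doubled its vertical resolution would grow by `cost_factor(2, 4)`.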

A few other notes:

Balaji described the FRE, a configuration management system (of sorts) for recording experimental parameters and workflows. This is how GFDL tries to keep track of model runs, and maintain reproducibility. He did mention that the system can still be tweaked at the instance level, so the FRE may not capture everything that was done for that run.
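
To make the record-keeping idea concrete, here is a minimal sketch of fingerprinting a run configuration so that two runs can be compared. This is not FRE, whose format is not described here; the parameter names and values are invented purely for illustration.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Return a stable SHA-256 digest of a run configuration."""
    # sort_keys makes the serialization independent of key insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical parameters, invented for this example.
recorded = {"resolution_km": 50, "timestep_s": 1800, "compiler_flags": "-O2"}
# The same run after a tweak at the instance level.
as_run = dict(recorded, compiler_flags="-O3")

# Differing digests flag that the recorded setup is not what actually ran.
print(fingerprint(recorded) == fingerprint(as_run))
```

A scheme like this only catches tweaks that make it into the recorded dictionary, which is exactly the gap Balaji pointed out.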

I asked Balaji why these research centres were so intent on building and maintaining supercomputer clusters. After all, it isn’t something they should be experts in. I suggested the real experts were companies like Google and Amazon who routinely operate thousands of processors in data centres around the world.

His response was that they need the control. The models need control over configuration, for reproducibility (after all, they are interested in bit-level reproducibility); they also need control over core cycles, so that the models can run uninterrupted. He gave as an example the Department of Energy supercomputer centre, where other, more processing-intensive jobs would bump climate models from the queue. Furthermore, he thought it likely that running on Google App Engine, for example, might cost even more than maintaining and running your own cluster.

That answer is understandable, but the problem does seem solvable. These are essentially business problems that can be negotiated: cost of CPU time, service level guarantees, etc. It’s hard to see how GFDL can compete with Google’s engineers in building and maintaining massive clusters. As an example, DNA sequencing is now taught on Amazon EC2 ‘machines’.

Finally, I would think from a reproducibility standpoint that relying on knowledge of specific machine configurations is way too detailed. It shouldn’t matter to your model that this machine runs VMS 3 while this other machine runs Linux 2.6.24. I know it *does* currently matter; but it shouldn’t.
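
One commonly cited reason bit-level results can differ between machines (an illustration, not something taken from the talk) is that floating-point addition is not associative, so a compiler or parallel layout that sums in a different order can change the last bits of a result. A minimal Python sketch:

```python
import math

values = [0.1, 0.2, 0.3]

# The same three numbers, summed in two different orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# Bit-level comparison: the two sums are not identical.
print(left_to_right == right_to_left)
# Tolerance-based comparison: they agree to within rounding error.
print(math.isclose(left_to_right, right_to_left))
```

The second comparison is closer in spirit to the looser, ‘probabilistic’ notion of reproducibility mentioned earlier than to bit-level equality.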

It was a fascinating talk; he tailored it very well to the diverse group of people listening in. I wonder if Steve Easterbrook will get to visit the GFDL lab as part of his sabbatical research.

[1] I believe this is his last name, but used in the Brazilian fashion, a la Ronaldo/Pele.

[2] I apologize for being sleepy mid-way through!

Written by Neil

2010 June 9 at 09:44

Using Freebase

I’m trying to learn how to import data into Freebase, a web site that is like a machine-readable Wikipedia, by way of learning how to use Gridworks, a data-cleaning tool. However, there don’t seem to be any tutorials on creating data schemas in Freebase.

The data

I was browsing the BC Apps4ClimateAction site to play with Gridworks, and found the BC Forest Service’s climate data for the biogeoclimatic zones of BC. I think they are recording things like average annual rainfall for each zone. I thought the general concept of biogeoclimatic zones (BGC) was worth sticking on Freebase, as it forms a terminology that can be re-used in many different settings. A BGC (or BEC) zone is a geographical area that is defined by similar climatic conditions: for example, a zone defined by rolling hills of prairie grasses.

The data I’m working with consists of the zone names and subdivisions, along with some common names:

Snapshot of the BGC data model

There are 229 rows like that.

The model

Now, the important task is to represent this spreadsheet in a linked data schema. There seem to be some common properties: one is the partonomy — well, geographic containment, really — between Zone, Subzone, Variant and Phase. The other relations are descriptive: the long name for the zone, and the description of the particular subzones. Finally, we have a relationship between the zone and something called the NDT Code. I haven’t figured out what that means, but it’s probably worth sticking in.

How does Freebase expect data to be entered?

The meta-schema for Freebase is similar to RDF. There are Types and Properties; each Type belongs to a domain (namespace) consisting of similar Types. A Base contains Types and Properties describing a semantic domain. Topics are instances of (multiple) types and can be related via properties to other Topics.

The Types in this case are the column headings (except the index, of course). The Properties reflect the properties I discussed above. They relate Types together. Then the next question is how to map this model to the Freebase set of properties. Ideally, particularly for these high-level concepts, we would re-use existing relationships.

We also want to describe our Types using existing Freebase types and relations. E.g., we should probably reflect the fact that a BGC Zone is-a Location. The more annotations like this we can include, the more useful our data set should be for people who want to re-use it.
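
As a rough mental model of the meta-schema described above, here is a minimal Python sketch. It is not Freebase's API: the classes `Type` and `Topic` and the id `/base/bcbiogeoclimaticzones/zone` are assumptions made up for illustration; only `/location/location` and its `contains` property come from the schema pages linked in this post.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Type:
    """A type id: a domain path followed by the type name."""
    id: str


@dataclass
class Topic:
    """An instance that can carry several types and property links."""
    name: str
    types: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)

    def link(self, prop: str, other: "Topic") -> None:
        # Record that this topic is related to `other` via `prop`.
        self.properties.setdefault(prop, []).append(other)


LOCATION = Type("/location/location")
BGC_ZONE = Type("/base/bcbiogeoclimaticzones/zone")  # hypothetical id

# A zone is-a Location as well as a BGC Zone.
zone = Topic("BAFA", types={LOCATION, BGC_ZONE})
subzone = Topic("un", types={LOCATION})
zone.link("/location/location/contains", subzone)

print(sorted(t.id for t in zone.types))
```

Giving the zone both types is what lets someone who only knows about Locations still make use of the data.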

So I found the following mappings:

  • BGC Zone is type Location (http://schemas.freebaseapps.com/type?id=/location/location)
  • The same is true for the other subdivisions.
  • The divisions are related using the contains property: http://schemas.freebaseapps.com/property?id=/location/location/contains
  • The Phase and BGC Zone have alias properties: http://schemas.freebaseapps.com/property?id=/common/topic/alias
  • An NDT_Code becomes a new property of a BGC Zone.

Each cell in the table is then linked into this model; e.g., the cell BAFA becomes a Zone topic, with certain instantiated properties (like Contains SubZone 'un').
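
The cell-to-topic linking can be sketched as a pass over the spreadsheet rows that emits containment triples. This is a sketch under assumptions, not the Gridworks import itself: the column names and the second sample row are guesses for illustration, and only the BAFA/un pair comes from the data described above.

```python
import csv
import io

CONTAINS = "/location/location/contains"
LEVELS = ["Zone", "Subzone", "Variant", "Phase"]

# Column names and the second row are assumptions made for this sketch.
SAMPLE = """Zone,Subzone,Variant,Phase
BAFA,un,,
XXXX,aa,1,a
"""


def containment_triples(rows):
    """Turn each row into (parent, CONTAINS, child) triples, level by level."""
    triples = set()
    for row in rows:
        parent = None
        path = []
        for level in LEVELS:
            value = (row.get(level) or "").strip()
            if not value:
                break  # deeper levels are empty for this row
            path.append(value)
            # Qualify each name by its ancestors so 'un' under two zones stays distinct.
            child = (level, " ".join(path))
            if parent is not None:
                triples.add((parent, CONTAINS, child))
            parent = child
    return triples


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for triple in sorted(containment_triples(rows)):
    print(triple)
```

Collecting the triples in a set means a zone repeated across many rows yields each link only once.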

Using Freebase’s Schema Editor

Next I went to the Freebase Sandbox to actually create the schema. I created a new base, and in that base created my four types and added the Location super-type to each type. A Location naturally has the properties of containment, but I couldn’t understand how to override that property to specialize it further. E.g., my Zone should only contain Subzones, not arbitrary Locations. I was therefore forced to create special properties for this base.

A snapshot of the Schema Editor

The neat thing about Freebase is that I can add others who can now edit this schema. For example, a forester might have some other properties that can be added to a Zone. Even better, the schema can be re-used in other schemas: for example, using the Type Zone in some species project.

Accessing the new schema

You can check it out at http://bcbiogeoclimaticzones.sandbox-freebase.com/. The next step is loading the instance data into the base using Gridworks — stay tuned.

Update: Apparently ‘sandbox’ means the server is overwritten weekly. Should have anticipated that. Anyway, I have recreated the base at the main site: http://bcbiogeoclimaticzones.freebase.com/

Challenges

  • Evolving schemas: what happens if the Forest Service redefines the data model or renames a Zone? In a linked data context, external parties will rely on my model in order to build applications. That means a change in my naming conventions (e.g., “Zone” vs. “BGC Zone”) will impact them. I’m not sure the Freebase folks have thought about versioning/backwards compatibility.
  • Consistency: how can we check that our model is consistent? Do we care about issues like normalization in the linked triples context? How can I constrain the model? There are some opportunities in the Schema Editor, but we might want cardinality restrictions, logical tests, and so on.
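
The kind of consistency test raised in the last point could be sketched outside Freebase as a plain validation pass. The rules here (a Zone may only contain Subzones, and so on down the hierarchy, with each child having a single parent) are my reading of the model in this post, and the function and table names are made up for illustration.

```python
# Which child type each level is allowed to contain (assumed from the model).
ALLOWED_CHILD = {"Zone": "Subzone", "Subzone": "Variant", "Variant": "Phase"}


def check(links):
    """Validate (parent, child) pairs, each a (type, name) tuple.

    Returns a list of human-readable problems; empty means consistent.
    """
    problems = []
    parents = {}
    for parent, child in links:
        # Type restriction: e.g. a Zone may contain Subzones only.
        if ALLOWED_CHILD.get(parent[0]) != child[0]:
            problems.append(
                f"{parent[0]} {parent[1]!r} may not contain {child[0]} {child[1]!r}"
            )
        # Cardinality restriction: every child has exactly one parent.
        previous = parents.setdefault(child, parent)
        if previous != parent:
            problems.append(f"{child[0]} {child[1]!r} has more than one parent")
    return problems


links = [
    (("Zone", "BAFA"), ("Subzone", "BAFA un")),
    (("Zone", "BAFA"), ("Phase", "x")),  # deliberately invalid link
]
for problem in check(links):
    print(problem)
```

A check like this would have to run on the client side before loading, since it is not clear the Schema Editor can express these rules.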

Written by Neil

2010 May 21 at 12:37
