Using Freebase
I’m trying to learn how to import data into Freebase, a web site that is like a machine-readable Wikipedia, by way of learning how to use Gridworks, a data-cleaning tool. However, there don’t seem to be any tutorials on creating data schemas in Freebase.
The data
I was browsing the BC Apps4ClimateAction site to play with Gridworks, and found the BC Forest Service’s climate data for the biogeoclimatic zones of BC. I think they are recording things like average annual rainfall for each zone. I thought the general concept of biogeoclimatic zones (BGC) was worth sticking on Freebase, as it forms a terminology that can be re-used in many different settings. A BGC (or BEC) is a geographical area that is defined by similar climatic conditions. E.g., the climatic zone defined by rolling hills of prairie grasses.
The data I’m working with consists of the zone names and subdivisions, along with some common names:
There are 229 rows like that.
The model
Now, the important task is to represent this spreadsheet in a linked data schema. There seem to be some common properties: one is the partonomy — well, geographic containment, really — between Zone, Subzone, Variant and Phase. The other relations are descriptive: the long name for the zone, and the description of the particular subzones. Finally, we have a relationship between the zone and something called the NDT Code. I haven’t figured out what that means, but it’s probably worth sticking in.
How does Freebase expect data to be entered?
The meta-schema for Freebase is similar to RDF. There are Types and Properties; each Type belongs to a domain (namespace) consisting of similar Types. A Base contains Types and Properties describing a semantic domain. Topics are instances of (multiple) types and can be related via properties to other Topics.
The Types in this case are the column headings (except the index, of course). The Properties reflect the properties I discussed above. They relate Types together. Then the next question is how to map this model to the Freebase set of properties. Ideally, particularly for these high-level concepts, we would re-use existing relationships.
We also want to describe our Types using existing Freebase types and relations. E.g., we should probably reflect the fact that a BGC Zone is-a Location. The more annotations like this we can include, the more useful our data set should be for people who want to re-use it.
So I found the following mappings:
- BGC Zone is type Location (http://schemas.freebaseapps.com/type?id=/location/location)
- The same is true for the other subdivisions.
- The divisions are related using the contains property: http://schemas.freebaseapps.com/property?id=/location/location/contains
- The Phase and BGC Zone have alias properties http://schemas.freebaseapps.com/property?id=/common/topic/alias
- An NDT_Code becomes a new property of a BGC Zone.
Each cell in the table is then linked into this model; e.g., the cell BAFA becomes a Zone topic, with certain instantiated properties (like Contains SubZone 'un').
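To make the mapping concrete, here is a minimal sketch of how one spreadsheet row could be turned into linked topics. It is an illustration, not Freebase’s API: the `Topic` class, the `row_to_topics` helper, the column names `ZoneName` and `NDT_Code`, the `ndt_code` property identifier and the placeholder cell values are all assumptions. Only the three Freebase identifiers and the BAFA/'un' cells come from the mappings above.

```python
from dataclasses import dataclass, field

# Identifiers taken from the mappings listed above.
LOCATION = "/location/location"
CONTAINS = "/location/location/contains"
ALIAS = "/common/topic/alias"
# New property for the base; this identifier is made up for the sketch.
NDT_CODE = "ndt_code"

# Containment order, outermost division first.
LEVELS = ["Zone", "Subzone", "Variant", "Phase"]


@dataclass
class Topic:
    """A topic: a name, the types it is an instance of, and property values."""
    name: str
    types: list
    properties: dict = field(default_factory=dict)

    def add(self, prop, value):
        self.properties.setdefault(prop, []).append(value)


def row_to_topics(row):
    """Turn one spreadsheet row into a chain of topics linked by containment."""
    topics = []
    parent = None
    for level in LEVELS:
        value = row.get(level)
        if not value:
            break  # finer subdivisions are optional
        topic = Topic(name=value, types=[level, LOCATION])
        if parent is not None:
            parent.add(CONTAINS, topic.name)
        topics.append(topic)
        parent = topic
    if topics:
        zone = topics[0]
        if row.get("ZoneName"):
            zone.add(ALIAS, row["ZoneName"])
        if row.get("NDT_Code"):
            zone.add(NDT_CODE, row["NDT_Code"])
    return topics


# Placeholder row: only "BAFA" and "un" come from the real data.
row = {
    "Zone": "BAFA",
    "Subzone": "un",
    "ZoneName": "placeholder long name",
    "NDT_Code": "placeholder code",
}
topics = row_to_topics(row)
for topic in topics:
    print(topic.name, topic.types, topic.properties)
```

Each level carries both its own type and the Location super-type, and the containment link always hangs off the parent, so the Zone topic ends up holding the alias, the NDT code and the link to its Subzone.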
Using Freebase’s Schema Editor
Now I went to the Freebase Sandbox to actually create the schema. I created a new base, and in that base created my four types and added the Location super-type to each type. A Location naturally has the properties of containment, but I couldn’t understand how to override that property to specialize it further. E.g., my Zone should only contain Subzones, not any Locations. I was therefore forced to create special properties for this base.
The neat thing about Freebase is that I can add others who can now edit this schema. For example, a forester might have some other properties that can be added to a Zone. Even better, the schema can be re-used in other schemas: for example, using the Type Zone in some species project.
Accessing the new schema
You can check it out at http://bcbiogeoclimaticzones.sandbox-freebase.com/. The next step is loading the instance data into the base using Gridworks — stay tuned.
Update: Apparently ‘sandbox’ means the server is overwritten weekly. Should have anticipated that. Anyway, I have recreated the base at the main site: http://bcbiogeoclimaticzones.freebase.com/
Challenges
- Evolving schemas: what happens if the Forest Service redefines the data model or renames a Zone? In a linked data context, external parties will rely on my model in order to build applications. That means a change in my naming conventions (e.g., “Zone” vs. “BGC Zone”) will impact them. I’m not sure the Freebase folks have thought about versioning/backwards compatibility.
- Consistency: how can we check that our model is consistent? Do we care about issues like normalization in the linked triples context? How can I constrain the model? There are some opportunities in the Schema Editor, but we might want cardinality restrictions, logical tests, and so on.
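As a rough illustration of the consistency question, here is a sketch of the kind of check I have in mind, run outside Freebase over plain (subject, property, object) triples. Everything in it is hypothetical: the property names, the `CONSTRAINTS` table, the `check` function and the NDT code values are invented for the example and are not part of the Schema Editor.

```python
from collections import defaultdict

# Hypothetical constraints:
#   property -> (subject type, object type or None for a literal, max values or None)
CONSTRAINTS = {
    "contains_subzone": ("Zone", "Subzone", None),  # a Zone holds any number of Subzones
    "ndt_code": ("Zone", None, 1),                  # at most one code per Zone
}


def check(triples, types):
    """Return a list of human-readable constraint violations."""
    problems = []
    counts = defaultdict(int)
    for subject, prop, obj in triples:
        if prop not in CONSTRAINTS:
            problems.append(f"unknown property {prop!r}")
            continue
        subject_type, object_type, _ = CONSTRAINTS[prop]
        if types.get(subject) != subject_type:
            problems.append(f"{subject} should be a {subject_type} to use {prop}")
        if object_type is not None and types.get(obj) != object_type:
            problems.append(f"{prop} on {subject} should point at a {object_type}, got {obj}")
        counts[(subject, prop)] += 1
    # Cardinality restrictions are checked once all triples have been counted.
    for (subject, prop), n in counts.items():
        limit = CONSTRAINTS[prop][2]
        if limit is not None and n > limit:
            problems.append(f"{subject} has {n} values for {prop} (max {limit})")
    return problems


# Made-up instance data with a deliberate cardinality violation.
types = {"BAFA": "Zone", "un": "Subzone"}
triples = [
    ("BAFA", "contains_subzone", "un"),
    ("BAFA", "ndt_code", "code-a"),
    ("BAFA", "ndt_code", "code-b"),
]
problems = check(triples, types)
print(problems)
```

The same table could carry richer logical tests, but even type and cardinality checks would catch a Zone that contains a plain Location or carries two NDT codes.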
Case studies in requirements engineering research
I maintain (occasionally) a list of common case studies in the RE literature. These are extended examples or model problems, really, which can be used to compare various formalisms. I have links to the data and academic literature.
A common complaint in reviewing research papers is lack of real-world evaluation. But before we get to the thousands of requirements common in industry, we ought to also verify our approach with frequently used examples from the existing literature. I feel that ‘real-world’ is a substitute for voluminous; but often the problem is not (just) the sheer scale, but the interesting edge cases. For that job, I prefer we evaluate our work on smaller, easily understandable model problems.
The page is maintained at the software group’s web page, here.
Should we care about evidence-based software engineering?
- The field with a long history of evidence-based practice, and the most to gain from it, medicine, often doesn’t adopt the recommended practices, or the evidence chosen is irrelevant. Despite hand-washing or checklists being shown (proven?) to be very cost-effective practices to adopt, doctors still leave washrooms without cleaning their hands, and instruments still get left in patients. And in most software projects, there isn’t anything like that sort of liability.
- People don’t understand statistical generalization very well. Is that new pill reducing my risk of heart disease 20% more than the other pill, or 20% more than a regimen of Big Macs? Was this experiment done with non-English speakers? There’s a lot more to it than running a few t-tests and calling it a day. See e.g. “Why most published research findings are false” or a series of critiques on fMRI studies.
- Small results don’t say much. A lot of research is evaluated on small numbers of undergrads or focused on one particular organization (pdf). That evidence is useless to most developers. There is a paucity of in-depth, detailed case studies that generalize to meaningful theories. Personally I am in favour of a moratorium on experimentation in software research until more of these case studies are done. Unfortunately, the lure of the easy number is a Siren-call to reviewers and funding agencies.
- SEMAT to the contrary, there is no good body of software theory that would provide explanatory power to go along with results. Without a theory facts are descriptive; with a theory they can be predictive.
- It simply isn’t that important. Individuals and organizations do many things which research suggests are downright insane (like embarking on projects without clear requirements, or maintaining 30-year-old mainframes) and get by. In fact, anecdotal evidence suggests that many excellent companies started with poor practices, then refactored as needed. Probably, this is because evidence-based software development is a case of premature optimization. For example, despite reams of studies suggesting model-driven development is the way of the future, industrial adoption is underwhelming. Is it because they haven’t read the studies? Or that they evaluated the technology and concluded it wasn’t necessary? As academics, we tend to undervalue the benefit of anecdote and gut feelings. Most of the time this is probably correct, but only if we have evidence to support generalization to common scenarios. Most developers were so burned by the CASE tools of the 1980s that they have no interest in repeating the experience with UML.
I think my final point is that rationality is the exception, rather than the rule, in human behaviour. There’s no reason to lose much sleep over the fact that industry isn’t following evidence-based software practices.
p.s. I’m a complete hypocrite with respect to experimentation.
How do you write an algorithm?
It would be interesting to see how different people went about implementing a solution to a particular well-defined yet complex problem.
For example, I’m trying to encode a version of Tabu search for a particular domain. I have the generic Tabu idea – tabu list, random choices, periodic restarts – but now have to apply that to the specifics of the problem. As we know, these implementations can vary greatly depending on constraints such as the data structures that are already in the solution.
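For reference, the generic idea can be sketched in a few lines. This is a toy version under my own assumptions, not the domain-specific algorithm I am actually writing: the function signature, the parameter defaults and the bit-string problem used to exercise it are all invented for illustration.

```python
import random
from collections import deque


def tabu_search(initial, neighbours, cost, random_solution,
                iterations=200, tabu_size=10, restart_every=50,
                sample_size=5, seed=0):
    """Generic Tabu search: tabu list, random sampling, periodic restarts."""
    rng = random.Random(seed)
    current = best = initial
    tabu = deque([initial], maxlen=tabu_size)  # oldest entries fall off the end
    for step in range(1, iterations + 1):
        candidates = [n for n in neighbours(current) if n not in tabu]
        if step % restart_every == 0 or not candidates:
            # Periodic restart, or a forced one when every neighbour is tabu.
            current = random_solution(rng)
        else:
            # Random choice: only a random sample of the neighbourhood is scored.
            sample = rng.sample(candidates, min(sample_size, len(candidates)))
            # The move is taken even if it is worse than the current solution.
            current = min(sample, key=cost)
        tabu.append(current)
        if cost(current) < cost(best):
            best = current
    return best


# Toy problem: find a hidden bit pattern by flipping one bit at a time.
TARGET = (1, 0, 1, 1, 0, 0, 1, 0)


def cost(bits):
    return sum(b != t for b, t in zip(bits, TARGET))


def neighbours(bits):
    return [bits[:i] + (1 - bits[i],) + bits[i + 1:] for i in range(len(bits))]


def random_solution(rng):
    return tuple(rng.randint(0, 1) for _ in TARGET)


initial = (0,) * len(TARGET)
best = tabu_search(initial, neighbours, cost, random_solution)
print(best, cost(best))
```

The skeleton is the easy part. The domain-specific decisions (what counts as a neighbour, what goes on the tabu list, how cost is computed from existing data structures) are where implementations diverge.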
What I’ve done is start with the pseudocode from various publications on Tabu, and some of the projects I’ve looked at. Then I immediately set out to code it all at once, get something working that I could try. That failed, in that I only got halfway before realizing I’d made some design decisions poorly. Do you think experts would not get stymied at this point?
My next approach was to write it out in the language of my domain, as opposed to the implementation language, and to generate a small set of test cases that would simulate the larger problem without confusing the issue of the actual algorithm. As an aside, I really like the Ruby RSpec concept, where you write, in fairly natural language, what you are trying to do. Then iteratively fail until the spec is satisfied.
I think the problem with trying to write directly in code is that for complex algorithms (whatever your definition of complex is), I’m not sure you can easily grasp all the ramifications of your design. Probably some people are able to hold multiple cases in their head at once, and ‘see’ the solution. I would love to see a study looking at this question.
How might we test this? fMRI comes to mind, although it may be of questionable statistical value. Simple IQ test questions might reveal something, but I don’t know if IQ tests capture this idea of problem complexity. Code competitions probably come closest, but a) they don’t capture the thought process well (although there are fascinating screencasts at TopCoder); b) I get the sense that the problem statements are a little divorced from truly complex domains (“solve the fox and the chicken problem without using recursion”) – although I see they have a ‘software conceptualization’ component.
We really want to see if the person can account for multiple possibilities at once. Chess might be a good model, as there are so many branches (maybe too many). The problem is really to achieve a single goal while conceptualizing multiple requirements for getting there.
Frankly, however, as an employer you might be better off with someone who may not have the same raw mental talent, but at least knows when he or she is beaten. Better to work it out in detail first than hack together a possibly faulty solution. At least that’s my view (as someone who is not a mentalist). Not to mention it is by no means clear that building software requires the same skillset as maintaining (testing/securing) it.
Task-specific information visualization
I previously mentioned my doubts about general purpose information visualization techniques. Too often these seem to make a pretty picture for a conference, where the focus is the novelty of the visualization — e.g., a new way to display a fisheye view — rather than a sober focus on how that picture aids understanding. It is the cognitive support an infoviz offers that is its raison d’etre. It should show you something about the data you didn’t know before, some hidden pattern. But too often revealing that pattern requires knowing about it, a paradox. How do we reveal without knowing what we are looking for?
There’s a newish sub-discipline in computing called visual analytics. From a key resource:
Technologies are needed that will support the application of human judgment to make the best possible use of this information… [we need to] define a long-term research and development (R&D) agenda for visual analytics to address the most pressing needs in R&D to facilitate advanced analytical insight.
But why presuppose visual analysis is the most important/cutting edge thing to do? I speak with the experience of three years working with a visual tool for ontology understanding (analysis, if you will). My experience was that the visual aspect of the tool got in the way of answering the questions. Now, granted, the tool may not have been professionally designed, it had bugs, it used Swing, etc. But I haven’t been convinced by the later pretty pictures from other tools. What we are about is answering questions. I worry that when we come at this from a visualization perspective, we are essentially carrying a giant hammer around looking for nails.
I think the first thing to do is start with good, old-fashioned user centered design. Ask them what they want. Then brainstorm some solutions. Come up with some new interfaces. Iterate, allowing their reactions to the prototype to inform any new questions they might have. Don Norman casts some doubt on the capacity for this process to produce innovative design, but I think that’s fine in this context: we are answering existing questions. I agree that UCD can be held back by what Vicente calls the task-artifact cycle. Nonetheless, it is more important to identify what to look for (particularly given the high failure rate of innovative designs).
The problem is that to be effective, a cognitive aid has to be able to get out of the way. It should be completely unobtrusive. This is, for me, what Apple is so good at doing. You start off complaining about their design choices, but almost always it turns out to be really effective.
Here’s an example I was given once. Every police radio car has a laptop now, and it is here the officer punches in your licence and registration when he or she stops you. It queries the database and returns any information it might have on you — outstanding warrants, prior arrests, who knows. Well, a friend of mine told me that this screen is nearly always filled with useless information. Most of the time it is either empty, as you are not “known to police”, or it comes back with reams of data. If he stops someone he knows is a ‘bad guy’, it will report arrests back many many years. But that isn’t what he wants to know. He wants to know a) can I arrest him for anything outstanding b) what exactly has he been up to (recently). And right now that’s all they can find out. But I can think of many other questions: who has he been arrested with? Has he got any recent weapons offences? When was he stopped before but not arrested? And so on.
The challenge then is to
- answer the standard questions officers ask of the machine now;
- then look for other questions they didn’t know they had.
It would really surprise me if a well-tailored SQL report wasn’t sufficient at this point. To paraphrase a famous quote, if the answer to your analysis question is “use my infoviz tool”, now you have two problems. You still need to answer the original question, but you also need to learn the tool.
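To show how little machinery the two standard questions need, here is a sketch of such a report using Python’s built-in sqlite3 module. The schema, the table and column names, the `roadside_report` function and every row of data are invented for illustration. I have no idea what the real police database looks like.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a made-up schema and made-up rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE warrants (person_id INTEGER, offence TEXT, status TEXT);
        CREATE TABLE arrests  (person_id INTEGER, offence TEXT, arrested_on TEXT);
    """)
    db.executemany("INSERT INTO warrants VALUES (?, ?, ?)", [
        (1, "failure to appear", "outstanding"),
        (1, "theft", "served"),
    ])
    db.executemany("INSERT INTO arrests VALUES (?, ?, ?)", [
        (1, "assault", "2009-11-02"),
        (1, "theft", "1998-03-15"),
    ])
    return db


def roadside_report(db, person_id, since):
    """Answer the two standard questions: anything outstanding, and what is recent."""
    outstanding = [row[0] for row in db.execute(
        "SELECT offence FROM warrants "
        "WHERE person_id = ? AND status = 'outstanding'",
        (person_id,),
    )]
    # ISO dates stored as text compare correctly as strings.
    recent = db.execute(
        "SELECT arrested_on, offence FROM arrests "
        "WHERE person_id = ? AND arrested_on >= ? "
        "ORDER BY arrested_on DESC",
        (person_id, since),
    ).fetchall()
    return {
        "can_arrest": bool(outstanding),
        "outstanding": outstanding,
        "recent": recent,
    }


report = roadside_report(build_demo_db(), person_id=1, since="2009-01-01")
print(report)
```

The follow-up questions (who he was arrested with, recent weapons offences, stops without arrest) would each be one more query against the same tables, which is the point: the hard part is finding the questions, not drawing the answers.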



