Semantic Werks

Scientific Software Development

Neil Ernst — Thu, 25 Apr 2024 18:39:37 GMT

Overview

We’ve been doing a lot of research in my group around scientific software and technical debt, funded by the Sloan Foundation. As part of that work, I’ve written a post about the topic, along with some open research questions. While it is mainly for our own use, perhaps others will find this helpful. It is mainly an extended lit review somewhat loosely organized into themes.

Scientific software, or research software, is software that is written to support a scientific endeavour, including data collection, modeling and simulation, end user support, and others. Examples include codes for modeling climate changes, calculating shear forces in buildings, managing observation time on telescopes.

Other definitions: FAIR for Research Software, the DLR categories, (Hasselbring et al. 2024). The main challenge is that some research software moves: it may start as DLR application 0 (personal), but over time ends up in AC2 (long term libraries) or 3 (safety critical). AstroPy is a good example: it consists of common astronomical operations that many astronomers use, but started as core functions in individual projects.

My Journey

I worked in spatial analysis in undergrad: as an intern, mapping water rights, goat habitat, producing maps for the coast guard. I then worked in my masters with ontologies for cancer biology. In my time at the SEI, I was lucky to work on early planning for the US research software sustainment program from the NSF, which introduced me to the US movement for RSEs, with people like Dan Katz and Jeff Carver.

Since joining the university as a professor, I have been intrigued by the challenges of devloping complex software, in particular, how design choices influence subsequent problems, i.e., technical debt. This also aligns with a wider sense I have that software for climate modeling is a key capability for our future.

Early Work

The earliest work in RSE is basically the field of numerical analysis and HPC. In some sense, the entire discipline of software engineering is derived from scientific software, using simulations to model nuclear weapons, do weather forecasting, etc.

The field of RSE seems to have started between 2000-2005 or so. I think this coincides with general availability of software and the internet and web to connect people. There was a lot of focus on e.g. computational workflows at the time from Carole Goble. People were working on these issues before (e.g., at large US national nuclear labs such as Los Alamos) but there wasn’t a clear definition of the job area.

Jeff Carver’s first workshop is around 2008.

(Segal 2009) is an early paper at CSCW looking at the socio-cultural dimension. A few interesting things:

One, the idea that the main issue is to support the science. This is the key requirement.
There is an early focus in some of these papers on the scientist as end-user programmer. This was a popular notion in the 2000s that I think was on the one hand an utter failure (e.g. VBA), but on the other hand, just natural with the right UI (e.g., Excel formulas). And AI will make this much easier. Greg Wilson said something about how when he was teaching scientists in the 2000s, he struggled to motivate why they should not use Excel. Diane Kelly pushes back on this - of which more below.
A really detailed ethnography on a single lab management project.
As with most software projects, this one struggled with team dynamics and power.

A followup is (Segal and Morris 2012). This paper is similar to the preceding one, except for a more generic focus on the main differences with conventional development, and the ubiquitous (for the time) focus on Agile.

I should digress here for a moment and say I don’t find software process very interesting. The models - Scrum, Lean, Waterfall, etc. are all highly idealized and in my experience rarely followed in practice. Asking if we should be “agile” is answering the wrong question. You want to deliver things faster and with higher quality, so looking at practices to help that is the key. Anyhoo.

James Herbsleb did some work as well e.g. (Howison and Herbsleb 2013). They looked at incentives.

What Makes Scientific Software Developers Different?

Scientific software ~~has~~ seems to have a different context than a lot of other code. For example, it might be continually maintained. It has different testing needs. RSEs therefore have different approaches. Side note: I am not yet convinced scientific software really is different. A lot of the issues–complex domain requirements, performance intensive problems, team composition–are common in other domains as well.

(Pinto, Wiese, and Dias 2018) and (Wiese, Polato, and Pinto 2020) both report on a survey replication on scientific software developers. Nothing jumped out at me - the problems seem mostly similar to normal development; library problems, stable requirements, etc. Surprising to me was that only 5% of the issues seem science related. But I wonder if that is because the questions were not clear. If one asked where the major cost and development effort is spent, or talked to end users …

(Cosden, McHenry, and Katz 2023) is a survey of RSEs that looks at how they get into the field. Most are domain experts (75%) and the rest are CS grads. Both have different education challenges.

(Carver et al. 2022) continued this; again the findings are mostly interesting as a catalog of the state of things.

A lot of the differences, from what I can tell, is that for a long time RSE was not a career path, and so those folks were temporary, poorly paid, and not recognized.

Testing Scientific Software

(Carver et al. 2007) did a series of case studies looking at how scientific code was maintained, and why Fortran was so popular. I think this paper may have harmed the field, by downplaying some of the complexity involved. It comes across as “this is pretty easy stuff”. But some big projects in the DOD space are studied. Around the same time came (Basili et al. 2008). An observation of that paper is that

although HPC is focused on “performance” for scientists the performance of the code is less interesting than “time to result”, i.e., a publishable outcome. That time spans writing the code, testing it, running the simulation, etc.
Validation is tricky; outputs are often non-deterministic and probabilistic, inherent in simulating and modelling complex phenomena.
Programs are long-lived so there is deep scepticism of new tools as plenty of tools cannot make it 30 years. Funders authorizing Voyager, SKA, CERN’s LHC expect the billions of dollars to be used for decades. The software should be able to match that.
Programmers love being close to the metal, to keep things speedy.

A related paper is (Hook and Kelly 2009), which, while focused on mutation testing, has a nice figure showing the ways error can make its way into scientific code.

TODO: (Babuska and Oden 2004) TODO: (Eisty and Carver 2022)

Theories and Models

(Jay et al. 2020) reports on a workshop on translating scientific theories into code. I feel like this is where my interest is most piqued at the moment.

In addition to addressing the general difficulties common to all software development projects, research software must represent, manipulate, and provide data for complex theoretical constructs. Such a construct may take many forms: an equa- tion, a heuristic, a method, a model; here we encapsulate all of these, and others, in the term theory.

They point out the various places things can happen badly - in the sceince, in the code, and in the translation.

The whole idea of scientific computing is to test an imperfect theory of the (natural) world. As such the code and the theory often tradeoff:

Although it is natural to think (and is most often indeed the case) that one needs to formulate the equations and then apply computational algorithms to obtain the numerical solutions, the formulation of the equations can be affected by the choice of computational method. Cf. the simulations books

This blog post covers some of the early papers here in detail, although gets the intuition of chaos theory incorrect. Distinguishes between “error of measurement” and “error of specification”, looking at the tradeoffs between making models more accurate, but also more likely to cause issues with measurement error compounding.

Types of models/theories

Mental model of code: (Naur 1985)’s idea of theory.
- the code encapsulates a theory, that different people come to different, ideally shared, understandings of.
- each developer then adds to that theory his/her understanding.
- The theory (embedded in the software) is refined and adapted over time, e.g., with refactoring, new features, bugs, etc.
- The code in turn relies on different theories in architecture of the hardware, programming language, and packages/dependencies (e.g., what access control means)
- The theory might be encoded as a conceptual model, using a model-driven step as well, e.g., Simulink or Matlab code generation.
- There might be an explicit model the software presents for the science it encodes (“climate simulation using a 1km grid”), and another for ancillary functions.
A scientist has a theory, which the code should help to test/validate/confirm (choose your epistemological poison).
The end users have requirements and expectations of the code, as they use it.

Domains of Knowledge

I really liked this insight from an early research in RSE, Diane Kelly. (Kelly 2015) summarizes work on nuclear scientists in Canada and identifies knowledge domains. I’ll use climate models as an example:

Real world - how the carbon cycle works, solar radiation, forcings, etc.
Theory - the math underlying climate, e.g. differential equations, Navier-Stokes, thermodynamics.
Software - how to write effective Fortran code
Execution - how to compile Fortran, and deploy it to a cluster
Operations - how to use a climate model in production, including running experiments, testing outputs.

What this paper does is show how building scientific software is about moving between these worlds. I think the contention is that while more conventional software (payroll management) has elements of all 5, the real world is easier to understand, and the theory does not require advanced math. Plus the software is likely written in a more familiar language. But scientists probably don’t have a lot of training in 3, 4, 5, at least in the surveys done so far.

Tech Debt in Scientific Software

TODO: (Arvanitou et al. 2022) - How do SE practices mitigate TD. TODO: Melina Vidoni’s papers on R packages

(Eisty, Thiruvathukal, and Carver 2018) - a survey on how RSEs use metrics. They found that RSEs have a low knowledge of metrics, but of the ones used, performance and test metrics were most common. In appendix A they report on the types of metrics - only one respondent had heard of TD and none used it.
(Connolly et al. 2023) argues for a focus on the Three Rs - Readablity, Resilience, and Reuse. They detail the ways in which these three things can be accomplished depending on the importance of the project, e.g., individual, group, or community. It is not explicit about technical debt except that it focuses on software ‘resilience’.

Tech Debt and External Dependencies

Konrad Hinsen (Hinsen 2015) writes that the main issue is the dependency problem - e.g. in Konrad’s case, changes to Python 3 or new versions of Numpy.
(Lawrence et al. 2018) writes about ‘crossing the chasm’. The old free lunch model said new improvements in the same architecture (x86 for example) would improve speed. But now need to take advantage of parallelism and multicore, which require hardware-specific optimizations. There is a very thin abstraction over the underlying hardware in these performance intensive environments, which means even end-users often need to know obscure memory architecture details to squeeze concurrency. # Types of Scientific Software

Like all software, there is no one size fits all definition of scientific software. It can span many domains, is of varying complexity, written in different languages, etc. However, broadly speaking there are hobby projects and professional projects, characterized mostly by the number of support engineers and budget for operations. A hobby project is something a single PhD student might start and is often open source. She is the only developer and it is part of the PhD research. A professional project is something like the Atlas Athena software, with hundreds of contributors, full time staff, and decades of history. And of course this is a continuum. The German Aerospace Center (DLR) has similar guidelines., where level 0 is for personal use, and level 3 is long-lived, safety critical code.

Scientific Software in Canada

The state of the practice for RSEs in Canada is pretty dire. From a government perspective, we spent a lot of time (and $$) on building infrastructure. That was connecting things with high speed networks (CANARIE) and large compute clusters (Compute Canada). Then, for murky political reasons, there was some transition from those orgs to a central one (The Digital Research Alliance). Unfortunately it seems while the tangible cluster and network stuff continues to get buy in from the main funders, Innovation, Science, and Economic Development Canada¹, the software piece is harder to motivate.

Canada has no research software engineering alliance, like the UK, Germany and the US do. We have no real research labs, like the US DOE labs, and we don’t really do defence research outside of the DND Research groups. We once had software in the National Research Council, but that was axed, again, for reasons I don’t understand but had something to do with cost cutting.

Fortunately, there are some excellent folks in the space who are trying to keep things afloat, a few folks at the Alliance, and some (like me) academics. There are also top notch specialists running the clusters and software support teams at the universities, like UVic’s research computing team.

Things I’d like to know more about

how much time does a developer spend on the “science” part of the code, and how much on ancillary roles
Can we separate the science logic from the non-science logic?
1. What is the TD inherent/possible in translating from science to software? Pub pressure, student knowledge, legacy code
2. “Can we quantify or explain this loss/difference, and articulate the trade-offs resulting from translation?”
how do we compare different scientific approaches simply from software alone?
how do you retract/code review the scientific code?
1. what is the equivalent to peer review of the code?
2. what if the code is a complex model that is unexplainable? how do we test it? where is the science?
Can we trace the way in which the design of the code has changed from its initial design to the proper current design?
Social debt: how do we check what implications are? How does large team science play a role?
Ciera Jaspan’s paper (Jaspan and Green 2023): tools can tell you the current indicators. But what matters is how context defines this as a problem or not. E.g., migrating to Python 3, undocumented Navier-Stokes code. How do we extract this contextual knowledge from a project?

To Read

Initiatives

Better Scientific Software - training materials for RSEs.
Code Refinery - more training
Software Carpentry
Software Sustainability Institute
US-RSI
NumFocus grants
Chan/Zuckerberg grants
Exascale Computing - Interoperable Design of Extreme-scale Application Software (IDEAS) DOE 5 year software program
NSF large instrument group
RESA
IRIS-HEP
Collegville workshops

Various “scientific software community of practice” as mentioned in the Connoly article, at UW, CMU, etc.

Venues

Conferences, meetings, workshops
- SE4Science workshop
- Supercomputing conference workshops
- US-RSE conference - October
- UK-RSE conference - September. Why these two are so close in time is a puzzle.
- Alliance Canada Research Software conf. Now discontinued :(
Journals
- JOSS (and unnamed proprietary journal ending in X)
- Geoscientific Model Development (GMD)
- Computing in Science & Engineering

Example Projects

Climate

CESM
Can CM

Astronomy

SKAO
AstroPy
Einstein

Bio

Biology
rOpenSci and relevant paper https://arxiv.org/pdf/2103.09340.pdf
PsychoPy: This project is related to psychology and neuroscience.
biopython: This project is related to Molecular Biology.
RDKit: This project is related to Chemistry Informatics

Glossary

RSE: Research Software Engineer
SSI: Software Sustainability Institute
HPC: high performance computing, e.g., ‘supercomputers’

References

Arvanitou, Elvira-Maria, Nikolaos Nikolaidis, Apostolos Ampatzoglou, and Alexander Chatzigeorgiou. 2022. “Practitioners’ Perspective on Practices for Preventing Technical Debt Accumulation in Scientific Software Development.” In Proceedings of the 17th International Conference on Evaluation of Novel Approaches to Software Engineering. SCITEPRESS - Science; Technology Publications. https://doi.org/10.5220/0010995000003176.

Babuska, Ivo, and J.Tinsley Oden. 2004. “Verification and Validation in Computational Engineering and Science: Basic Concepts.” Computer Methods in Applied Mechanics and Engineering 193 (36-38): 4057–66. https://doi.org/10.1016/j.cma.2004.03.002.

Basili, Victor R., Jeffrey C. Carver, Daniela Cruzes, Lorin M. Hochstein, Jeffrey K. Hollingsworth, Forrest Shull, and Marvin V. Zelkowitz. 2008. “Understanding the High-Performance-Computing Community: A Software Engineers Perspective.” IEEE Software 25 (4): 29–36. https://doi.org/10.1109/ms.2008.103.

Carver, Jeffrey C., Richard P. Kendall, Susan E. Squires, and Douglass E. Post. 2007. “Software Development Environments for Scientific and Engineering Software: A Series of Case Studies.” In 29th International Conference on Software Engineering (ICSE07). IEEE. https://doi.org/10.1109/icse.2007.77.

Carver, Jeffrey C., Nic Weber, Karthik Ram, Sandra Gesing, and Daniel S. Katz. 2022. “A Survey of the State of the Practice for Research Software in the United States.” PeerJ Computer Science 8 (May): e963. https://doi.org/10.7717/peerj-cs.963.

Connolly, Andrew, Joseph Hellerstein, Naomi Alterman, David Beck, Rob Fatland, Ed Lazowska, Vani Mandava, and Sarah Stone. 2023. “Software Engineering Practices in Academia: Promoting the 3Rs—Readability, Resilience, and Reuse.” Harvard Data Science Review 5 (2).

Cosden, Ian A., Kenton McHenry, and Daniel S. Katz. 2023. “Research Software Engineers: Career Entry Points and Training Gaps.” Computing in Science & Engineering, 1–9. https://doi.org/10.1109/mcse.2023.3258630.

Eisty, Nasir U., and Jeffrey C. Carver. 2022. “Testing Research Software: A Survey.” Arxiv:2205.15982.

Eisty, Nasir U., George K. Thiruvathukal, and Jeffrey C. Carver. 2018. “A Survey of Software Metric Use in Research Software Development.” In 2018 IEEE 14th International Conference on e-Science (e-Science). IEEE. https://doi.org/10.1109/escience.2018.00036.

Hasselbring, Wilhelm, Stephan Druskat, Jan Bernoth, Philine Betker, Michael Felderer, Stephan Ferenz, Anna-Lena Lamprecht, Jan Linxweiler, and Bernhard Rumpe. 2024. “Toward Research Software Categories.” https://arxiv.org/abs/2404.14364.

Hinsen, Konrad. 2015. “Technical Debt in Computational Science.” Computing in Science & Engineering 17 (6): 103–7. https://doi.org/10.1109/mcse.2015.113.

Hook, Daniel, and Diane Kelly. 2009. “Mutation Sensitivity Testing.” Computing in Science & Engineering 11 (6): 40–47. https://doi.org/10.1109/mcse.2009.200.

Howison, James, and James D. Herbsleb. 2013. “Incentives and Integration in Scientific Software Production.” In Proceedings of the 2013 Conference on Computer Supported Cooperative Work, 459–70. CSCW ’13. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2441776.2441828.

Jaspan, Ciera, and Collin Green. 2023. “Defining, Measuring, and Managing Technical Debt.” IEEE Software 40 (3): 15–19. https://doi.org/10.1109/ms.2023.3242137.

Jay, Caroline, Robert Haines, Daniel S. Katz, Jeffrey C. Carver, Sandra Gesing, Steven R. Brandt, James Howison, et al. 2020. “The Challenges of Theory-Software Translation.” F1000Research 9 (October): 1192. https://doi.org/10.12688/f1000research.25561.1.

Kelly, Diane. 2015. “Scientific Software Development Viewed as Knowledge Acquisition: Towards Understanding the Development of Risk-Averse Scientific Software.” Journal of Systems and Software 109 (November): 50–61. https://doi.org/10.1016/j.jss.2015.07.027.

Lawrence, Bryan N., Michael Rezny, Reinhard Budich, Peter Bauer, Jörg Behrens, Mick Carter, Willem Deconinck, et al. 2018. “Crossing the Chasm: How to Develop Weather and Climate Models for Next Generation Computers?” Geoscientific Model Development 11 (5): 1799–1821. https://doi.org/10.5194/gmd-11-1799-2018.

Naur, Peter. 1985. “Programming as Theory Building.” Microprocessing and Microprogramming 15 (5): 253–61.

Pinto, Gustavo, Igor Wiese, and Luiz Felipe Dias. 2018. “How Do Scientists Develop Scientific Software? An External Replication.” In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE. https://doi.org/10.1109/saner.2018.8330263.

Segal, Judith. 2009. “Software Development Cultures and Cooperation Problems: A Field Study of the Early Stages of Development of Software for a Scientific Community.” Computer Supported Cooperative Work (CSCW) 18 (5-6): 581–606. https://doi.org/10.1007/s10606-009-9096-9.

Segal, Judith, and Chris Morris. 2012. “Developing Software for a Scientific Community.” In Handbook of Research on Computational Science and Engineering, 177–96. IGI Global. https://doi.org/10.4018/978-1-61350-116-0.ch008.

Wiese, Igor, Ivanilton Polato, and Gustavo Pinto. 2020. “Naming the Pain in Developing Scientific Software.” IEEE Software 37 (4): 75–82. https://doi.org/10.1109/ms.2019.2899838.

Footnotes

I initially wrote this as “industry and economic development” which really gives the game away, as most of the money seems to be used in subsidizing industry in a desperate and futile attempt at improving productivity.↩︎

The PI as COO

Fri, 03 Jun 2022 07:00:00 GMT

Faculty at a research university wear many different hats. One analogy might be to the executive roles in a company. Now a company tries to make profit, which is usually not the goal in academia (quite the contrary). But still, one can think of the mapping like

Title	Tasks for Faculty Member
CEO	Strategy, overall planning, team management
CFO	Budgeting, managing funds, making “payroll”
CIO	Figuring out IT equipment, choosing between cloud, govt infrastructure, managing equipment
CTO	Looking at trends and new tools (e.g., Google Colab, VS Code, Qualtrics)
CMO	Marketing the team and PI to the world, creating a need for the lab’s products (papers)

The one role not listed here is the one that I think in many ways is the most important: COO. Now, I don’t know much about business roles, but to my mind the Chief Operating Officer is sort of similar to the executive officer of a submarine. They are the person who makes the business work: ensuring inventory is at the right level, planning for new space as the company grows, managing supply chains, etc. Tim Cook, now CEO at Apple, made his name growing Apple’s manufacturing to the vast network it is now. That meant ensuring secrecy, getting supplies to factories, scaling to manage millions of devices being released at the same time, etc. Apple isn’t profitable and gigantic if the day-to-day operations aren’t running efficiently.

But the same is true in academic life!

I think of the analogy as requiring thought and attention to the day to day management of productive work in the university. You need to make sure the ‘operations’ are smooth. One small component of this is the paper funnel: we need to ensure we have a lot of ideas in our funnel, that they get turned into data collection and analysis, and that the analysis gets written up and submitted, and eventually published. This is Arvind’s statement below about “getting sh*t done”, because it can be frustrating to think of oneself as moving papers from idea to publication. We want to pretend we are supposed to be thinking about ideas, noodling on the whiteboard, and being inspired by genius. And we are! But that’s not the COO part of the job.

Academics would double our productivity if we learnt some basic project management skills that are bog standard in the industry. We have this myth that scholarly success is all about brilliance and creativity, but in fact 90% of it is getting sh*t done, same as any other job.
— Arvind Narayanan (@random_walker) June 2, 2022

Metrics

We could look, like I’m sure Apple does, at operational metrics and efficiencies. For example:

Number of papers in draft/being reviewed/published (WIP)
Time between idea and paper publication (lead time)
Students graduated in expected time
Number of important unanswered emails in Inbox
To the nearest thousand, how much money is in various accounts, and what the projected “burn rate” is for those accounts.
How long has each student been in the program, what milestones have they finished, and when they should graduate
Grant money received
Grants used efficiently
Reviews conducted within deadline
Talks invited/given/follow up
Size of industry collaboration address book
Travel reimbursements received within 30 days of trip
Time between equipment being needed and equipment being purchased and equipment delivered

Now for some of these you might say “but someone else is blocking that!” Which is of course true of EVERYTHING and also not an excuse Steve Jobs was likely to accept. That’s all part of being efficient and operating smoothly. If you know the university takes forever to process room bookings, you need to factor that in to the operational goals.

Why Care About Operations

I think operational efficiency is what separates average researchers from those seen as impactful. Sure, in some cases it might be a brilliant one-off paper, but often we reward output volume: “quantity has its own quality”. Did the project lead to a single paper, or did you harvest 3-4 papers from it? That’s an operational detail that has to do with a PI’s ability to direct students, target appropriate venues, manage meetings to keep the papers on schedule, and so on.

I think the importance of the COO view of one’s career is that for better or worse these outputs are the easiest to turn into data, and subsequently evaluate you, and your institution. So ignoring number of students graduated, or number of publications, or grant values, will result in poor scores on these data metrics. It doesn’t matter how many brilliant ideas you have if no one gets to read them.

The question for this COO view of a research career is to figure out which metrics one truly cares about, and when to stop focusing on operations and think more about strategy and trajectory. Metrics, because the metrics you choose reflect your priorities (e.g., papers published vs industry collaborations nurtured), and strategy, because (hopefully) the research you pursue should reflect some higher level of understanding about what problems are important to be spending time on.

My approach

For me personally, it can be hard to remember to manage the operational details. The easiest way for me to see this concretely is when papers fail to meet a venue deadline. That’s an operational failure: we didn’t move fast enough on data analysis, the meetings were not productive and the project spun its wheels, I didn’t kill the project or value the cost of delay, I answered emails about committee work rather than spending 2 hrs editing.

My current management approach is to check in on each project (I have about 9-10 in various stages) weekly, using a dedicated card (using a note in Apple’s Notes tool). A Kanban board with stickies can be really helpful here too, but the important thing is not the particular system but that you use it and check it regularly.

Another idea I have just started to implement is reflecting on lessons learned from a project (e.g., after a paper is published). Not just the research problems, but the operational challenges. What would I do differently for project management? What worked well in moving the project along? Was this a productive collaboration? Why did it get delayed (it’s always delayed)?

The Scientific Method 2021 edition

Thu, 03 Jun 2021 07:00:00 GMT

The typical Science workflow is something like

A - Related work: build a Prior about a real world problem, like pulsar formation

B - Fieldwork: Collect data about that problem in the ‘field’

C - Model: Build a model of the problem

D - Posterior: Generate a posterior/data output

E - Analyze: draw inferences about A, updating the prior if necessary.

We can have several types of problems - “debt” - in this process.

In A, we could have a misinformed prior. We might not have seen the latest result, we might have a bad mental model of the real world.
In B, we could take shortcuts in data collection, e.g. a “small telescope”

Note that in A/B, we are in traditional science. This is the stuff you would learn in Astro 200 - how planets form, how to take readings, etc.

In C, we traditionally did nothing very special for our ‘model’ to create a posterior. OLS regression is the most likely approach, and we could use tools like Excel or some deeply trivial subset of SPSS. The model might even be implicit in the machine itself, such as a spectrograph. Now, of course, our problems are more noisy, the datasets from B larger, and the ‘equipment’ - the math - more sophisticated. So our models are expressed as Fortran code, as Jupyter code running MCMC, or as deep neural models. Here we enter the world of software and data engineering. We might need code review, as opposed to just peer review.
In D, we have the problems - and depreciation - of moving data around at volume. This includes network infrastructure, GPU provisioning, HPC problems. There are obviously large set of questions here that I won’t tackle, but you all probably know a lot about!
In E, our challenges are again scientific: draw supported inference, do classification, prediction, create better explanations and build useful theories. Debt in this context is again about sloppy science: P-Hacking, HARKing, and so on. Personally I think open science policies will go a long way to removing those problems.

I want to focus on C, as an issue of scientific technical debt. I think a lot of challenges have to do with understanding the tradeoff between the scientific challenges in A/B/E, and the engineering focused challenges in C and D. Ultilatemly to do good science - to design vaccines, to predict climate adaptation approaches, to better understand pulsars - we need to have all of these phases working efficiently. Since I know about C that’s where I’ll focus.

Making Models Accessible

There are two dimensions (or more) to this question. The first is about using models and building them; the second is about querying them.

Model Usability

We now see tons of tools that help with modelling. Here are some illustrative examples

Fortran, in Global Climate models
Python, in Ralph Evin’s building sims
Astropy
Commercial: Tableau, RStudio, Metabase, https://core.hash.ai/@hash/wildfires-regrowth/9.8.0 are moving from building dashboards and visualizations into more complex modelling support. But like dashboards, the challenge is less about using bars vs lines, and more about the A/B and E parts: what problems do I need to learn about.

One challenge in the move to more complex models is of course the inflection points: the places where the models fail to capture the real world. It is trivial to build a predator-prey model; to build one that accurately captures the dynamics of wolf/elk dynamics in the Mackenzie valley is entirely different, and probably the validity of the model is only understandable by maybe 100 people in the world. Increasingly the challenge in peer review is not about (or not just about) the problem’s relevance or significance, but whether the data and model support the claim. Technical debt in science is a massive concern if we are going to rely on large, difficult to verify datasets for our claims.

How easy it is to build the model:

Show examples of RStan, Fortran, Hash as dealing with the problem of “building the model”

Accessing Model Outputs: Digital Twins

Bryan Lawrence blog post

The idea that in SKA you might not get raw data

Finding Bugs in Notebooks

Consider our study of how people find notebook problems. Given a notebook on a topic the subject understood well, we asked them to find any problems (0-4) in the notebook. This is essentially peer review of the notebook, except the authors can’t reply or help, like in code review. With “many eyeballs all bugs are shallow” is a flawed truism that seems to hold for science; if we can get lots of people looking we can find some of the flaws.

Hands up if you’ve had to rewrite analysis or model code because of a flawed assumption, misunderstood stats package, or unreliable initialization?

On Bots in Software Engineering

Neil Ernst — Thu, 15 Apr 2021 07:00:00 GMT

Bots as assistants

The state of bots

There have been a number of categorizations of bots for software development. The main categories seem to be the ones that Erlenhov came up with, which look at bots as either API endpoints (CI tools), developer assistants, or something more sophisticated. I don’t think there is much to say about CI tooling. This is a spectrum in my view; we will likely see CI tasks grow more sophisticated and multistep.

The paper Peggy and Alexei wrote in 2015 seems to indicate nothing has changed much since that paper. In one sense we are still just doing the same thing; bots as API interaction points, where they automatically or under prompting carry out some well defined and usually pre-existing task like a compiler and compiler flags.

This is in contrast to the bots that appear on corporate sites to maximize engagement and increase sales. There the bot acts as a query refinement tool, helping you to sort out what you actually are looking for. This type of extensive interaction seems to contradict what developers want.

So one question is instead of what bots should do, what tasks are we looking to help with. Bots are useful because they encapsulate common tasks that otherwise humans would have to do. For example we once took punch cards to an operator to enter into the computer; now that is done by hitting compile/run. I had a hierarchy of problems we can get help on:

Syntax problems - compilers, integrated into most IDEs, now flag obvious problems before you need to run the code. IntelliJ for example can automatically detect type problems.
Warnings and flags - running a compiler with default options just gets the bare minimum. Standards like MISRA-C specify known problems the compiler can find for you.
Linters - plenty of problems with code are known, so we can find these things that clearly violate best practices, such as equality checks in Javascript. Often these are integrated into tools like SonarQube or CI environments.
Code smells - the next class of code issues are called smells and have to do with slightly more complex problems, often spanning multiple modules. So for example long methods, long parameter lists, and so on. This is also where we might find violations of language paradigms, such as not using list comprehensions in Python.
Design problems - the highest level of problems we might ask for help with are design flaws. Here we want to flag issues that we think will impinge on quality attribute satisfaction. We might identify that the code misuses the Obsever pattern.

There are some others that are a bit tangential to these: flagging security problems; identifying dependency or build issues; UI problems; test coverage issues.

The fundamental issue in my view for AI assistance is that it is super easy to tell people what they already know. So in self-driving the issue is not to drive from A to B in the sun; I could probably train my son to do that. What we need is meaningful assistance, telling us the things we didn’t know, or couldn’t know, on our own. Thus bug localization that finds bugs we already know about, code smell detectors we don’t care about, and most other data studies published on effort estimation and so on.

The challenge is that getting tools to the 70% accuracy level is super easy, with today’s tools. But humans are insanely smart, and so that first 70% is not the hard part. It is in the last 30% that we need the help, but also the hardest to decipher; figuring out what the image is really showing in the dark.

I think bots are similar. Right now they are basically just dumb endpoints to an API with a slightly improved interface. Thus Dependabot telling us that a library is outdated is not particularly interesting, since it is just an interface to a more complex script running in the background. What’s novel in Dependabot is that it is able to interact in a well understood way, not the complexity of what it is trying to do. Similarly the bots one sees on airline sites are not interesting assistants, but only simplistic interaction techniques in a web search world. After all John Mylopoulos was working on natural language interfaces to databases 40 years ago.

So what does this more interesting bot look like then? Is it more than just an API endpoint or something else? In my view the next step is truly an assistant that is contextualized to the person asking the question, to the project in which it runs, and the particular time it is being run. It is in short a very efficient, Ms Moneypenny like administrative assistant, capable of anticipating the needs of the developer, but unobtrusively.

How do we get there? There’s a number of pieces for this vision.

Interface: how do developers like to get information? Right now that is things like IDEs, compiler warnings, interactions on pull requests: working with the artifacts they already care about
Persona: how should an assistant interrupt? Should they be factual and say we detected a problem on line 50? Or should we omit GUilfoyle’s annoying verbal tics?
Context: how does the assistant extract the context information it will need to be useful and not annoying?
Metrics: what would be a relevant way of assessing success? I don’t think we even have a good idea on this.

Tasks

Detailed tasks: Mylyn style tasks that have to do with a specific problem like finding a bug, refactoring a method
High level tasks: get an overview of the system in order to see how it is progressing. This bot might send a weekly update on lines of code added. Sort of exists as Github’s various visualizations.
Design tasks: help me understand how the software will respond to quality attribute requirements.

Tasks bots can help with

Are bots just “API endpoints”?

Bots are api calls plus “vocal tics” like the fridge in Silicon Valley

“It’s bad enough it has to talk. Does it need fake vocal tics like ‘ah’?

“The tics make it seem more human,” Dinesh tells Gilfoyle.

“Humans are shit,” Gilfoyle replies. “This is addressing problems that don’t exist. It’s solutionism at its worst. We are dumbing down machines that are inherently superior.”

The challenge in these systems has always been that entry level knowledge is extremely easy to retrieve, but going deeper is way harder (kind of like self-driving). For perhaps 80% of the interactions online, the bot can manage. But it is in the details that bots get stuck and need to call for the operator to step in. We saw something like this with expert systems. Coding something that can advise people to call their doctor when they report a fever of 102 or higher is pretty simple. But the complex explanations as to what is causing the fever are fairly intractable (explanation is usually thought of as NP hard, after all). Getting the knowledge in to solve the problem (basically, all the heuristics and learning that an experienced GP would have) is very expensive - the KA bottleneck. This is probably less costly now with deep learning. But the other bottleneck is the reasoning. Even if we have that knowledge, inference to multiple competing explanations is very expensive. Recommending the most common explanation—such as viral ear infection in a toddler—is what bots basically do now, but in many cases there isn’t a clear common explanation, or there is no clear set of symptoms to diagnose.

Bots for TD reduction

One area we see a lot of activity is in static code analysis to find rule violations. There are more rules than programmers could reasonably want to use; code quality checks, syntax warnings, code smells, etc. The problem in fact is these warnings annoy developers. At Google they had a scheme where the code checks would be rejected if they had more than 10% false positives which developers could vote on. These tools generate multiples more warnings than developers actually take action on.

How could bots help? Well, the main issue I notice is the need for interactivity. Bots could easily process the boring problems in one shot (fix all trailing commas), but more importantly the bot could be an interface to the tool, instead of the common approach which is either a dashboard with TMI, or some simple weekly report. The bot instead could be a text interface to the tool itself, and run predefined queries or make new queries, adapting on the fly as the situation demands it. So bots are good for rapid re-contextualization which you do not see with dashboards, which require sophisticated analysis to configure.

A bot is also automatable so it could delivery weekly updates to the developer without him or her having to do anything with that info. Again though it seems like we are pushing the complexity - what reports do I really need - into another interface. The tough problem in TD presentation is to figure out exactly what context underlies the data and only show that.

We need to find problems and generate data

We need to filter and store the data

We need to query the data

We need to visualize the data.

Bots don’t make any of these easier per se….

REFSQ Panel session on Open Data and RE Education

Thu, 15 Apr 2021 07:00:00 GMT

Together with Alessio Ferrari, I organized a panel at the well-regarded conference on Requirements Engineering: Foundation for System Quality, which is a mouthful but really nice working session on RE that has always nicely blended practice and research. It also was the first place to accept one of my papers so I will always have a soft spot for it, and for Essen, industrial city or not.

The purpose of the session was to encourage open data packages in the context of RE Education (the aim, I think, is to have subsequent OpenRE tracks at REFSQ change theme). We got excellent submissions and accepted three packages, which we have hosted at the existing repository of the RE Education and Training (REET) workshop.

After the short talks on the packages, we turned to a panel with the theme of “RE in the age of COVID”. Our hope was to collect some experiences from the attendees (40 or so) on how they approached RE education, and RE in general, during the COVID induced shift to online learning. We definitely got that and more generally, I think it was a cathartic session to commiserate and share with others the challenges of the past few terms.

A few lessons I drew from the discussion:

Participants were a bit torn on the need to completely redesign the curriculum vs sticking with the previous content. “Maybe my course was boring and remained boring!” Of course in some cases just getting online was sufficiently challenging to prevent major redesigns. In general projects worked well in both formats. There was some thought that lab exercises worked better, since it was easier to checkin—students were in a fixed location!

Learner styles or perhaps preferences (since “styles” are not a thing) was something we didn’t have a good handle on. Some students definitely prefer online. But no one had data on who is doing better, and who is doing worse, online vs offline. For example, students seem to appreciate recorded lectures, but mostly to replay/relisten. The downside is it is harder to get questions in a recorded lecture. Even in more normal settings, students are not always sold on flipped classroom. Then there is the problem of video content: should we re-record videos? In the end it’s about making the content relatable and helping them through the struggle with it. Thus there is no substitute or tech fix for the need to demonstrate empathy, use multiple learning techniques, and I suppose the things we know work well in teaching regardless of venue.

There was growing recognition that—online and off—bringing some levity and enthusiasm, such as via Serious Games, was critical to keep people engaged in the Youtube and Netflix era. Dan Berry, who has a charismatic personality, suggests we think about being a comic. But of course that will not work for everyone. Even during ‘normal’ lectures, it is not uncommon for 60 students to turn into 3-4 actively participating, 15-20 in class, and some coming merely to sleep.

In other settings, participants acknowledged a need to maybe step away from the computer and do a lecture from outdoors, away from disruptions. The 1 year mark of the pandemic led to a let down in formality, with less emphasis on formal backgrounds and acknowledgement that it was ok for it to be weird.

There were a few folks who ran hybrid classes, where the university allowed some reduced subset to attend class. The popularity of this depended greatly on the perceived safety level. There were often technical challenges e.g. mic’ing students and sanitizing mic before answering a question, how to get video that made remote participants still feel engaged (e.g., eye contact).

The final takeaway was about student well-being. Birgit Penzenstadler, who studies this in her research, emphasized the need to meaningfully check in and get beyond the “how are you doing” question. This is, as she points out, precisely an RE elicitation problem, e.g. “what is the biggest impediment” you are currently facing. We agreed that for most of us, the reality of needing to meaningfully check-in was hitherto unappreciated, and something that is completely independent of learning modality or current global crises. Meaningful checkins are certainly something I will be including in my own teaching practice, online or off (but I hope in person!).

The Triumvirate of Teaching and Work Life Balance

Wed, 03 Mar 2021 08:00:00 GMT

Most courses have a series of learning outcomes for students. Once you have done the course (and, I assume, gotten a B or some reasonably high mark), then you know how to accomplish the learning outcomes.

Some may break those learning outcomes down to smaller units per module.

For instructors, it occurs to me there are three objectives to balance (at least):

How much effort it takes to teach the topic
How much students appreciate the topic and teaching choices
How much students learn after being taught

Number 2 is not ever considered in pedagogy, but is the ONLY thing that matters from a management point of view. It maps directly to the things RateMyProf and course evaluations measure. Therefore from a pragmatic point of view, a prof should only care about 1 and 2.

Number 3 is what pedagogy is all about: how much are students learning? This is what is referred to when we look at things governments fund us to do. They want more “skilled workers”. Students in the short-term don’t really care much about this stuff. And course evaluations don’t really test for this. Teachers who are really good at 3 often end up getting punished for 2 (learning things is hard and not fun!)

Number 1 is often ignored, too, but can be the difference between having a good term and a shitty term. For any given topic, there are many ways to think of teaching that topic. Take data flow diagrams:

- we could lecture off a set of slides showing DFDs
- We could use a textbook reference and just skim the topic
- We could create a detailed case study and show DFDs by construction
- And for each of these, we could come up with different teaching strategies: whiteboard, live coding/drawing, interviewing experts etc.

As someone with limited time, one goal has to be minimize #1. My contention is that it is easy to go for perfection in 3 and absolutely devastate yourself in #1.

What I try to do is:

Destroy with fire any plan that maximizes 1, but minimizes 2 and 3. If students aren’t learning, aren’t enjoying it, and it is a lot of work, you should NEVER do it. And I bet you would be surprised how often this case happens. For example, in my third year software design course I spent a ton of time (increasing 1) ingesting the Play framework, learning how it worked, so that students could use it in the project. But it was a huge pain for the students to work with (hurting 2), they wanted to use React instead (hurting 2), and most of the learning was about the Play framework itself, rather than software design concepts (hurting 3). I won’t be doing that again.
Ruthlessly assess how important #3 is for your career. I haven’t seen anyone whose teaching packet evaluates 3. And yet it seems like we all talk about how it is the only goal. I really hope researchers improve on this. For tenure, for example, I seriously doubt anyone is looking at this. The closest we come is peer evaluation, but as Mark Guzdial wrote, this is often the blind leading the blind.
Teaching awards seem to be about 2: winners tell jokes, they dress nice, they are male, they seem knowledgeable.
Given what we think we know about how software is built, I am going to guess that some teachers are really effective (either for 2 or 3) for minimal effort, and others put in orders of magnitude more time on 1, but have little extra to show for 2 or 3. I believe there is a non-linear, diminishing returns model for teaching effort; you might do 10 hours of prep for a lecture and not have much more to show for (3) then if you had done 1-2 hours.
There is a sunk cost/amortization problem here, too. If you teach a course the first time, you may have a lot of 1 to pay off. Subsequent offerings might greatly reduce 1 and allow you to focus on 3 (or 2) to a greater extent. But I’m not sure how much this is true. Things move fast in software courses, especially in 2nd year and above, some costs simply don’t amortize (marking), and we often try to improve the course year to year. Plus, we might not get to teach that course more than a few times.

I want to be clear that I am not endorsing a focus on teaching for “show” instead of long-term learning. 3 is clearly the goal. But the reward structure does not reflect this. We should figure out how to ensure that teaching is measured against outcomes on 3, and not on 2 (2 is also horribly biased!).

More importantly, there seems to be embarrassingly little data on how to minimize 1 and maximize 3. I think this is a problem. We have a lot of info on (for CS1 at least) how to best teach linked lists, such as using Parsons problems. But frankly, my job involves teaching 40% of my time. I may not be able to dedicate the time required to prepare Parsons problems for the course. So a “cost/benefit” (1 vs 3) analysis would be very useful to help me maximize the teaching effectiveness for unit of teaching effort.

Running a Mining Challenge Using Kaggle

Wed, 30 Sep 2020 07:00:00 GMT

For the 2nd edition of the Dynamic Software Documentation (DysDoc) workshop, the organizing team wanted to push the boundary on how to engage the community in tool supported demos. Previously, we had asked participants to come to the workshop (co-located with ICSME) with a tool to demo, live, to the other attendees. One of the goals was a tool that worked on unseen data.

This year, at our organizing meeting, we wanted to try something that went beyond documentation generation, and looked at other issues with dynamic documentation fixes. A study by Walid Maalej and Martin Robillard, which looked at types of API documentation, included an interesting issue with documentation - code comments - that were uninformative.

/** 
 *  Clears the log
 */
public void clearLog() {
  LogMessages.getInstance().clear();
}

This comment is clearly not adding information to the code, and in fact, might even be harmful, if it were to be outdated. Thus our “Declutter” Challenge: figure out a way to identify these type of comments and (eventually) target them for removal. I was co-organizer alongside Nicole Novielli.

We were inspired by the success of datasets and benchmarks such as Fei Fei Li’s ImageNet contests, or the SAT competition. Both of these have been influential in driving innovation in graphics and satisfiability solving. As it turned out, these distributed competitions were also ideally suited to the new remote work paradigm that was required during the COVID pandemic.

To set up this competition, there was the option to have the competitors take the dataset, work on a solution, then submit their solution to the organizers for evaluation. Of course, this involved a lot of work on the part of the organizers, and being fairly lazy, I looked for an alternative approach. Immediately the Kaggle competition platform seemed the way to go: it has been hosting large-scale data science competitions since before there was data science.

I therefore investigated how this could work. Normally, hosting Kaggle competitions require payment (commercial) or prizes (academic). Academic contests are also selected by Kaggle. Fortunately, Kaggle makes the platform available for classroom use, on Kaggle InClass. The difference, other than no support, is that competitors do not get Kaggle points for entering. Nicole and I thus decided to use Kaggle to host the Declutter challenge, which you can find here.

What Worked

Despite lacking support, getting the contest going on Kaggle was fairly simple. Nicole organized labeling with the rest of DysDoc’s organizers, and hosted a gold set on Github. I then used the gold set to generate the inputs Kaggle wanted. This includes the gold set, split into training and test. Test data is further split into “public leaderboard” and “private leaderboard”, since competitors can submit multiple entries, and see where they stand on the leaderboard. Only the organizers get to see the private leaderboard, which is what ultimately ranks the competitors. You can see that in practice here.

I also had to choose between Kaggle’s available validation metrics, in this case choosing F1, and this can be a bit finicky, as you have to map between the columns in the solution CSV file and whatever Kaggle’s automation expects. Clearly at paid support levels they would make this simpler, or just do it for you.

Despite not having a lot of labeled data, we managed to get a good set of data, and Kaggle’s infrastructure worked - as far as I know, anyway! - with no problems. Competitors download the data, run their model, and then upload a solution file with their predicted labels.

We managed to get 2 principal competitors, one of whom submitted several distinct entries. Both entrants published their submissions at our workshop, which can be found at the ICSME proceedings site.

Improvements and Questions

I was quite happy with how simple Kaggle made the process of evaluating entries. It also scales flawlessly (unlike me), and in theory, could help us dramatically expand our contest. In the COVID era, of course, it also made it pretty easy to host a remote contest, unlike our previous approach, which used in-person demos.

It would be nice to have more support for notebooks, or perhaps a mandatory notebook submission, so that we can see how each group approaches the problem (after it finishes of course).

As far as I am aware, this is one of the first challenges to be hosted on Kaggle. To me it seems like an obvious choice for running and hosting automated benchmarks, such as the various effort estimation and defect prediction datasets out there. If we could disambiguate entries, that would help with understanding who is entering.

Kaggle makes it possible to host an ongoing, never-ending contest, which is also appealing. The obvious bottleneck, unsurprisingly, is data annotation, and at this point I would say that is the main obstacle to running more such contests. However, we have tentative plans to continue the approach in future workshops.

Academic Job Searches—A Canadian Perspective

Thu, 16 May 2019 07:00:00 GMT

Academic job interview season is wrapping up, so I thought I’d capture the process from the Canada point of view.

Academic CS jobs in Canada follow mostly the same pattern and process as the US (here I am talking about research-focused, tenure-track roles). Hence I think most of the advice from Philip Guo, and Wes Weimer and his academic offspring, are totally applicable (and indeed, were what I relied on in my search). There are a few subtleties I think are useful for applicants to know. Disclaimer: I am relatively junior, and only have limited experience applying in Canada, so these insights are based on my limited sample and from talking to colleagues here. I am also not a legal or immigration expert, so I make no warranty about this advice.

Understand why you are interested in Canada

In the Weimer/Le Goues job seeker’s guide, they make the point that US candidates tend not to move to Canada. I think it is safe to say that if you have spent your entire master’s/PhD in the States, you have worked in the NSF/DARPA/DOD model of funding, and have family ties there, then the switch to Canada would be a big change. I think this is especially true for smaller schools like ours. We’re a small place but proud (both Canada and Victoria!). So make it clear in your cover letter or interview why you would come. Hopefully reading this guide will help!

Explain/communicate the proposed Discovery grant you would win

In Canada funding is fairly different than the US. For one thing, Canadian schools have substantially lower tuition for grad students (although that is changing). Faculty research budgets have much lower student stipends as a result, and the grant sizes reflect that. A moderate US grant might be 100k/year; in Canada that would be equivalent to 20k, but support the same research program.

Your application and interviews should demonstrate you understand this. I suggest reading up at the NSERC page, and also the research services page for the university you apply to.

The main grant for new faculty is the Discovery grant. It is a five year grant worth from $20k-50k a year. You are evaluated equally on your ability/experience with highly qualified personnel; your personal ability as a researcher (i.e. CV); and the research proposal. Not holding a Discovery grant is a problem because getting other federal funding depends on this, to some extent. The good news, especially for ECR, is that success rates are relatively high (60-75%). You can expect to prepare this the summer you get hired, for submission by Nov 1.

Your job talk and your research statement should outline some elements of the five page grant proposal you would write. Departments want to see what you would propose, and how able you are to communicate your vision to external readers. It is a 5 year program, so scope your “future work” to that time frame.

I think this is broader advice than Canada-only, but one thing I’ve noticed is that applicants who are just finishing a PhD give more narrowly focused talks. Two things to keep in mind if this describes you. 1. You will be competing with people who have 2-4 years of post-doctoral training, and a corresponding breadth of research, more experience training students. Stretch your talk to show how you have the potential to succeed like that. Conversely, one question about post-docs is often “how independent can they be”. This is particularly true if you come from a big lab with a famous PI. 2. Think like a professor. What grant areas will you target? How would you manage 5 masters/phd students? How will you balance teaching load with research? I don’t think you need to feel uncompetitive: we invited you for a reason. But the onsite interview is when we want to see if you are ready for what ~~can be~~ is a very demanding job.

Funding

Engagement with industry The federal government has been a big supporter of industry partnerships recently, although the programs were recently overhauled. This typically means that if you have an industry partner with skin in the game, i.e. financial assistance, you have an excellent chance of obtaining government matching funds. Conversely, if you prefer pure research with no immediate outcomes, finding funds might be more difficult. There are very few large granting agencies. There is no equivalent to DARPA/IARPA, DHS, DOD, DOE funding in Canada; those projects would work with specific people at specific agencies to secure one-off funding. In BC, nearly all grants would come via NSERC programs, or MITACS matching. There are also Networks Centres of Excellence such as MEOPAR that allocate funds in targeted areas (these are being phased out). There is also a recent Defence initiative, IDEaS, to increase Canadian funding for research with defence applications. Finally, there were industry-led superclusters announced, but who/what gets funding is still very unclear. It seems to focus mostly on subsidies for industry-led research.

In general, I would say finding funding is much more individualized and distributed than in the States. There are plenty of places to find adequate funding (again, a student probably only costs 20-25k a year), but how to get it is much less clear than a DARPA BAA program. A cynic might say this is because funding announcements are more closely tied to electioneering.

Summer students and internships. We have a similar program to the REU approach, called USRAs. These are government matching for student research semesters. Again, these are allocated on a per-institution basis (bigger places get more).

We have an excellent grid/HPC/cloud computing infrastructure, ComputeCanada. They conduct yearly resource allocation competitions. I don’t know what the success rates are.

For large infrastructure, e.g., robots, 3d printers, tabletop displays, quantum computers, the Canadian Foundation for Innovation holds annual competitions for this, but success rates are fairly low.

Tenure

Well, more to come from me on this one, but my general sense is that tenure is more collaborative and mentoring than many US places. I don’t think there is the equivalent of “didn’t get the NSF grant, didn’t get tenure”, or “didn’t get 1 million in funding, didn’t get tenure”. That said, standards are just as high as US Tier 1 schools; we just want to help you achieve them. We’re friendly, eh?

Specialty hiring

CRCs. You may be in the enviable position of applying for a Canada Research Chair. These are a nationwide funding mechanism for research positions. We have Tier 1 (7 years, renewable, senior) and Tier 2 (5 years, renewable, junior/emerging). They typically come with higher salary and teaching relief. Each university gets a quota from the federal government. The approval process is a bit more involved. In addition to approval from (department-faculty/dean-VP academic/provost), you will have your application submitted to the federal government, wherein the case will be made that you are uniquely qualified, amazing, etc. This is almost never turned down, from what I can tell, but could be. In particular, the federal government has a strong desire to see equal allocations of these CRCs to male and female candidates.

Requirement for hiring Canadians

Departments are usually required to prefer Canadians over non-Canadians, for immigration purposes. This means that of two totally equivalent candidates, the Canadian citizen or permanent resident would be made an offer. If you are a PR/citizen, or applying for PR, that is worth highlighting somewhere.

Immigration is easy

I can’t speak from experience, but my understanding is that immigration to Canada as a permanent resident, and eventual citizenship, is much easier than the US process (with which I do have experience). This is also true for immediate family (spouse/children). In some cases, permanent residency is possible in months, not years.

Salary and benefits

In general Canada pays less salary. Keep in mind that it is a 1 year salary, not 9 months. Most Canadian schools don’t have the concept of a summer salary. At UVic, we operate on 3 equal semesters, and allocate a research semester where you would like (subject to teaching needs of course).

The CRA survey has more useful information. Health care is provincially funded from your taxes, so don’t expect to lose 500-600$ a month to health premiums. From working in the US, even being a well-paid employee at a great employer, there was a significant cost (mental and financial) in understanding yearly plan changes—even without chronic conditions.

In most places, faculty are unionized or quasi-unionized. This means you fall into a grid, and your salary increases will be based on a formula in the collective agreement. You can probably look this up online for each institution you visit. Hint: you want to move up the grid as much as possible before you start the job. So Prof. Le Goues’s advice on startup over salary might change, since your salary will be the baseline for future percentage increases.

Summary

I would sum up by saying Canada is an awesome place to do research, and I hope you apply to Canadian universities! Especially mine!

Resources

Maclean’s Guide to Canadian Universities: This is the Canadian equivalent (in all respects, good and bad) to US News and World Report. It divides universities into medical/doctoral, comprehensive, and primarily undergrad. Canada does not have the same diversity of higher education as the US—for example, there are few private institutions here. The main division for research is whether the school has a medical school or not, as med schools are tightly controlled (public health care dictates number of seats), and med schools tend to accumulate massive amounts of research funding. My school is categorized as a comprehensive, but I wouldn’t say this equates to “more teaching”.
NSERC: The main engineering funding body, similar to NSF.
Taulbee Survey: Various stats on academic CS jobs, including some from Canada.

Bayesian Hierarchical Modeling in Software Engineering

Sat, 16 Jun 2018 07:00:00 GMT

At MSR18 in Gothenburg, I presented my work on using Bayesian inference to set software metrics thresholds. We want to set thresholds because for many software metrics, like coupling between objects (CBO), a single, global metric value (“all software objects with this value or below are maintainable”) is nonsensical, if only because programming language choice is important. So we want to tailor threshold values to some contextually relevant value (e.g., perhaps all Java code should be X or less). The question I answered is how we do the tailoring, given some contextual features.

In this case, the contextual features I was looking at were Java files categorized by architectural role in the Spring framework, derived from a paper by Mauricio Aniche and others.

The bottom line of this new approach is that we can use Bayesian inference and hierarchical models to perform a simple regression and get a 50% drop in root mean squared error (RMSE).

The more interesting conclusion from a methodology point of view is that hierarchical modeling with Bayesian inference fit software engineering data very well, and is straightforward to model given modern probabilistic programming languages. I followed a similar approach to the one detailed in this blog post on hierarchical modelling with PyStan. There are two key ideas.

Use a combination of global data as regularization over the detailed, local model. In this case, the global data comes from all the different Java projects. The local model is the specific coupling metrics for one particular project. The effect is to allow each individual project’s slope and intercept values to vary by some amount dictated by the global values.
We model this using a Bayesian approach, which means we will condition our likelihood based on the data we are observing, and use that to estimate a posterior distribution. I really like this approach because it forces you to think about your prior distribution (what should the metrics distribution be?), and also because it produces a posterior distribution, and not a single point estimate. A distribution is much more flexible for making inferences than a point estimate (e.g., we could say “set the threshold where < 75% of the probability mass lies”).

This was also a fun project to do from an open science approach. I used Jupyter as my notebook throughout the project, and my notebook and the paper/presentation are both available.

Seven Principles of Effective Documentation

Mon, 17 Jul 2017 07:00:00 GMT

There has recently been more discussion about software documentation (or perhaps that’s because I only see what I’m interested in… hard to say). At any rate, it seems a lot of discussion inevitably breaks down to “what tool will solve my documentation problems” (e.g., this thread). Others have tried to “fix” UML by proposing new modeling approaches (forgetting, perhaps, that the unified modeling language was spurred by exactly this proliferation of diagram notations).

I don’t think tools, or formats, or templates, or modeling languages, will ever solve the problem you have. But what will help is to put some people in charge of the project who can think clearly and knowledgeably about what exactly is needed. To that end, the most effective advice (yet perhaps least immediately actionable, as compared to “buy X”) are the principles of effective documentation, originally from the Parnas and Clements paper “A Rational Design Process: How and Why to Fake It” ¹. Its more concrete form is published in the SEI text “Documenting Software Architectures”, and is part of the introduction to the course we teach.

Write from the reader’s point of view (and know who your readers are). Probably also the first rule of good technical writing. You need to understand who will use the documentation: management, downstream developers, other contractors, you (one, five, ten, twenty years from now), government program offices, etc.
Avoid unnecessary repetition. This is easier in the wiki/hyperlink era. Sometimes repeating key figures is helpful, especially if a particular section may be read in isolation.
Avoid ambiguity (and explain your notation). This is where most modeling discussions end up for me. Pick whatever language works for you, but explain its syntax (and semantics where necessary). It may be as simple as a key that says “UML 2.0 activity diagram”. There’s nothing worse, or more common, than a diagram with a mix of colors and shapes that no one understands who was not in the room. And keep in mind Martin Fowler’s helpful breakdown of UML Modes.
Use a standard organization. Templates make it easier to find information.
Record rationale. You might be able to recapture the “design” from the code, or tests, but you have little hope of understanding why certain architectural approaches were chosen if no one wrote the reasoning down. Most of the essays in the book “Architecture of Open Source Applications” capture rationale, at least in hindsight (which is fine, after all we are “faking” a rational design process).
Keep docs current, but not too current. I would interpret this nowadays as “have a release schedule” and make it clear what portions of the docs reflect “as-is” vs “to-be”. It’s also about which portions of the software you need to document. Low-level implementation decisions are only necessary if they have some impact on the important qualities of the system (otherwise, they aren’t architectural, and don’t need to be documented!)
Review the documentation. Like any software artifact, you can’t know how well documentation “works” for your audience until you test it. That means understanding if stakeholder questions can be answered with the documentation (e.g., “can I see how the system handles authentication”).

The other “principle” we mention, but is not part of this list, is “if it isn’t needed, don’t do it”. Documentation (good, up to date documentation certainly) has a cost. Only incur that cost if you are going to realize benefit from it (and naturally, the cost is the upfront cost + maintenance cost).

I think most of the tooling discussions fall from these principles/rules. For example, Daniel Procida gave a presentation on “4 Elements of Successful Docs”, recommends docs have how-to guides, tutorials, discussions, and reference content. This maps to writing for the reader, and recording rationale.

In this perspective, a lot of discussions can be better grounded. For example, “avoid ambiguity” motivates the use of something like UML. The UML is useful at least as the “most common” notation people are aware of (and has many many reference books). Using Markdown to keep things current with your build system can help to keep things current. Confluence or other wikis help with organization and avoiding repetition. And so on.

As a good researcher, I should mention this topic greatly interests me. If you want to collaborate, get in touch! I think there’s a lot of room for interesting contributions in making documentation better.

Footnotes

Parnas and Clements, Trans. Software Eng. 12(2), 1986. http://web.engr.oregonstate.edu/~digd/courses/cs361_W15/docs/IEEE86_Parnas_Clement.pdf↩︎

Moving to UVic

Mon, 26 Jun 2017 07:00:00 GMT

I’m excited to announce I will be taking up a position this fall as a tenure-track faculty member in the Department of Computer Science at the University of Victoria.

This is a great opportunity to work with some of the top software engineering faculty in the world, in one of the best cities in the world (although I’m biased, as it is my hometown :). Victoria is at the forefront of the startup scene, just a few hours from Vancouver, Seattle, and direct flights to the Valley (not Abbotsford, the other one).

If you are interested in doing research with me please take a look at my ‘prospective students’ page. Uvic, and Canada, welcome people of all backgrounds. See our study permit process, and the federal Express Entry program for post-graduation immigration opportunities.

I want to thank my colleagues and co-workers at Carnegie Mellon and the Software Engineering Institute for a great four years. I’ve learned a lot about software architecture, large-scale software projects in government agencies, and more US military acronyms than I care to admit. I’ll also really miss Pittsburgh, which has been wonderfully welcoming and a pleasant surprise. It’s easy to see America through a particular perspective these days, but many—most—Americans are awesome and caring people.

You can continue to reach me via Twitter, @neilernst, via this web page, or via email, neil@neilernst.net.

Visual Abstract attempt

Thu, 11 May 2017 07:00:00 GMT

In response to Greg Wilson’s challenge, I did a quick attempt at a Visual Abstract for a recent paper.

Visual Abstract: Identifying Design Rules in Static Analysis Tools. Evaluated 464 rules, 19% design related, 67% easy to classify.

I think it turned out ok; it captures the core findings and presumably will prompt people to look at the paper. I’ve attached the Keynote slide I used in a repo on Github.

A few comments:

Graphic design is a skill you need to work on (duh). Even with this template, I don’t think it is super compelling. I just used out of the box icons.
The footer kinda loses relevance when you don’t have a big name JOURNAL behind you. Something else can go there. I used the conference logo, but maybe some logo for your lab.
I wanted a URL to point to the full paper, but a bar code might be better. If anyone uses that anymore.
For qualitative papers, and maybe software engineering in general, the “outcome data” at the bottom is more difficult to come up with. I don’t know if I can easily pull three nuggets of improvement for each paper (but I hear Greg’s baritone susurration saying “well yes, that’s part of the problem”)
As someone on Twitter pointed out, these are intrinsically visual and thus not accessible to the visually impaired. I do think they help the “academicese-impaired”, but each time one of these is used, I would hope a non-visual summary is also presented. Pulling the text together shouldn’t be too hard. I’ve had a go in the “alt” text above.
Doing a whole batch of these (say for an ICSE track) would be a fair bit of work. Presumably you could pick papers you’ve had to read anyway (and cared about). But summarizing the contributions is not so simple (for me, anyway). Again, perhaps that points to a wider problem.

I’ve noticed more and more papers calling out contributions in special boxes, and bulleted lists in the introduction. I think this is great. One of my pet peeves is a reviewer who points out some trivial English error, but tolerates the total incoherence of the introduction.

Related to this is a visual portrayal of the methodology. This happens a lot in medicine, where lots of experiments are conducted and explaining complex cross-over designs is important. But you can see a similar example in Borg et al. 2017, below:

Sample workflow cartoon

This explains how the study was conducted. Again, anything that can explain what is going on for a busy reviewer is helpful. Remember, in 2017 FSE reviewers seem to have reviewed 25-28 papers each. Expecting them to spend more than 30-45 minutes on each one is unrealistic. So make your time with them count!

On Active Learning in Software Engineering

Wed, 10 May 2017 07:00:00 GMT

I’ve read 2 papers recently (references) about using active learning to improve classification for software engineering.

Active Learning is the idea that, if we have a feature space with instances, for the classification task of labeling an instance either “A” or “B”, there are clumps of points that clearly are As, and another clump that are clearly Bs. In between, however, the boundary is unclear. Some instances will be equidistant from both centers of mass, and the classifier will struggle to properly classify it. Active Learning (AL) quite simply picks what are hopefully the “most useful” (for improving the classifier) points for human labeling, which reduces the amount of (possibly) redundant labeling humans have to do (since the labeling time is the most costly part of creating a classifier).

It turns out that so far this active learning approach for software data is not too successful. I think there are 2 main reasons.

1) We don’t have much data. Classifiers do better when they see more instances. In the 2 studies I read, the number of unclassified instances was measured in tens of thousands. Contrast this with most image recognition or information retrieval applications, which have orders of magnitude more training data. In the case of the ImageNet context, they also label substantially more instances (e.g. 150,000 labeled with 10 labels).

2) More importantly, I think the task is fundamentally difficult. The Borg paper makes this clear; when human raters themselves cannot agree on a label, it probably won’t work any better with active learning. I think this is because some problems have fuzzy label boundaries for non-core feature reasons, while many software concepts are innately (ontologically) unclear. Think about labeling photos of house numbers. I’m pretty confident that any two humans would agree that instance X is a house number. We have clear and simple criteria for what a “number” is (intensionally and extensionally). The reason the classifier struggles is because of non-intensional properties of the data itself: perhaps a tree obscures the top of the 1, or a shadow is partly on the lower digits. In software data, that problem exists as well (e.g. someone talking about an old version of Rails). But for labeling an utterance as technical debt, or a performance bug, or a usability concern, there seem to be broad disagreements on the core discriminating features. If we talk about paradigms like distributed computing, is that an “architectural” discussion? What about a bug that results from not understanding an RPC service?

We’ve looked at some of this in our latest research on design rules. We found that while a majority of static analysis/code checker rules can be clearly distinguished as either design or not, there remains this stubborn middle tier that resist easy categorization. We think you can still make progress (after all, these rules may not even fire on your project). But it would be satisfying to have a more repeatable analysis.

Categorizing rules

The conclusion of the Borg paper really seems useful for future work here. One, they say that the AL approach helps to pull out controversial instances, that then help build rater consensus. Two, using bootstrapping with more positive examples helps the AL improve its accuracy (in other words, there is still benefit to grinding out the labels manually – no free lunch, sorry!).

References

N. Van Houdnos, S. Moon, D. French, Brian Lindauer, P. Jansen, J. Carbonell, C. Hines, W. Casey. “Human-Computer Decision Systems for Cybersecurity”, Presentation. https://resources.sei.cmu.edu/asset_files/Presentation/2016_017_001_474277.pdf
Borg, M., Lennerstad, I., Ros, R., Bjarnasson, E. “On Using Active Learning and Self-Training When Mining Performance Discussions on Stack Overflow”, arXiv:1705.02395v1. Preprint of paper accepted for the Proc. of the 21st International Conference on Evaluation and Assessment in Software Engineering, 2017.

Thoughts on Amy Ko’s “PL as …” keynote

Fri, 04 Nov 2016 07:00:00 GMT

Amy Ko had a great presentation at a conference on programming languages (PL), that he also video taped for a wider audience.

I’d always thought of PL as “things”, or material. The program was the interesting bit; the PL was the material that constructed it. But I guess as I extend that metaphor, it seems clear that it falls short. Cedar, for instance, is a material, and the building is the interesting thing. But cedar has intrinsic properties as well. You can bend it without cutting it to make a box.

bentwood box

It weathers beautifully due to the oils it contains, so you can make shingles and siding out of it. It burns very easily. If we extend the definition of cedar to include the tree itself, we can make canoes, rope, hats, spears, and so on. Cedar was a crucial part of Northwest aboriginal culture.

And so in translating that thought back to the PL world, it seems clear that PL also has this. The syntax of Java vs C in ease of learning the language. The ecosystem of Javascript vs Clojure in building apps. The culture of web programming languages vs scientific programming languages. And so on.

The one quibble I have—clarification, to be more accurate—is the slide on definitions, values and community weighting toward the end. The implication → goes one way for a reason. That is, because we chose to focus on PL as math, we have, as a result, a lot of focus on the value of certainty. But that isn’t to say because we value certainty, we focus on PL as math. In fact the reason for ‘valuing’ this value are complex and systemic: most CS departments started with math graduates, most CS departments still contain math-heavy disciplines like theory and machine learning, we want to show correctness and soundness, and math is the way to do it. So it isn’t to say that PL community does not value equity, or the others, but rather that equity is hard to prove, and PL academics function in a math world.

Finally, Amy had this great list of what form PL takes, and associated research questions, which I’ve shamelessly duplicated here so people can more easily copy it.

Programming languages as ….

power
- what responsibilities does knowing PL come with?
- how does PL corrupt?
- should democracies distribute it?
design
- what tradeoffs are made?
- what is a “good” PL design process?
- how can we rapidly prototype PL?
- what are PL aesthetics?
media
- what message is enabled by PL?
- how does PL facilitate expression?
notation
- what can PL not model?
- what info can PL not share?
- what makes a PL learnable?
interfaces
- how can PL convey what is possible?
- how do we make PL usable?
- what feedback must a PL provide?
math
- what does PL correctness mean
- how to prove PL correct?
- what in PL is equivalent?
language
- do PL have ambiguities?
- do PL shape how we computationally think?
communication
- Should PL model developer intent?
- should PL express intent to developers?
glue
- what makes a PL a good adhesive?
- what materials do PL adhere to?
legalese
- who should interpret code legally?
  - are programmers lawyers?
infrastructure
- how do PL decay?
- how should we maintain PL?
- is PL a public good?
path
- should gov’t create the path?
- how do we make PL equitable?
- who should go down this path?

Day Hikes

Thu, 08 Sep 2016 07:00:00 GMT

A list of long, high vertical day hikes I have done and wish to do. I think looking back the most common theme to all of them was “bring more water”.

Hike Name	Length	Elevation Gain	Elevation	Notes
Black Tusk	29km/18mi	1740 m/5700’	7600’	Highly exposed last section up remaining volcanic core. See details
Lions	16km/10mi	1280m/4200’	5427’	I remember when we did it in 2001 or so there being little to no trail markers. Trail page
Triple Crown (Finlayson, Work, Gowlland Range )	? Maybe 10mi/16km	? Probably around 1000m/3200’	1375’	I couldn’t find details on this, but the gist is to hike up Finlayson, go down the backside, then back up along the Gowlland Tod ridge above the inlet, then up Mt Work at the end.
Half Dome	26km/16mi	1450m/4800’	8839’	Cables! Now need permits to go. Trail page
Mt St Helens	16km/10mi	1370m/4500’	8366’	Permit needed. Painful boulder climbing and loose scree from middle to end. Insanely exposed rim of crater. Trail page
Golden Ears	24km/15mi	1500m/4900’	5630’	Some freaking jackrabbit passed us going up and was heading back down before we summited. Even in June had plenty of snow that made the top risky without ice axes and/or crampons. Trail page
Mt Thar	No idea. Took about 6 hrs.	Yak Mtn is listed as 1640’ for prominence, so I’d guess no more than 1200’ for Thar.	Yak: 6693’	Trip report This one is in 103 Hikes in the SW BC, highly recommended.
Monte Bondone, Trentino, IT	About 12 hours.	Trento centro is 636’, so nearly 6500’ of elevation gain (seems high to me)…	7150’	I started in Vela where my flat was. The Italian Alpine club chapter - S.A.T. - has a good trails site and maintains the helpful markers. You can take a cable car back to the river from Sopramonte to shave a few minutes off.
Mt San Jacinto	30km/19mi	1700m/5600’	10,833’	TBD! Trail page

Columbus’s Heilmeyer Catechism

Tue, 19 Jul 2016 07:00:00 GMT

I have no idea if Columbus had to have his “India Expedition” proposal peer-reviewed, but here is my interpretation of it according to the ever-popular Heilmeyer catechism.

What are you trying to do

I would like to sail to India and bring back gold and spices for the Crown of Spain.

How is it done today

Currently no one has sailed west. Everyone takes the trip east, around the Cape of Good Hope. Most of these people think the world is flat and that heading west would cause us to fall into space.

What’s new in your approach

I will head west. I’m pretty sure the Earth is round, and we can reach India from the west in less time

Who cares?

A faster trading route to India, monopolized by our mapping skills, would generate 1 million Real a month for the royal treasury.

Risks

There is a lot unknown about the middle of the Atlantic, including rumors from the Vikings that some colder land is in between. My math may be off in calculating the circumference of the Earth. I am not a great sailor. We may encounter fierce alien tribes.

Cost and schedule

For 1000 Real we can outfit four boats with sailors, supplies, and weapons (note: of course Columbus would never get all he requested, either!). We plan on a quick 1 year voyage to India, and one more year back.

Checkpoints for success

We plan to see India after 2000 nautical miles of sailing. While measuring distance at sea is currently impossible, after 3 months we expect to sight land. If not, we will head back.

On SCAM’s new “Engineering Track”

Fri, 22 Apr 2016 07:00:00 GMT

This year SCAM, the Working Conference on Source Code Analysis and Manipulation (located in Raleigh, NC, Oct 2–3 2016) includes an engineering track, as described here. The CFP is available here. This track will be co-chaired by myself and Jurgen Vinju. In this post I want to briefly explain what an engineering track is and why you should submit to it! ¹

Purpose

Software is an engineering discipline, for most definitions of ‘engineering’. My definition, for what it’s worth, includes the notion that it involves working on real systems that do things, and to that end research in software engineering can be seen as a design science, where the chief task is to “design and investigate artifacts in context”.² This implies that for the most part researchers in this space need to concern themselves with pragmatics: how will this work at scale? How do people do this now? What data can we use that has practical relevance?

However, traditional conference submissions (the dominant form of scholarly dissemination in Computer Science) tend to follow the 10page, aim/motivation/observations/conclusions framework, often full of Greek letters and references to obscure papers. Whether this is a good way to advance the engineering discipline is debatable, but in any event, such a submission tends to ignore two things: one, how people dealing with problems in practice can use the work; two, the artefacts related to the scientific endeavor (the ‘treatment’ in Wieringa’s design science parlance). While improving, too many research papers still do not include tool downloads, fail to show practical impact, or fail to provide for data download to replicate the findings.

Our engineering track is out to improve the practical, engineering-relevant side of source code analysis and manipulation.

Submission types

This track has evolved from the tool track of previous SCAMs. As David mentions,

This is not to discourage tool paper submissions–they will now fall into the Engineering Track–but to broaden the scope of the tools track … for those of you that invest blood, sweat, and tears into tooling, infrastructure, or realistic field studies SCAM recognizes the value of this work, which is not always pure research, and we are designing this track to attract that type of work.

What artefacts qualify as “engineering track” material (from CFP)?

tools: software (or hardware!) programs that facilitate SCAMmy activities.
libraries: reusable API-enabled frameworks for the above.
infrastructure: while libraries are purely software, infrastructure can include projects that provide/facilitate access to data and analysis.
data: reusable datasets for other researchers to replicated and innovate with.
real world studies enabled by these advances. Here the focus is on how the {tool,infrastructure, etc} enabled the study, and not so much the study itself. Novelty of the research question is less important than the engineering challenges faced in the study.

Some of the criteria the PC will look at includes:

How well motivated are the use cases (and hence the existence) of the engineering work. Here we are asking whether this solves some realistic and ongoing challenge in practice. However, we are open to brilliant new ideas that scratch a previously unknown itch³.
Relate the engineering project to earlier work. All engineering is a product of lessons learned, so including some narrative about how this particular submission has evolved is useful (e.g., what paths turned out to be dead ends).

Optionally (and encouraged):

Any empirical results or user feedback is welcome.
Contain the URL of a website where the tool/library/data etcetera can be downloaded, together with example data and installation guidelines, preferably but not necessarily open source
Contain the URL to a video demonstrating the usage of the contribution.

Ideally one would submit and make public the artifacts and required steps to create it. However, realistically people may not be able to (given IP rules, NDAs, etc.).

Program Committee

Building on SCAM general chair David Shepherd’s excellent blog post on industry tracks, both Jurgen and I are committed to a program committee (PC) that has strong industry representation. That doesn’t mean only people who work in industry, but at least means people who have some sense of the engineering challenges of building real-world software. The purpose is to vet submissions against the standards industry holds: not necessarily will work right away at scale in mission critical systems, but that there is some promise of that.

Incidentally, if you are a former academic now practicing, or just a research-minded practitioner, I would love to hear from you for future PCs. We need more folks straddling the two cultures.

Footnotes

Incidentally, I agree with and support the ICSME co-chairs’ statement on the anti-LGBT legislation in North Carolina.↩︎
That definition is from Roel Wieringa’s excellent design science book.↩︎
can itches be unknown? I may be mixing metaphors.↩︎
Incidentally, I am not a big fan of the term “industry” or “industrial”. Maybe it is my location in Pittsburgh, but it conjures up steel mills and heavy machinery. The other problem is the term “industry” is used as a catch-all for a widely different set of folks, from a 2 person startup to a Fortune 500 company or DOD agency. I prefer research vs practice. Not a huge fan of “real-world” either, since we all live in the real world. Presumably.↩︎

On Using Open Data in Software Engineering

Mon, 07 Mar 2016 08:00:00 GMT

I recently reviewed data showcase papers for the Mining Software Repositories Conference, and I’m co-chair of the Engineering track (subsumes datasets, tools, approaches) for the SCAM conference¹. I’ve worked with a number of different datasets (both openly available and closed) for my research efforts. This caused me to do some reflection on the nature of empirical data in SE.

We’ve had a nice increase in the amount of data available for researchers to explore, and most recently, the amount of well-constructed, easily understandable and accessible datasets – like the GHTorrent tool – is impressive (traditionally it has been difficult to get any credit for creating these resources). I think it is a hugely beneficial effort for our efforts to create a well-grounded, empirical basis for software engineering (as opposed to pie in the sky theorizing).

I have two concerns that threaten this idyll.

Concern 1: long term availability and replication

Other fields, primarily psychology and economics, have begun painful self-examinations of well-cherished, supposedly well-grounded results. In many cases replication of these findings is very difficult. What worries me most is that in empirical SE, we aren’t even in a position to attempt to replicate key findings, because (among other concerns) the original datasets aren’t accessible. This is most often because studying companies is usually caveated with “but you can’t tell anyone it was us”; other problems include datasets that are ‘live’ and self-modifying (for example, publishing a study on foul language in commit histories might mean those histories are modified), or just out of date entirely.

I propose two solutions. The first is to use long-term archiving approaches. This means no personal university website, no special database formats, no ‘email me for access’. The open access/open research communities have been tackling this problem for several years, and good solutions exist (like identifiers from DataCite or hosting like with Felienne Hermans’s Figshare). We have software specific repositories as well, of which the Promise repo is the one I know best. The 2016 CHASE workshop encouraged authors to stick papers on ArXiv, but the data should be similarly archived. One problem here is how to store data that institutional review boards might insist be destroyed.

The second is how we cite the data when we publish it in a journal or conference paper. I’ve seen URLs in the body of the paper, standard reference lists at the end of the paper, footnotes, and comments in the author’s notes. We need to standardize this so readers can quickly and easily find the URL to the dataset. My suggestion is to add a separate heading, like the useless ACM keywords, referencing any new datasets used/available, immediately after the abstract.

Concern 2: new ethical questions

My second concern is one I only recently realized (somewhat shamefully), and this is the ethics of publishing data. This was prompted by the GHtorrent debate. There the central objection from (some) developers was that their email address was accessible, perhaps more easily accessible than on Github. Email is an obvious one, but beyond that I think we need to acknowledge that the mental model of a developer pushing code to Github (or wherever) is not one of public visibility. The silly comments they write (“huge HACK!”) could be career-limiting, and pushing that dataset out into the world is a question we have to tackle. My view would be to explicitly include terms of use in data you make available, and Georgios is obscuring identities (e.g. with a SHA1 hash) when asked.

More broadly, the world of SE analytics opens up the chance that the data could be used for purposes we might not understand. For instance, one might show that a particular Chromium developer has an unusually high number of bugs. As researchers we might understand the nuances and limitations of how the data was collected². Managers might not understand the limitations, however. Furthermore, any data one collects is subject to subpoena and other legal requests, so in many cases immediate destruction is the best course (see the debate on sociology notes).

The big risk from NOT considering this is losing the goodwill of a community that only tangentially understands what software engineering research is. I’m not advocating for doing whatever developers tell you to³; I am saying that not considering these issues risks antagonizing the people we are trying to help.

Footnotes

Submit early, submit often!↩︎
Jorge Aranda’s ‘secret life of bugs’ paper sheds light on this.↩︎
You put your freaking email on Github! What did you expect!? (sigh)↩︎

The Marginal Utility of Testing/Refactoring/Thinking

Thu, 21 Jan 2016 08:00:00 GMT

Andy Zaidman had an interesting presentation about test analytics. The takeaway for me was that a) people overestimate their unit test engineering (estimate: 50%, reality, 25%). But b) the real issue is convincing a developer that this unit test will improve the quality of the code. In other words, like with technical debt, or refactoring, or commenting, the marginal utility of adding a test is perceived to be low (and of course the cost is seen as high). Each new individual test adds nothing to the immediate benefit (with some exceptions if one is following strict TDD). And yet each one requires switching from the mental model of the program to the one of Junit frameworks and test harnesses.

The issue is not whether testing is good or bad, but rather, which testing is most useful. It seems unlikely to me that the value of individual tests is normally distributed but rather power-law form (i.e., that there are a very few extremely high value tests). And this isn’t just about testing; indeed, most activities with delayed payoff—refactoring, documenting, architecting—likely exhibit the same problem. It is hard to convince people to invest in such activities without giving them concrete proof it is valuable. You just have to look at the default examples for Cucumber, for instance, to see that the vast majority are trivial and easily grasped without any of the tests. Similarly, “code smells are bad”, but bad might just mean they look nasty, while having little to do with the underlying effectiveness of the code. It isn’t technical debt if it never causes a problem. It isn’t a bug if it isn’t worth fixing it.

In new work we are starting with Tim Menzies, we are trying to understand the inflection point beyond which your decisions add little incremental value (i.e., stop adding more tests). The good news is this is easy to spot in hindsight; the challenge is to take those lessons and determine this before doing hours of pointless work. The direction we are taking is to try and capture the common patterns the key decisions share (in the testing example, perhaps this is bounds testing). Ultimately, we hope to provide advice to developers as to when the marginal utility falls below a threshold (i.e., stop testing!)

The other point is the over-reliance of software engineering on hoary folklore. Things like “some developers are 10x as productive”, or “80% of bugs occur in requirements”, tend to be statements that are derived from a single study, conducted in 1985, on 3 large scale defense projects, but have somehow made their way down the years to become canon. Ours is not the only field to suffer from this, of course. But when capable developers refuse to pay 200$ a year to join the IEEE Digital Library, it seems to demonstrate a firm commitment to ignorance.

A Model of Software Quality Checks

Tue, 22 Dec 2015 08:00:00 GMT

Software quality can be automatically checked by tools like SonarQube, CAST, FindBugs, Coverity, etc. But often these tools encompass several different classes of checks on quality. I propose the following hierarchy to organize these rules.

Level 0: Syntax quality

Focus: code that ‘runs’. Level 0 means a compiler or interpreter’s components (parsers, lexers, intermediate forms) assess syntax correctness. Level 0 because (clearly) without proper syntax nothing is getting done.

Level 1: Lint-free

Focus: Code that respects obvious sources of problem.

No warnings occur if all possible flags are turned on in the compiler. These warnings tend to be close to syntax in their complexity. For example, technically a fall through switch statement is possible in Java, but there is the -Xlint:fallthrough tag to catch this. Often IDEs such as Eclipse will flag these automatically with warning icons.

Level 2: Good code

Focus: Code conforms to commonly accepted best practices for that language.

E.g., for Java, visibility modifiers are suitable, in C, no buffer overflows, memory is released appropriately. Some cross-language practices apply: documentation, unit tests exist, and so on. Many of the quality analysis tools like FIndBugs operate at this level. CWEs are another example. I also place dependency analysis approaches here (perhaps controversially). It also pops up in the next level (e.g., properly using interfaces in Java).

Level 3: Paradigmatic

Focus: writing code that is maintainable, understandable, and performant with respect to its runtime environment.

Would someone writing object-oriented, functional, embedded, etc. code consider this reasonable? Includes principles like SOLID, functional side effects, memory management, distributed code demonstrates awareness of fundamentals of distributed computing. Also includes proper use of language idioms e.g. proper use of Javascript callbacks, Ruby blocks, etc. We might also classify new language features here – the use of generics in Java 7 comes to mind. Essentially, if you did a peer review with a language guru (Odersky for Scala, say), would they have a ‘better way’ to do it? (Perl notwithstanding…)

Level 4: Well-designed

Focus: building systems that respect appropriate (known at the time) usage scenarios.

Given the knowledge available, the code is architecturally appropriate for the quality attribute requirement (QAR) applicable. E.g., modular, performant, secure. The key here is understanding the relevant QARs. Examples include reflexion models (like ArchJava), conformance checking (e.g. Dicto), library analysis (e.g., for license issues, for currency).

Outcome

A few things become clearer when we view software quality with this approach.

First, I think that quality checks become more useful as you move ‘up’ (0→4) in the hierarchy. That is, I’d rather know that I have a serious design problem than a code quality problem.

Second, unfortunately, it seems much harder to design truly automated checks at the higher levels. This is why we have a lot of manual architecture analysis but leave code quality to tools.

Third, our rules get more context-specific as we move up the hierarchy. I.e., in order to properly check paradigmaticness¹, I need to know your choice of programming language and possibly your problem domain properties. To properly do design validation, I need to know what qualities are important to you: performance? availability? That, I think, is partly what makes these levels more useful.

Other hierarchies

The one I’m most familiar with is from Jean-Louis Letouzey. He proposed the SQALE quality model, and his central insight is that some qualities precede others: you must have maintainable code before having performant code, or testable code before secure code.

EDIT [1/6/16]: somehow I forgot this CAST diagram showing different levels of analysis, very similar to mine. They also claim that the ‘system level’ (my design level) is the place where architecture is checked.

EDIT [9/6/17]: also this model from @stoerrle. It also has this notion of abstraction and progression, using

SYNTAX At the one end of the spectrum, there are syntax level issues, such as confusing the tokens “=” with “==”, and preference of language constructs (e.g., avoiding unsafe constructs, default-switch-cases and so on).

PRAGMATICS One step further up, small-scale pragmatic issues like identifier naming, indentation, and simple structural metrics like cyclomatic complexity.

UNITS The next level addresses complete units of code (often a class or module), and considers its overall structure, unit-level metrics (e.g., method/class length) correctness and completeness.

ARCHITECTURE Finally, there is a level of architecture that is concerned with the structure and interrelation of units, e.g., it considers depth of inheritance trees, design patterns, architectural compliance, and other system-level properties.

Footnotes

I’m not sure how to ‘noun’ this adjective …↩︎

Semantic Werks

Scientific Software Development

Overview

My Journey

Early Work

What Makes Scientific Software Developers Different?

Testing Scientific Software

Theories and Models

Types of models/theories

Domains of Knowledge

Tech Debt in Scientific Software

Tech Debt and External Dependencies

Scientific Software in Canada

Things I’d like to know more about

To Read

Initiatives

Venues

Example Projects

Climate

Astronomy

Bio

Glossary

References

Footnotes

The PI as COO

Metrics

Why Care About Operations

My approach

The Scientific Method 2021 edition

Making Models Accessible

Model Usability

Accessing Model Outputs: Digital Twins

Finding Bugs in Notebooks

On Bots in Software Engineering

Bots as assistants

The state of bots

Tasks

Tasks bots can help with

Bots for TD reduction

REFSQ Panel session on Open Data and RE Education

The Triumvirate of Teaching and Work Life Balance

Running a Mining Challenge Using Kaggle

What Worked

Improvements and Questions

Academic Job Searches—A Canadian Perspective

Understand why you are interested in Canada

Explain/communicate the proposed Discovery grant you would win

Funding

Tenure

Specialty hiring

Requirement for hiring Canadians

Immigration is easy

Salary and benefits

Summary

Resources

Bayesian Hierarchical Modeling in Software Engineering

Seven Principles of Effective Documentation

Footnotes

Moving to UVic

Visual Abstract attempt

On Active Learning in Software Engineering

References

Thoughts on Amy Ko’s “PL as …” keynote

Day Hikes

Columbus’s Heilmeyer Catechism

What are you trying to do

How is it done today

What’s new in your approach

Who cares?

Risks

Cost and schedule

Checkpoints for success

On SCAM’s new “Engineering Track”

Purpose

Submission types

Program Committee

“Related Work”

Footnotes

On Using Open Data in Software Engineering

Concern 1: long term availability and replication