Scientific Software Development
Structured Bibliography
Overview
We’ve been doing a lot of research in my group around scientific software and technical debt, funded by the Sloan Foundation. As part of that work, I’ve written a post about the topic, along with some open research questions. While it is mainly for our own use, perhaps others will find it helpful. It is essentially an extended lit review, loosely organized into themes.
Scientific software, or research software, is software written to support a scientific endeavour, including data collection, modeling and simulation, end user support, and other activities. Examples include codes for modeling climate change, calculating shear forces in buildings, and managing observation time on telescopes.
Other definitions: FAIR for Research Software, the DLR categories, (Hasselbring et al. 2024). The main challenge is that some research software moves: it may start in DLR application class 0 (personal use), but over time end up in class 2 (long-term libraries) or class 3 (safety critical). AstroPy is a good example: it consists of common astronomical operations that many astronomers use, but it started as core functions in individual projects.
My Journey
I worked in spatial analysis in undergrad: as an intern, mapping water rights and goat habitat, and producing maps for the coast guard. In my master’s I worked on ontologies for cancer biology. In my time at the SEI, I was lucky to work on early planning for the US research software sustainment program from the NSF, which introduced me to the US movement for RSEs, with people like Dan Katz and Jeff Carver.
Since joining the university as a professor, I have been intrigued by the challenges of developing complex software, in particular, how design choices influence subsequent problems, i.e., technical debt. This also aligns with a wider sense I have that software for climate modeling is a key capability for our future.
Early Work
The earliest work in RSE is basically the field of numerical analysis and HPC. In some sense, the entire discipline of software engineering is derived from scientific software, using simulations to model nuclear weapons, do weather forecasting, etc.
The field of RSE seems to have started around 2000-2005. I think this coincides with the general availability of software and of the internet and web to connect people. There was a lot of focus at the time on, e.g., computational workflows from Carole Goble. People were working on these issues before (e.g., at large US national nuclear labs such as Los Alamos), but there wasn’t a clear definition of the job area.
Jeff Carver’s first workshop was around 2008.
(Segal 2009) is an early paper at CSCW looking at the socio-cultural dimension. A few interesting things:
- One, the idea that the main issue is to support the science. This is the key requirement.
- There is an early focus in some of these papers on the scientist as end-user programmer. This was a popular notion in the 2000s that I think was on the one hand an utter failure (e.g. VBA), but on the other hand, just natural with the right UI (e.g., Excel formulas). And AI will make this much easier. Greg Wilson said something about how when he was teaching scientists in the 2000s, he struggled to motivate why they should not use Excel. Diane Kelly pushes back on this - of which more below.
- A really detailed ethnography on a single lab management project.
- As with most software projects, this one struggled with team dynamics and power.
A followup is (Segal and Morris 2012). This paper is similar to the preceding one, except for a more general focus on the main differences from conventional development, and the ubiquitous (for the time) focus on Agile.
I should digress here for a moment and say I don’t find software process very interesting. The models - Scrum, Lean, Waterfall, etc. are all highly idealized and in my experience rarely followed in practice. Asking if we should be “agile” is answering the wrong question. You want to deliver things faster and with higher quality, so looking at practices to help that is the key. Anyhoo.
James Herbsleb did some work as well, e.g., (Howison and Herbsleb 2013), looking at incentives.
What Makes Scientific Software Developers Different?
Scientific software seems to have a different context than a lot of other code. For example, it might be continually maintained. It has different testing needs. RSEs therefore have different approaches. Side note: I am not yet convinced scientific software really is different. A lot of the issues–complex domain requirements, performance intensive problems, team composition–are common in other domains as well.
(Pinto, Wiese, and Dias 2018) and (Wiese, Polato, and Pinto 2020) both report on a survey replication on scientific software developers. Nothing jumped out at me - the problems seem mostly similar to normal development; library problems, stable requirements, etc. Surprising to me was that only 5% of the issues seem science related. But I wonder if that is because the questions were not clear. If one asked where the major cost and development effort is spent, or talked to end users …
(Cosden, McHenry, and Katz 2023) is a survey of RSEs that looks at how they get into the field. Most are domain experts (75%) and the rest are CS grads; the two groups have different education challenges.
(Carver et al. 2022) continued this; again the findings are mostly interesting as a catalog of the state of things.
A lot of the difference, from what I can tell, is that for a long time RSE was not a career path, and so those folks were temporary, poorly paid, and not recognized.
Testing Scientific Software
(Carver et al. 2007) did a series of case studies looking at how scientific code was maintained, and why Fortran was so popular. I think this paper may have harmed the field by downplaying some of the complexity involved; it comes across as “this is pretty easy stuff”. But it does study some big projects in the DOD space. Around the same time came (Basili et al. 2008). Some observations from that paper:
- Although HPC is focused on “performance”, for scientists the performance of the code is less interesting than “time to result”, i.e., a publishable outcome. That time spans writing the code, testing it, running the simulation, etc.
- Validation is tricky; outputs are often non-deterministic and probabilistic, inherent in simulating and modelling complex phenomena (see the testing sketch after this list).
- Programs are long-lived, so there is deep scepticism of new tools, as plenty of tools don’t make it to 30 years. Funders authorizing Voyager, the SKA, or CERN’s LHC expect the billions of dollars to be used for decades. The software should be able to match that.
- Programmers love being close to the metal, to keep things speedy.
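To make the validation point concrete, here is a minimal sketch (mine, not the paper’s) of what a tolerance-based test often looks like in practice: instead of asserting exact equality, compare the output against an analytical solution or a trusted reference run within a tolerance, and pin the random seed for any stochastic components. The model function, seed, and tolerances here are hypothetical.

```python
import numpy as np

def simulate_decay(n0, rate, t, rng):
    """Toy stochastic model: exponential decay with small multiplicative noise."""
    noise = rng.normal(1.0, 1e-3, size=t.shape)   # stands in for model stochasticity
    return n0 * np.exp(-rate * t) * noise

def test_decay_matches_analytical_solution():
    rng = np.random.default_rng(42)               # pinned seed makes the run reproducible
    t = np.linspace(0.0, 10.0, 50)
    result = simulate_decay(1000.0, 0.3, t, rng)
    expected = 1000.0 * np.exp(-0.3 * t)          # known closed-form solution
    # Tolerance-based oracle: exact equality would fail on noise and float error.
    np.testing.assert_allclose(result, expected, rtol=5e-3)
```

The oracle is not a bit-exact answer but a scientifically acceptable band around one, which is part of why “time to result” matters more than raw performance.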
A related paper is (Hook and Kelly 2009), which, while focused on mutation testing, has a nice figure showing the ways error can make its way into scientific code.
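As a toy illustration of the mutation-testing idea (my sketch, not Hook and Kelly’s study): a numerical routine, a hand-seeded mutant of it, and two oracles. The loose tolerance fails to “kill” the mutant while the tighter one catches it, which is the weak-oracle problem that makes scientific code hard to test.

```python
import math

def trapezoid_integral(f, a, b, n=1000):
    """Composite trapezoid rule for integrating f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

def trapezoid_integral_mutant(f, a, b, n=1000):
    """Mutant: endpoint weight 0.5 changed to 0.4 (a deliberately seeded fault)."""
    h = (b - a) / n
    total = 0.4 * (f(a) + f(b))   # <-- the mutation
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

exact = math.e - 1.0              # integral of exp(x) over [0, 1]
for impl in (trapezoid_integral, trapezoid_integral_mutant):
    err = abs(impl(math.exp, 0.0, 1.0) - exact)
    print(f"{impl.__name__}: loose oracle passes={err < 1e-1}, tight oracle passes={err < 1e-5}")
```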
TODO: (Babuska and Oden 2004) TODO: (Eisty and Carver 2022)
Theories and Models
(Jay et al. 2020) reports on a workshop on translating scientific theories into code. I feel like this is where my interest is most piqued at the moment.
In addition to addressing the general difficulties common to all software development projects, research software must represent, manipulate, and provide data for complex theoretical constructs. Such a construct may take many forms: an equation, a heuristic, a method, a model; here we encapsulate all of these, and others, in the term theory.
They point out the various places things can go wrong - in the science, in the code, and in the translation.
The whole idea of scientific computing is to test an imperfect theory of the (natural) world. As such, the code and the theory often trade off against each other:
Although it is natural to think (and is most often indeed the case) that one needs to formulate the equations and then apply computational algorithms to obtain the numerical solutions, the formulation of the equations can be affected by the choice of computational method. Cf. the simulations books
This blog post covers some of the early papers here in detail, although it gets the intuition of chaos theory wrong. It distinguishes between “error of measurement” and “error of specification”, looking at the tradeoff of making models more accurate but also more prone to compounding measurement error.
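A minimal numerical illustration (mine, not the post’s) of measurement error compounding: iterate a simple nonlinear model from two starting points that differ only by a tiny “measurement error”. The specification is identical in both runs, yet the trajectories diverge to order one within a few dozen steps.

```python
def logistic_step(x, r=3.9):
    """One step of the logistic map, a standard toy model with sensitive dependence."""
    return r * x * (1.0 - x)

x_true, x_measured = 0.5, 0.5 + 1e-9   # same model, tiny initial measurement error
for step in range(1, 51):
    x_true, x_measured = logistic_step(x_true), logistic_step(x_measured)
    if step % 10 == 0:
        print(f"step {step:2d}: |difference| = {abs(x_true - x_measured):.3e}")
# The 1e-9 gap grows to order 1 well before step 50: a more accurate specification
# does nothing about measurement error compounding under iteration.
```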
Types of models/theories
Mental model of code: (Naur 1985)’s idea of theory.
- The code encapsulates a theory that different people come to different, ideally shared, understandings of.
- Each developer then adds to that theory his/her understanding.
- The theory (embedded in the software) is refined and adapted over time, e.g., with refactoring, new features, bugs, etc.
- The code in turn relies on different theories in the architecture of the hardware, the programming language, and the packages/dependencies (e.g., what access control means).
- The theory might be encoded as a conceptual model, using a model-driven step as well, e.g., Simulink or Matlab code generation.
- There might be an explicit model the software presents for the science it encodes (“climate simulation using a 1km grid”), and another for ancillary functions.
A scientist has a theory, which the code should help to test/validate/confirm (choose your epistemological poison).
The end users have requirements and expectations of the code, as they use it.
Domains of Knowledge
I really liked this insight from an early researcher in RSE, Diane Kelly. (Kelly 2015) summarizes work on nuclear scientists in Canada and identifies knowledge domains. I’ll use climate models as an example:
- Real world - how the carbon cycle works, solar radiation, forcings, etc.
- Theory - the math underlying climate, e.g. differential equations, Navier-Stokes, thermodynamics.
- Software - how to write effective Fortran code
- Execution - how to compile Fortran, and deploy it to a cluster
- Operations - how to use a climate model in production, including running experiments, testing outputs.
What this paper does is show how building scientific software is about moving between these worlds. I think the contention is that while more conventional software (payroll management) has elements of all five, the real world is easier to understand, and the theory does not require advanced math. Plus the software is likely written in a more familiar language. But scientists probably don’t have a lot of training in the last three (software, execution, operations), at least in the surveys done so far.
Tech Debt in Scientific Software
TODO: (Arvanitou et al. 2022) - How do SE practices mitigate TD. TODO: Melina Vidoni’s papers on R packages
- (Eisty, Thiruvathukal, and Carver 2018) - a survey on how RSEs use metrics. They found that RSEs have little knowledge of metrics, but of the ones used, performance and test metrics were most common. In Appendix A they report on the types of metrics - only one respondent had heard of TD and none used it.
- (Connolly et al. 2023) argues for a focus on the Three Rs - Readability, Resilience, and Reuse. They detail the ways in which these three things can be accomplished depending on the importance of the project, e.g., individual, group, or community. It is not explicit about technical debt except that it focuses on software ‘resilience’.
Tech Debt and External Dependencies
- Konrad Hinsen (Hinsen 2015) writes that the main issue is the dependency problem - e.g., in Konrad’s case, the transition to Python 3 or new versions of Numpy (see the sketch after this list).
- (Lawrence et al. 2018) writes about ‘crossing the chasm’. The old free-lunch model said new improvements in the same architecture (x86 for example) would improve speed. But now one needs to take advantage of parallelism and multicore, which requires hardware-specific optimizations. There is a very thin abstraction over the underlying hardware in these performance intensive environments, which means even end-users often need to know obscure memory architecture details to squeeze out concurrency.
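As a hedged sketch of the defensive code this dependency churn forces on maintainers (my example, not Hinsen’s): a small compatibility shim so analysis scripts keep working across library versions, plus pinned versions for reproducing published results. I believe newer NumPy prefers np.trapezoid over the older np.trapz, but treat the exact rename as illustrative.

```python
import numpy as np

# Compatibility shim: prefer the newer name if it exists, fall back otherwise,
# so the same analysis script runs on both older and newer NumPy installs.
_trapezoid = getattr(np, "trapezoid", None) or np.trapz

def integrate_series(y, x):
    """Integrate sampled data y(x) with whichever trapezoid routine is available."""
    return _trapezoid(y, x)

# Alongside the shim, pin known-good versions (e.g., in requirements.txt) so a
# published result can be re-run years later:
#   numpy>=1.24,<3
```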
Types of Scientific Software
Like all software, there is no one-size-fits-all definition of scientific software. It can span many domains, is of varying complexity, written in different languages, etc. However, broadly speaking there are hobby projects and professional projects, characterized mostly by the number of support engineers and budget for operations. A hobby project is something a single PhD student might start and is often open source. She is the only developer and it is part of the PhD research. A professional project is something like the Atlas Athena software, with hundreds of contributors, full time staff, and decades of history. And of course this is a continuum. The German Aerospace Center (DLR) has similar guidelines, where level 0 is for personal use and level 3 is long-lived, safety critical code.
Scientific Software in Canada
The state of the practice for RSEs in Canada is pretty dire. From a government perspective, we spent a lot of time (and $$) on building infrastructure: connecting things with high speed networks (CANARIE) and large compute clusters (Compute Canada). Then, for murky political reasons, there was a transition from those orgs to a central one (The Digital Research Alliance). Unfortunately it seems that while the tangible cluster and network stuff continues to get buy-in from the main funder, Innovation, Science, and Economic Development Canada¹, the software piece is harder to motivate.
Canada has no research software engineering alliance, like the UK, Germany, and the US do. We have no real research labs, like the US DOE labs, and we don’t really do defence research outside of the DND Research groups. We once had software research in the National Research Council, but that was axed, again for reasons I don’t understand but that had something to do with cost cutting.
Fortunately, there are some excellent folks in the space who are trying to keep things afloat, a few folks at the Alliance, and some (like me) academics. There are also top notch specialists running the clusters and software support teams at the universities, like UVic’s research computing team.
Things I’d like to know more about
- How much time does a developer spend on the “science” part of the code, and how much on ancillary roles?
- Can we separate the science logic from the non-science logic?
- What is the TD inherent/possible in translating from science to software? Pub pressure, student knowledge, legacy code.
- “Can we quantify or explain this loss/difference, and articulate the trade-offs resulting from translation?”
- How do we compare different scientific approaches simply from the software alone?
- How do you retract/code review the scientific code?
- What is the equivalent to peer review of the code?
- What if the code is a complex model that is unexplainable? How do we test it? Where is the science?
- Can we trace the way in which the design of the code has changed from its initial design to its current design?
- Social debt: how do we check what the implications are? How does large-team science play a role?
- Ciera Jaspan’s paper (Jaspan and Green 2023): tools can tell you the current indicators. But what matters is how context defines this as a problem or not. E.g., migrating to Python 3, undocumented Navier-Stokes code. How do we extract this contextual knowledge from a project?
To Read
Initiatives
- Better Scientific Software - training materials for RSEs.
- Code Refinery - more training
- Software Carpentry
- Software Sustainability Institute
- US-RSI
- NumFocus grants
- Chan/Zuckerberg grants
- Exascale Computing - Interoperable Design of Extreme-scale Application Software (IDEAS) DOE 5 year software program
- NSF large instrument group
- RESA
- IRIS-HEP
- Collegeville workshops
Various “scientific software communities of practice” as mentioned in the Connolly article, at UW, CMU, etc.
Venues
- Conferences, meetings, workshops
- SE4Science workshop
- Supercomputing conference workshops
- US-RSE conference - October
- UK-RSE conference - September. Why these two are so close in time is a puzzle.
- Alliance Canada Research Software conf. Now discontinued :(
- Journals
- JOSS (and unnamed proprietary journal ending in X)
- Geoscientific Model Development (GMD)
- Computing in Science & Engineering
Meta-research
Here are some papers that have looked at discipline-specific research software:
Archaeology
Zach Batist maintains open-archaeo.info, which lists open source archaeology packages. In (Batist and Roe 2024) he and his co-author show that most of the computational work is data analysis, with some packages in R for doing things like Carbon-14 calibration. There is also little apparent reuse of open source tools.
Example Projects
Meta-Listings
- https://rseng.github.io/software/
- https://open-archaeo.info
Climate
Astronomy
- SKAO
- AstroPy
- Einstein
Bio
Glossary
- RSE: Research Software Engineer
- SSI: Software Sustainability Institute
- HPC: high performance computing, e.g., ‘supercomputers’
References
Footnotes
1. I initially wrote this as “industry and economic development”, which really gives the game away, as most of the money seems to be used in subsidizing industry in a desperate and futile attempt at improving productivity.