Scientific Software Development

Structured Bibliography


Neil Ernst



April 25, 2024


We’ve been doing a lot of research in my group around scientific software and technical debt, funded by the Sloan Foundation. As part of that work, I’ve written a post about the topic, along with some open research questions. While it is mainly for our own use, perhaps others will find this helpful. It is an extended lit review, loosely organized into themes.

Scientific software, or research software, is software that is written to support a scientific endeavour, including data collection, modeling and simulation, end user support, and others. Examples include codes for modeling climate change, calculating shear forces in buildings, and managing observation time on telescopes.

Other definitions: FAIR for Research Software, the DLR categories, (Hasselbring et al. 2024). The main challenge is that some research software moves: it may start as DLR application class 0 (personal), but over time end up in application class 2 (long-term libraries) or 3 (safety critical). AstroPy is a good example: it consists of common astronomical operations that many astronomers use, but started as core functions in individual projects.

My Journey

I worked in spatial analysis in undergrad: as an intern, mapping water rights, goat habitat, producing maps for the coast guard. I then worked in my masters with ontologies for cancer biology. In my time at the SEI, I was lucky to work on early planning for the US research software sustainment program from the NSF, which introduced me to the US movement for RSEs, with people like Dan Katz and Jeff Carver.

Since joining the university as a professor, I have been intrigued by the challenges of developing complex software, in particular, how design choices influence subsequent problems, i.e., technical debt. This also aligns with a wider sense I have that software for climate modeling is a key capability for our future.

Early Work

The earliest work in RSE is basically the field of numerical analysis and HPC. In some sense, the entire discipline of software engineering is derived from scientific software, using simulations to model nuclear weapons, do weather forecasting, etc.

The field of RSE seems to have started between 2000 and 2005 or so. I think this coincides with general availability of software and the internet and web to connect people. There was a lot of focus on e.g. computational workflows at the time from Carole Goble. People were working on these issues before (e.g., at large US national nuclear labs such as Los Alamos) but there wasn’t a clear definition of the job area.

Jeff Carver’s first workshop was around 2008.

(Segal 2009) is an early paper at CSCW looking at the socio-cultural dimension. A few interesting things:

  • One, the idea that the main issue is to support the science. This is the key requirement.
  • There is an early focus in some of these papers on the scientist as end-user programmer. This was a popular notion in the 2000s that I think was on the one hand an utter failure (e.g. VBA), but on the other hand, just natural with the right UI (e.g., Excel formulas). And AI will make this much easier. Greg Wilson said something about how when he was teaching scientists in the 2000s, he struggled to motivate why they should not use Excel. Diane Kelly pushes back on this - of which more below.
  • A really detailed ethnography on a single lab management project.
  • As with most software projects, this one struggled with team dynamics and power.

A followup is (Segal and Morris 2012). This paper is similar to the preceding one, except for a more generic focus on the main differences with conventional development, and the ubiquitous (for the time) focus on Agile.

I should digress here for a moment and say I don’t find software process very interesting. The models (Scrum, Lean, Waterfall, etc.) are all highly idealized and in my experience rarely followed in practice. Asking if we should be “agile” is answering the wrong question. You want to deliver things faster and with higher quality, so looking at practices to help that is the key. Anyhoo.

James Herbsleb did some work as well e.g. (Howison and Herbsleb 2013). They looked at incentives.

What Makes Scientific Software Developers Different?

Scientific software seems to have a different context than a lot of other code. For example, it might be continually maintained. It has different testing needs. RSEs therefore have different approaches. Side note: I am not yet convinced scientific software really is different. A lot of the issues–complex domain requirements, performance intensive problems, team composition–are common in other domains as well.

(Pinto, Wiese, and Dias 2018) and (Wiese, Polato, and Pinto 2020) both report on a survey replication on scientific software developers. Nothing jumped out at me - the problems seem mostly similar to normal development; library problems, stable requirements, etc. Surprising to me was that only 5% of the issues seem science related. But I wonder if that is because the questions were not clear. If one asked where the major cost and development effort is spent, or talked to end users …

(Cosden, McHenry, and Katz 2023) is a survey of RSEs that looks at how they get into the field. Most are domain experts (75%) and the rest are CS grads. The two groups face different education challenges.

(Carver et al. 2022) continued this; again the findings are mostly interesting as a catalog of the state of things.

A lot of the difference, from what I can tell, is that for a long time RSE was not a career path, and so those folks were temporary, poorly paid, and not recognized.

Testing Scientific Software

(Carver et al. 2007) did a series of case studies looking at how scientific code was maintained, and why Fortran was so popular. I think this paper may have harmed the field, by downplaying some of the complexity involved. It comes across as “this is pretty easy stuff”. But some big projects in the DOD space are studied. Around the same time came (Basili et al. 2008). Some observations from that paper:

  • Although HPC is focused on “performance”, for scientists the performance of the code is less interesting than “time to result”, i.e., a publishable outcome. That time spans writing the code, testing it, running the simulation, etc.
  • Validation is tricky; outputs are often non-deterministic and probabilistic, inherent in simulating and modelling complex phenomena.
  • Programs are long-lived so there is deep scepticism of new tools as plenty of tools cannot make it 30 years. Funders authorizing Voyager, SKA, CERN’s LHC expect the billions of dollars to be used for decades. The software should be able to match that.
  • Programmers love being close to the metal, to keep things speedy.
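
The validation point can be made concrete with a small sketch. For stochastic code, an exact-match assertion is the wrong oracle; one common alternative is to fix the random seed for reproducibility and accept any result within a few standard errors of a known reference value. The Monte Carlo estimate of π below is a hypothetical stand-in for a real simulation, and the names `estimate_pi` and `check_estimate` are illustrative assumptions, not anything taken from the papers above.

```python
import math
import random


def estimate_pi(n_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of pi from points falling inside the unit quarter-circle."""
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


def check_estimate(n_samples: int, seed: int, n_sigma: float = 5.0) -> bool:
    """Accept the result if it lies within n_sigma standard errors of the known value."""
    # A seeded generator makes any failure reproducible.
    estimate = estimate_pi(n_samples, random.Random(seed))
    # Each sample is a Bernoulli trial with success probability pi/4,
    # so the standard error of the estimate is 4 * sqrt(p * (1 - p) / n).
    p = math.pi / 4.0
    std_error = 4.0 * math.sqrt(p * (1.0 - p) / n_samples)
    return abs(estimate - math.pi) <= n_sigma * std_error


if __name__ == "__main__":
    print(check_estimate(n_samples=50_000, seed=1))
```

The tolerance here is derived from the sampling error rather than picked by hand, so it tightens automatically as `n_samples` grows. Real models rarely have a known true value, so in practice the reference is more likely to be an analytic limit case or a trusted earlier run.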

A related paper is (Hook and Kelly 2009), which, while focused on mutation testing, has a nice figure showing the ways error can make its way into scientific code.
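
To illustrate the general idea of mutation testing for numerical code, here is a minimal sketch. It is a generic illustration, not the procedure from the paper: the trapezoid integrator, the two hand-written mutants, and the tolerance-based oracle are all assumptions chosen for brevity.

```python
import math
from typing import Callable, List

Func = Callable[[float], float]
Integrator = Callable[[Func, float, float, int], float]


def trapezoid(f: Func, a: float, b: float, n: int) -> float:
    """Composite trapezoid rule on n equal sub-intervals (the code under test)."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def mutant_endpoint_weight(f: Func, a: float, b: float, n: int) -> float:
    """Mutant: the endpoint weight 0.5 is replaced by 1.0."""
    h = (b - a) / n
    total = 1.0 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def mutant_off_by_one(f: Func, a: float, b: float, n: int) -> float:
    """Mutant: the loop skips the last interior point."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n - 1):
        total += f(a + i * h)
    return total * h


def passes(integrate: Integrator, tol: float) -> bool:
    """Test oracle: the integral of sin over [0, pi] should equal 2, within tol."""
    return abs(integrate(math.sin, 0.0, math.pi, 100) - 2.0) <= tol


def surviving_mutants(tol: float) -> List[str]:
    """Names of the mutants that the test fails to detect at this tolerance."""
    mutants = {
        "endpoint_weight": mutant_endpoint_weight,
        "off_by_one": mutant_off_by_one,
    }
    return [name for name, mutant in mutants.items() if passes(mutant, tol)]


if __name__ == "__main__":
    for tol in (1e-2, 1e-3, 1e-4):
        print(tol, passes(trapezoid, tol), surviving_mutants(tol))
```

With a tolerance of 1e-3 the off-by-one mutant is caught but the endpoint-weight mutant survives, because sin is (numerically) zero at both ends of [0, π], so this test input never exercises that line. Loosening the tolerance to 1e-2 lets both mutants through, while tightening it to 1e-4 rejects the correct code, since the trapezoid rule itself has a discretization error of about 1.6e-4 at n = 100.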

TODO: (Babuska and Oden 2004) TODO: (Eisty and Carver 2022)

Theories and Models

(Jay et al. 2020) reports on a workshop on translating scientific theories into code. I feel like this is where my interest is most piqued at the moment.

In addition to addressing the general difficulties common to all software development projects, research software must represent, manipulate, and provide data for complex theoretical constructs. Such a construct may take many forms: an equation, a heuristic, a method, a model; here we encapsulate all of these, and others, in the term theory.

They point out the various places things can go wrong: in the science, in the code, and in the translation.

The whole idea of scientific computing is to test an imperfect theory of the (natural) world. As such, the code and the theory often trade off:

Although it is natural to think (and is most often indeed the case) that one needs to formulate the equations and then apply computational algorithms to obtain the numerical solutions, the formulation of the equations can be affected by the choice of computational method. Cf. the simulations books

This blog post covers some of the early papers here in detail, although it gets the intuition of chaos theory incorrect. It distinguishes between “error of measurement” and “error of specification”, looking at the tradeoff in which a more accurate model is also more likely to suffer from compounding measurement error.

Types of models/theories

  1. Mental model of code: (Naur 1985)’s idea of theory.

    • the code encapsulates a theory, that different people come to different, ideally shared, understandings of.
    • each developer then adds to that theory his/her understanding.
    • The theory (embedded in the software) is refined and adapted over time, e.g., with refactoring, new features, bugs, etc.
    • The code in turn relies on different theories in architecture of the hardware, programming language, and packages/dependencies (e.g., what access control means)
    • The theory might be encoded as a conceptual model, using a model-driven step as well, e.g., Simulink or Matlab code generation.
    • There might be an explicit model the software presents for the science it encodes (“climate simulation using a 1km grid”), and another for ancillary functions.
  2. A scientist has a theory, which the code should help to test/validate/confirm (choose your epistemological poison).

  3. The end users have requirements and expectations of the code, as they use it.

Domains of Knowledge

I really liked this insight from an early researcher in RSE, Diane Kelly. (Kelly 2015) summarizes work on nuclear scientists in Canada and identifies knowledge domains. I’ll use climate models as an example:

  1. Real world - how the carbon cycle works, solar radiation, forcings, etc.
  2. Theory - the math underlying climate, e.g. differential equations, Navier-Stokes, thermodynamics.
  3. Software - how to write effective Fortran code
  4. Execution - how to compile Fortran, and deploy it to a cluster
  5. Operations - how to use a climate model in production, including running experiments, testing outputs.

What this paper does is show how building scientific software is about moving between these worlds. I think the contention is that while more conventional software (payroll management) has elements of all 5, the real world is easier to understand, and the theory does not require advanced math. Plus the software is likely written in a more familiar language. But scientists probably don’t have a lot of training in 3, 4, 5, at least in the surveys done so far.

Tech Debt in Scientific Software

TODO: (Arvanitou et al. 2022) - How do SE practices mitigate TD. TODO: Melina Vidoni’s papers on R packages

  • (Eisty, Thiruvathukal, and Carver 2018) - a survey on how RSEs use metrics. They found that RSEs have a low knowledge of metrics, but of the ones used, performance and test metrics were most common. In appendix A they report on the types of metrics - only one respondent had heard of TD and none used it.
  • (Connolly et al. 2023) argues for a focus on the Three Rs - Readability, Resilience, and Reuse. They detail the ways in which these three things can be accomplished depending on the importance of the project, e.g., individual, group, or community. It is not explicit about technical debt except that it focuses on software ‘resilience’.

Tech Debt and External Dependencies

  • Konrad Hinsen (Hinsen 2015) writes that the main issue is the dependency problem - e.g. in Konrad’s case, changes to Python 3 or new versions of Numpy.
  • (Lawrence et al. 2018) writes about ‘crossing the chasm’. The old free lunch model said new improvements in the same architecture (x86 for example) would improve speed. But now we need to take advantage of parallelism and multicore, which require hardware-specific optimizations. There is a very thin abstraction over the underlying hardware in these performance intensive environments, which means even end-users often need to know obscure memory architecture details to squeeze out concurrency.

Types of Scientific Software

Like all software, there is no one size fits all definition of scientific software. It can span many domains, is of varying complexity, written in different languages, etc. However, broadly speaking there are hobby projects and professional projects, characterized mostly by the number of support engineers and budget for operations. A hobby project is something a single PhD student might start and is often open source. She is the only developer and it is part of the PhD research. A professional project is something like the ATLAS Athena software, with hundreds of contributors, full time staff, and decades of history. And of course this is a continuum. The German Aerospace Center (DLR) has similar guidelines, where level 0 is for personal use, and level 3 is long-lived, safety critical code.

Scientific Software in Canada

The state of the practice for RSEs in Canada is pretty dire. From a government perspective, we spent a lot of time (and $$) on building infrastructure. That was connecting things with high speed networks (CANARIE) and large compute clusters (Compute Canada). Then, for murky political reasons, there was some transition from those orgs to a central one (The Digital Research Alliance). Unfortunately it seems while the tangible cluster and network stuff continues to get buy-in from the main funders, Innovation, Science, and Economic Development Canada[1], the software piece is harder to motivate.

Canada has no research software engineering alliance, like the UK, Germany and the US do. We have no real research labs, like the US DOE labs, and we don’t really do defence research outside of the DND Research groups. We once had software in the National Research Council, but that was axed, again, for reasons I don’t understand but had something to do with cost cutting.

Fortunately, there are some excellent folks in the space who are trying to keep things afloat, a few folks at the Alliance, and some (like me) academics. There are also top notch specialists running the clusters and software support teams at the universities, like UVic’s research computing team.

Things I’d like to know more about

  1. how much time does a developer spend on the “science” part of the code, and how much on ancillary roles
  2. Can we separate the science logic from the non-science logic?
    1. What is the TD inherent/possible in translating from science to software? Pub pressure, student knowledge, legacy code
    2. “Can we quantify or explain this loss/difference, and articulate the trade-offs resulting from translation?”
  3. how do we compare different scientific approaches simply from software alone?
  4. how do you retract/code review the scientific code?
    1. what is the equivalent to peer review of the code?
    2. what if the code is a complex model that is unexplainable? how do we test it? where is the science?
  5. Can we trace the way in which the design of the code has changed from its initial design to the proper current design?
  6. Social debt: how do we check what the implications are? How does large team science play a role?
  7. Ciera Jaspan’s paper (Jaspan and Green 2023): tools can tell you the current indicators. But what matters is how context defines this as a problem or not. E.g., migrating to Python 3, undocumented Navier-Stokes code. How do we extract this contextual knowledge from a project?

To Read


Various “scientific software community of practice” as mentioned in the Connolly article, at UW, CMU, etc.


Example Projects




  • Biology
  • rOpenSci and relevant paper
  • PsychoPy: This project is related to psychology and neuroscience.
  • biopython: This project is related to Molecular Biology.
  • RDKit: This project is related to Chemistry Informatics


  • RSE: Research Software Engineer
  • SSI: Software Sustainability Institute
  • HPC: high performance computing, e.g., ‘supercomputers’


Arvanitou, Elvira-Maria, Nikolaos Nikolaidis, Apostolos Ampatzoglou, and Alexander Chatzigeorgiou. 2022. “Practitioners’ Perspective on Practices for Preventing Technical Debt Accumulation in Scientific Software Development.” In Proceedings of the 17th International Conference on Evaluation of Novel Approaches to Software Engineering. SCITEPRESS - Science and Technology Publications.
Babuska, Ivo, and J. Tinsley Oden. 2004. “Verification and Validation in Computational Engineering and Science: Basic Concepts.” Computer Methods in Applied Mechanics and Engineering 193 (36-38): 4057–66.
Basili, Victor R., Jeffrey C. Carver, Daniela Cruzes, Lorin M. Hochstein, Jeffrey K. Hollingsworth, Forrest Shull, and Marvin V. Zelkowitz. 2008. “Understanding the High-Performance-Computing Community: A Software Engineer’s Perspective.” IEEE Software 25 (4): 29–36.
Carver, Jeffrey C., Richard P. Kendall, Susan E. Squires, and Douglass E. Post. 2007. “Software Development Environments for Scientific and Engineering Software: A Series of Case Studies.” In 29th International Conference on Software Engineering (ICSE07). IEEE.
Carver, Jeffrey C., Nic Weber, Karthik Ram, Sandra Gesing, and Daniel S. Katz. 2022. “A Survey of the State of the Practice for Research Software in the United States.” PeerJ Computer Science 8 (May): e963.
Connolly, Andrew, Joseph Hellerstein, Naomi Alterman, David Beck, Rob Fatland, Ed Lazowska, Vani Mandava, and Sarah Stone. 2023. “Software Engineering Practices in Academia: Promoting the 3Rs—Readability, Resilience, and Reuse.” Harvard Data Science Review 5 (2).
Cosden, Ian A., Kenton McHenry, and Daniel S. Katz. 2023. “Research Software Engineers: Career Entry Points and Training Gaps.” Computing in Science & Engineering, 1–9.
Eisty, Nasir U., and Jeffrey C. Carver. 2022. “Testing Research Software: A Survey.” arXiv:2205.15982.
Eisty, Nasir U., George K. Thiruvathukal, and Jeffrey C. Carver. 2018. “A Survey of Software Metric Use in Research Software Development.” In 2018 IEEE 14th International Conference on e-Science (e-Science). IEEE.
Hasselbring, Wilhelm, Stephan Druskat, Jan Bernoth, Philine Betker, Michael Felderer, Stephan Ferenz, Anna-Lena Lamprecht, Jan Linxweiler, and Bernhard Rumpe. 2024. “Toward Research Software Categories.”
Hinsen, Konrad. 2015. “Technical Debt in Computational Science.” Computing in Science & Engineering 17 (6): 103–7.
Hook, Daniel, and Diane Kelly. 2009. “Mutation Sensitivity Testing.” Computing in Science & Engineering 11 (6): 40–47.
Howison, James, and James D. Herbsleb. 2013. “Incentives and Integration in Scientific Software Production.” In Proceedings of the 2013 Conference on Computer Supported Cooperative Work, 459–70. CSCW ’13. New York, NY, USA: Association for Computing Machinery.
Jaspan, Ciera, and Collin Green. 2023. “Defining, Measuring, and Managing Technical Debt.” IEEE Software 40 (3): 15–19.
Jay, Caroline, Robert Haines, Daniel S. Katz, Jeffrey C. Carver, Sandra Gesing, Steven R. Brandt, James Howison, et al. 2020. “The Challenges of Theory-Software Translation.” F1000Research 9 (October): 1192.
Kelly, Diane. 2015. “Scientific Software Development Viewed as Knowledge Acquisition: Towards Understanding the Development of Risk-Averse Scientific Software.” Journal of Systems and Software 109 (November): 50–61.
Lawrence, Bryan N., Michael Rezny, Reinhard Budich, Peter Bauer, Jörg Behrens, Mick Carter, Willem Deconinck, et al. 2018. “Crossing the Chasm: How to Develop Weather and Climate Models for Next Generation Computers?” Geoscientific Model Development 11 (5): 1799–1821.
Naur, Peter. 1985. “Programming as Theory Building.” Microprocessing and Microprogramming 15 (5): 253–61.
Pinto, Gustavo, Igor Wiese, and Luiz Felipe Dias. 2018. “How Do Scientists Develop Scientific Software? An External Replication.” In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE.
Segal, Judith. 2009. “Software Development Cultures and Cooperation Problems: A Field Study of the Early Stages of Development of Software for a Scientific Community.” Computer Supported Cooperative Work (CSCW) 18 (5-6): 581–606.
Segal, Judith, and Chris Morris. 2012. “Developing Software for a Scientific Community.” In Handbook of Research on Computational Science and Engineering, 177–96. IGI Global.
Wiese, Igor, Ivanilton Polato, and Gustavo Pinto. 2020. “Naming the Pain in Developing Scientific Software.” IEEE Software 37 (4): 75–82.


  1. I initially wrote this as “industry and economic development” which really gives the game away, as most of the money seems to be used in subsidizing industry in a desperate and futile attempt at improving productivity.