Posted by: Neil in General
A recent article in IEEE Software magazine studied the impact of IEEE Software papers from 1984 onwards. The paper, well-hidden behind the paywall, has a list of the 25 most cited articles in the magazine. I thought I’d reproduce the top 10 here with links to the abstracts, or, where possible, the paper itself. For those unfamiliar with the difference, a magazine article is typically a less academic summary of scholarly work, directed at both academics and practitioners.
Credit belongs to Daniel O’Leary, whose work this is.
- E.J. Chikofsky and J.H. Cross, “Reverse Engineering and Design Recovery: A Taxonomy,” vol. 7, no. 1, 1990, pp. 13–17.
- M.T. Heath and J.A. Etheridge, “Visualizing the Performance of Parallel Programs,” vol. 8, no. 5, 1991, pp. 29–39.
- P.B. Kruchten, “The 4+1 View of Architecture,” vol. 12, no. 6, 1995, pp. 42–50.
- M.C. Paulk et al., “Capability Maturity Model, Version 1.1,” vol. 10, no. 4, 1993, pp. 18–27. [full report]
- A. Hall, “7 Myths of Formal Methods,” vol. 7, no. 5, 1990, pp. 11–19.
- B. Boehm, “Software Risk Management: Principles and Practices,” vol. 8, no. 1, 1991, pp. 32–41.
- R. Prieto-Diaz and P. Freeman, “Classifying Software for Reusability,” vol. 4, no. 1, 1987, pp. 6–16.
- D.R. Cheriton, “The V-Kernel: A Software Base for Distributed Systems,” vol. 1, no. 2, 1984, pp. 19–42.
- J.D. Musa, “Operational Profiles in Software Reliability Engineering,” vol. 10, no. 2, 1993, pp. 14–32.
- C. Potts, K. Takahashi, and A.I. Anton, “Inquiry-Based Requirements Analysis,” vol. 11, no. 2, 1994, pp. 21–32.
O’Leary, Daniel, “The Most Cited IEEE Software Articles,” IEEE Software, vol. 26, no. 1, pp. 12–14, Jan.–Feb. 2009.
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4721174&isnumber=4721166
Tags: citations, software
Posted by: Neil in General
This is part of the MSR Challenge series.
The Gnome bugzilla dump is a supposedly valid XML file, approximately 3.5 gigabytes in size. Needless to say, this is an impressive amount of data. There’s a common belief in the community that scale changes everything, and my recent encounters bear this out.
I’ve designed my parser to extract data from the file using SAX, a stream-based XML API. In other words, it doesn’t try to load the whole document into memory the way DOM or ElementTree does. I tested my SAX approach on some sample data I extracted and got that working with no problem. The next step was to test it on the real thing, which is (I think) an overnight run.
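A minimal sketch of the streaming approach, using Python's standard xml.sax module. The element names bug and bug_id and the BugCounter class are assumptions made for illustration, not taken from the actual dump or from the parser described here.

```python
import io
import xml.sax


class BugCounter(xml.sax.ContentHandler):
    """Counts <bug> elements and collects their ids without building a tree."""

    def __init__(self):
        super().__init__()
        self.bug_count = 0
        self.bug_ids = []
        self._in_bug_id = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "bug":  # hypothetical element name
            self.bug_count += 1
        elif name == "bug_id":  # hypothetical element name
            self._in_bug_id = True
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so accumulate them.
        if self._in_bug_id:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "bug_id":
            self.bug_ids.append("".join(self._buffer).strip())
            self._in_bug_id = False


def count_bugs(source):
    """Parse a filename or a binary file-like object as a stream."""
    handler = BugCounter()
    xml.sax.parse(source, handler)
    return handler


sample = (
    b"<bugzilla>"
    b"<bug><bug_id>1</bug_id></bug>"
    b"<bug><bug_id>2</bug_id></bug>"
    b"</bugzilla>"
)
result = count_bugs(io.BytesIO(sample))
print(result.bug_count, result.bug_ids)
```

Because the handler only keeps the fields it cares about, memory use stays roughly flat no matter how large the input is. Passing a file path to count_bugs instead of the BytesIO object should work the same way.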
Some of the lessons I’ve since learned:
- Run long-running processes from the command-line, not Eclipse. That way you can log in from home to check the status, kill the process, etc.
- Generate progress information, e.g., bugs processed, so one can tell that something is happening.
- Write data to a file immediately, so that you can check the output remotely.
- Code inspection is your friend. Errors that didn’t exist in my example (e.g., malformed dates) do exist in the data file.
- Character handling is hard. I got some malformed data (line 50 million, usefully), which killed the parser. Now I have to figure out how to work around this.
- The time command is your friend. It gives a tiny bit of profiling data, which is often enough.
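The progress and write-immediately lessons can be sketched roughly as follows. The process_bugs function and its report_every parameter are hypothetical names for illustration; in a real run, out would be an open file and the log would go to stderr.

```python
import io
import sys
import time


def process_bugs(bugs, out, report_every=1000, log=sys.stderr):
    """Write one line per bug immediately and report progress periodically."""
    start = time.monotonic()
    count = 0
    for count, bug in enumerate(bugs, start=1):
        out.write(f"{bug}\n")
        out.flush()  # push the line out so `tail -f` on the file shows it
        if count % report_every == 0:
            elapsed = time.monotonic() - start
            print(f"{count} bugs processed in {elapsed:.1f}s", file=log, flush=True)
    return count


# In-memory stand-ins for the output file and the log stream.
out = io.StringIO()
log = io.StringIO()
total = process_bugs(range(5), out, report_every=2, log=log)
print(total)
print(log.getvalue(), end="")
```

Flushing after every record costs some speed. On a multi-gigabyte run it may be worth flushing only when a progress line is printed.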
My recent difficulty was in handling ill-formed character data. In XML, certain characters are invalid, and encountering one of them causes the parser to halt. Apparently one cannot simply route around it either. So, I’ve had to extend my codebase to filter out these annoying characters first, then parse the XML.
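As a rough illustration of the filtering step, the sketch below removes every character that falls outside the Char production of the XML 1.0 specification. The pattern is written from that production and is not necessarily the expression used in the original code. The name strip_invalid_xml_chars is made up for this example.

```python
import re

# Anything NOT allowed by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 forbids removed."""
    return INVALID_XML_CHARS.sub("", text)


dirty = "<desc>crash\x00 on\x0b start\tup\n</desc>"
clean = strip_invalid_xml_chars(dirty)
print(repr(clean))
```

This works on already-decoded text, so the input has to be decoded first (for example, line by line). That is exactly where encoding problems tend to show up on a large file.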
My first approach was a regular expression mentioned by Sam Ruby, which seems to work fine, on small files. It died on my large dataset, mainly due to my ignorance of UTF-8 and the byte order mark (BOM), so I moved on to a better method: the Unix tr tool, as follows:
tr -d '[:cntrl:]' < in.xml > out.xml
On my machine (3 gigs RAM, P4 2.4), this took 5 minutes to process the file. Nice.
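For comparison, here is a hedged Python sketch of a similar byte-level filter. It is not an exact equivalent of the tr command: the [:cntrl:] class also covers tab, newline, carriage return and DEL, while this version keeps those, since XML permits them. It assumes the file is UTF-8 (or ASCII), where bytes below 0x80 never occur inside a multi-byte sequence, so dropping them cannot corrupt other characters. The function name and chunk size are arbitrary choices.

```python
import io

# ASCII control bytes 0x00-0x1F, except tab (0x09), LF (0x0A) and CR (0x0D).
CONTROL_BYTES = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))


def strip_control_bytes(src, dst, chunk_size=1 << 20):
    """Copy src to dst in chunks, deleting control bytes; return how many were removed."""
    removed = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        cleaned = chunk.translate(None, CONTROL_BYTES)
        removed += len(chunk) - len(cleaned)
        dst.write(cleaned)
    return removed


# In-memory demonstration; real use would open the files in binary mode.
src = io.BytesIO("a\x00b\tc\n\x1b\u00e9".encode("utf-8"))
dst = io.BytesIO()
removed = strip_control_bytes(src, dst)
print(removed, dst.getvalue())
```

With real files this would look like opening in.xml with mode "rb" and out.xml with mode "wb" and passing both handles to strip_control_bytes. The return value gives a quick count of how much bad data was in the dump.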
If you have trouble with a file’s encoding or characters, I recommend looking at it with a hex editor like GHex2, which will show all the strangeness, and using the UTF-8 decoder page to see what the hex code points mean.
Tags: python, unicode, xml
Posted by: Neil in General
See here: Deadline Iraq: Untold Stories of the Iraq War. It’s a good documentary, but my problem with it is that I don’t think this type of unnarrated exposition does enough to raise the important questions. We sort of flit from scene to scene: a firefight, a missile explosion, a hospital… but what is the connection? What message should we take from it all? A few of mine:
War seems so banal in the videos, unreal, just a bunch of guys yelling at each other. Bullets fly through the air with a muted buzz. These sounds and images can’t possibly capture the fear and terror and adrenaline flowing. How can we do this?
Are embeds soldiers or journalists? The big tragedy is that some journalists have been shot in the Palestine Hotel? Amid all the other shit going on in Iraq? The torture before the invasion, the murders after the invasion?
In every country, you can find any thousand people willing to say what you want. So photos of cheering, jeering, dead civilians, and so on, really don’t tell us anything. I think this still affects our coverage of the Middle East. I think what is most telling is Peter Mansbridge’s comment, that ideally you would be able to go anywhere you want on the battlefield, but that you cannot, so you make do. Sometimes this means accepting what the local authorities are willing, or want, to show.
Maher Abdullah: there is no such thing as an objective journalist. And if that’s the case, then what is the point? Presumably the point of reportage on these situations is to give the decision-makers, the public, all the information they need. But the people on the frontlines only send these stories to their editors, and then the editors make decisions about how those photos will sell papers. On top of that, there are Pentagon apologists who get paid to go on CNN/Fox/MSNBC and offer their ‘analysis’ of the war. One minute of their (unacknowledged) biased opinions can wreck any useful data coming out of the theatre of war. I’m sure the reason the Pentagon allowed embeds is because the staffers in Washington realize the average journalist isn’t a threat, so long as they can steer the distribution back home.
One of the journalists claims this was the best-covered war in history, but what is the coverage? If the public never sees dead Americans, or gets the meaningful, in-depth analyses (monthly death tolls, economic ruin, etc.), then what is the point? That we see some neat photos and videos, make some TV shows, and on we go?
Perhaps blogs are one way to address these problems. People like Michael Yon, who are freelance and self-publish, and most importantly, are very clear about their biases. He’s resolutely pro-military, but he still brings some important data points to the conflicts he covers.
Tags: iraq, journalism