Semantic Werks

Thoughts on people, machines and systems.

Posts Tagged ‘xml’

MSR Challenge: large files revisited

with 2 comments

This is part of the MSR Challenge series.

Ignore my earlier advice with respect to handling large XML files. My new favorite tool is Vim. Although it takes a few minutes, Vim can easily go to the correct line to fix problems. So if your XML fails to parse, just figure out the line number (it should be printed in the SAX parse exception), then open Vim with vim +<line_num> <file_name>. It thinks for a while, but opens to the correct line without trouble. Then you can delete the offending characters (highlighted for me as ^S) and save the file (wait a few more minutes).
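
If the exception message scrolls past or gets lost, the line number can be recovered again. Here is a minimal Python sketch, assuming the standard-library xml.sax module; the function name first_parse_error and the sample bytes are made up for illustration, and this is not the parser used in this series.

```python
import io
import xml.sax


def first_parse_error(stream):
    """Parse an XML stream with SAX and return (line, column, message) for
    the first fatal error, or None if the document parses cleanly."""
    try:
        # A bare ContentHandler ignores all content; we only want the error.
        xml.sax.parse(stream, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as err:
        return err.getLineNumber(), err.getColumnNumber(), err.getMessage()
    return None


# Hypothetical sample: a control character (0x13, which Vim shows as ^S)
# sits on the third line.
sample = b"<bugs>\n<bug>ok</bug>\n<bug>bad \x13 char</bug>\n</bugs>\n"
print(first_parse_error(io.BytesIO(sample)))
```

The reported line number is what goes after the + in the vim command. For a real file, pass a handle opened in binary mode, e.g. open(file_name, "rb").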

My first problem was the invalid characters (control characters); the second problem was a very lengthy string that MySQL won’t accept (it exceeds the buffer size). I’m storing the data in MySQL for scale. At first, I tried increasing the buffer size for MySQL, which didn’t work. I’m not sure if this was because I didn’t set the correct variable (I’m using the MySQL administrator panel for OS X), but now I’m thinking it’s just a ridiculous amount of text: if you check out the bug report you can see the person with the problem posted 100,000 lines of empty stack trace. Thanks buddy! I’m doing a horrible string concatenation thing to eliminate newlines, which is horribly inefficient, but seems to work on small-scale bug reports.
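
A cheaper alternative to repeated concatenation is to collect the pieces in a list and join them once. The sketch below is an illustration only: the function name collapse_newlines is hypothetical, and dropping blank lines entirely is an assumption that goes a little beyond just removing newlines.

```python
def collapse_newlines(chunks):
    """Join text chunks into a single line.

    `chunks` is any iterable of strings, for example the pieces a SAX
    parser hands to characters(). Joining once is linear in the total
    length, unlike repeated `result += piece`.
    """
    text = "".join(chunks)
    # splitlines() handles \n, \r\n and \r; blank lines are dropped
    # (an assumption made for this sketch).
    lines = (line.strip() for line in text.splitlines())
    return " ".join(line for line in lines if line)


print(collapse_newlines(["Stack ", "trace:\n\n\n", "frame 1\r\nframe 2\n"]))
```

This does nothing about the MySQL buffer limit itself; it only makes the newline removal cheap enough to run on very long comments.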

This type of MySQL error won’t print the line number, but fortunately my parser prints the bug ids as it goes, so I could look for that id and retrieve the line number that way. My tool of choice was sed, and I just looked for:

sed -n '/[ ]360318/{=;p;}'  <file_name>, where 360318 is the bug id. Here’s where I got the syntax. The -n option suppresses the default output; then I look for a space followed by the bug id as the regular expression; finally the {=;p;} portion prints the line number and then the matching line. I used a space in the regex because you would be surprised how often that six-digit number occurs in a 3 gig file.
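
For anyone who would rather stay out of sed, roughly the same search can be sketched in Python. The function name find_lines is hypothetical; the file is read in binary mode so that only \n counts as a line break, which should keep the numbering in step with sed and Vim.

```python
def find_lines(path, needle):
    """Yield (line_number, line) for every line that contains `needle`.

    The file is read lazily, one line at a time, so a multi-gigabyte
    file never has to fit in memory.
    """
    needle_bytes = needle.encode("utf-8")
    with open(path, "rb") as handle:
        for number, line in enumerate(handle, start=1):
            if needle_bytes in line:
                text = line.rstrip(b"\n").decode("utf-8", errors="replace")
                yield number, text


# Usage, with the leading space playing the same role as [ ] in the regex:
# for number, line in find_lines(file_name, " 360318"):
#     print(number, line)
```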

I think I could have done this with a complicated sed program: you have to store the bug_id and only operate on the content in that bug, and since sed is not aware of XML elements, it would involve searching for other regexes as needed. I didn’t particularly feel like becoming a sed hacker. And yes, I suppose Emacs can manage this as well, but I haven’t tried it. I actually considered ed, since it’s line-based, but I couldn’t figure out how to get it to go to the line I wanted. I’m thinking one of my programming maxims will be, “If you are considering a solution involving ed, think of another solution.”

Written by Neil

2009 February 5 at 12:08

Posted in Uncategorized


MSR: parsing large XML files

with one comment

This is part of the MSR Challenge series.

The Gnome bugzilla dump is an XML file, supposedly valid, that is approximately 3.5 gigabytes in size. Needless to say, this is an impressive amount of data. There’s a common belief in the community that scale changes everything, and my recent encounters bear this out.

I’ve designed my parser to extract data from the file using SAX, which is a stream-based XML parser. In other words, it doesn’t try to load the whole document into memory the way DOM or ElementTree do. I tested my SAX approach on some sample data I extracted, and got that working with no problem. The next step was to test it on the real thing, which is (I think) an overnight run.

Some of the lessons I’ve since learned:

  • Run long-running processes from the command-line, not Eclipse. That way you can log in from home to check the status, kill the process, etc.
  • Generate progress information, e.g., bugs processed, so one can tell that something is happening.
  • Write data to a file immediately, so that you can check the output remotely.
  • Code inspection is your friend. Errors that didn’t exist in my example (e.g., malformed dates) do exist in the data file.
  • Character handling is hard. I got some malformed data (line 50 million, usefully), which killed the parser. Now I have to figure out how to work around this.
  • The time command is your friend. It gives a tiny bit of profiling data, which is often enough.
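
The progress and write-immediately lessons above can be sketched together in Python. This assumes the standard-library xml.sax module and an element named bug, which is a guess about the dump’s schema; ProgressHandler and its parameters are hypothetical names.

```python
import sys
import xml.sax


class ProgressHandler(xml.sax.ContentHandler):
    """Count <bug> elements, write a line per bug, and report progress."""

    def __init__(self, out, log=sys.stderr, every=10000):
        super().__init__()
        self.out = out        # file-like object receiving the results
        self.log = log        # where progress messages go
        self.every = every    # report after this many bugs
        self.count = 0

    def endElement(self, name):
        if name != "bug":     # "bug" is an assumed element name
            return
        self.count += 1
        self.out.write(f"bug {self.count}\n")
        self.out.flush()      # visible to `tail -f` right away
        if self.count % self.every == 0:
            print(f"{self.count} bugs processed", file=self.log, flush=True)


# Usage sketch:
# with open("out.txt", "w") as out:
#     xml.sax.parse(file_name, ProgressHandler(out))
```

Flushing after every record costs some speed, but it means a remote login can see exactly how far the run has got.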

My recent difficulty was in handling ill-formed character data. In XML, certain characters are invalid, and encountering one of them causes the parser to halt. Apparently one cannot simply route around it either. So, I’ve had to extend my codebase to filter out these annoying characters first, then parse the XML.

My first approach was a regular expression mentioned by Sam Ruby, which seems to work fine on small files. It died on my large dataset, mainly due to my ignorance of UTF-8 and the BOM, so I moved on to a better method: the tr Unix tool, as follows (in.xml stands in for the name of the original dump):

tr -d '[:cntrl:]' < in.xml > out.xml

On my machine (3 gigs RAM, P4 2.4), this took 5 minutes to process the file. Nice.
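
One caveat: the [:cntrl:] class also covers tab, newline and carriage return, so tr removes the line breaks too. A gentler variant can be sketched in Python. It is a different technique from the tr command: it deletes only the control bytes that XML 1.0 forbids and keeps those three. The names are hypothetical, and the byte-level approach assumes UTF-8 or another ASCII-compatible encoding (it would corrupt UTF-16).

```python
# Control bytes not allowed in an XML 1.0 document: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
ILLEGAL = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))


def strip_control_bytes(src, dst, chunk_size=1 << 20):
    """Copy binary stream `src` to `dst`, dropping illegal control bytes.

    Works chunk by chunk, so memory use stays flat regardless of file
    size. Returns the number of bytes removed.
    """
    removed = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        cleaned = chunk.translate(None, ILLEGAL)  # None = no remapping
        removed += len(chunk) - len(cleaned)
        dst.write(cleaned)
    return removed


# Usage sketch:
# with open("in.xml", "rb") as src, open("out.xml", "wb") as dst:
#     print(strip_control_bytes(src, dst), "bytes removed")
```

Keeping the newlines means the line numbers in later parser errors still point somewhere useful.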

If you have trouble with a file’s encoding or characters, I recommend just looking at it with a hex editor like GHex2, since this will show all the strangeness; and the useful UTF-8 decoder page to see what these hex code points mean.

Written by Neil

2008 December 10 at 11:36

Posted in Uncategorized


MSR Challenge: large data files


This is part of the MSR Challenge series.

This week’s challenge was to tackle the Bugzilla logs. The data I have is a complete dump of the Gnome bugzilla database, which is enormous — 3.5 gigs. Of course, I don’t need all of this data, just the data that relates to particular projects (e.g., Deskbar).

I think there are two approaches. One is bottom-up, so to speak, where I focus on the details of a specific bug. For this study, I need to extract the date the bug was filed and the subject and comment for that posting. Bugzilla allows follow-up comments, so I am going to extract those as separate GnomeDataObjects as well. For now, I am discarding metadata such as the bug’s status and whether it is an error or enhancement (other researchers are working there). I’m just focusing on the bug comments/description and the date of that comment. If the comment is a system event, such as “BUG XXX is a duplicate of YYY” then I ignore those events.

So I have some hacky code that reads in my sample XML file and parses the relevant nodes to extract that data. Hacky, because I’m too lazy to write a search for the node I want, and I’m just using list positions (e.g., entry.childNodes[4].childNodes[0].data where entry is a bug).
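
A less brittle version of that lookup searches by tag name instead of list position. The sketch below assumes Python’s xml.dom.minidom; the tag names bug, creation_ts and short_desc are guesses at the schema, and first_text is a hypothetical helper.

```python
from xml.dom import minidom


def first_text(node, tag):
    """Return the text of the first <tag> descendant of `node`, or None."""
    matches = node.getElementsByTagName(tag)
    if not matches:
        return None
    child = matches[0].firstChild
    if child is None or child.nodeType != child.TEXT_NODE:
        return None
    return child.data.strip()


# Hypothetical sample; the element names are assumptions.
sample = (
    "<bugs><bug>"
    "<creation_ts>2002-05-28 16:29:00</creation_ts>"
    "<short_desc>scaffold</short_desc>"
    "</bug></bugs>"
)
doc = minidom.parseString(sample)
for entry in doc.getElementsByTagName("bug"):
    print(first_text(entry, "creation_ts"), first_text(entry, "short_desc"))
```

Like any DOM approach, this still loads the whole document, so it only suits the small sample file.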

As I was doing this, I realized that this isn’t going to scale too well on a 3 gigabyte file. Rather than do a time-consuming iteration over all the bugs, throwing out those that aren’t in the appropriate project, I need to do some sort of stream processing … which is where SAX comes in.

My other consideration was whether I should just munge this all into a huge database, but I can’t find a simple enough way to do that online. The third alternative would be to write XSLT to split the file or reject the unwanted elements, but I don’t know XSLT at all. And from all accounts, I’m better off that way :)

SAX is like an impoverished event-processor: it calls specific methods (e.g., startElement) when the corresponding event occurs (e.g., the parser encounters the start of an XML element). This is cool, in that it’s like just-in-time processing, and avoids a huge in-memory data structure; on the other hand, your code (at least MY code) is full of if statements and switches, which is crapulent. But it looks to be working, for now.

Here’s some sample data as an illustration:


    2002-05-28 16:29:00

    scaffold



       xxxx@hotmail.com

        2002-05-28 15:31:00

        Package: anjuta2
Severity: normal
Version: 0.11.3
Synopsis: The Matrix
Bugzilla-Product: anjuta2
Bugzilla-Component: shell

Description:
I was in the site www.thematrix.com and the browser crached !

Thkz

Debugging Information:
[some stack trace]

       xxxx@hmc.edu

        2002-07-29 14:03:00

*** This bug has been marked as a duplicate of 59361 **
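
To make the event-driven style concrete, here is a minimal Python sketch of a SAX handler that keeps only the comments for one product. It assumes the standard-library xml.sax module; the element names (bug, product, long_desc, bug_when, thetext) are guesses at the dump’s schema, and CommentHandler is a hypothetical name, not the code from this series.

```python
import xml.sax


class CommentHandler(xml.sax.ContentHandler):
    """Collect (date, text) pairs for comments on bugs of one product."""

    def __init__(self, product):
        super().__init__()
        self.wanted = product
        self.comments = []   # results for the wanted product
        self._text = []      # character data of the current element
        self._bug = None     # the bug currently being read

    def startElement(self, name, attrs):
        self._text = []
        if name == "bug":
            self._bug = {"product": None, "comments": []}
        elif name == "long_desc" and self._bug is not None:
            self._bug["comments"].append({})

    def characters(self, content):
        self._text.append(content)   # text may arrive in several pieces

    def endElement(self, name):
        value = "".join(self._text).strip()
        self._text = []
        if self._bug is None:
            return
        if name == "product":
            self._bug["product"] = value
        elif name in ("bug_when", "thetext") and self._bug["comments"]:
            self._bug["comments"][-1][name] = value
        elif name == "bug":
            # Only one bug is held in memory at a time.
            if self._bug["product"] == self.wanted:
                self.comments.extend(
                    (c.get("bug_when"), c.get("thetext"))
                    for c in self._bug["comments"]
                )
            self._bug = None
```

The if chain is still there, but it stays short because each branch only stores a value. Skipping system events such as duplicate markers would be one more check on the comment text before it is kept.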

Written by Neil

2008 November 14 at 11:03

Posted in Uncategorized

