Semantic Werks

Thoughts on people, machines and systems.

MSR Challenge: large files revisited


This is part of the MSR Challenge series.

Ignore my earlier advice on handling large XML files. My new favorite tool is Vim. Although it takes a few minutes, Vim can easily jump to the correct line so you can fix problems. So if your XML won’t parse, just figure out the line number (it should be printed with the SAXException), then open Vim with vim +<line_num> <file_name>. It thinks for a while, but opens at the correct line without trouble. Then you can delete the offending characters (highlighted for me as ^S) and save the file (wait a few more minutes).
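If the only problem is stray control characters, the cleanup can also be scripted instead of done by hand. What follows is a minimal Python sketch, not the approach used above: it streams the file in chunks and drops the bytes that XML 1.0 does not allow (the C0 control characters other than tab, LF and CR, which includes ^S, byte 0x13). The function name, the chunk size, and the assumption that the file uses an ASCII-compatible encoding such as UTF-8 are illustrative choices, not something taken from the original data.

```python
# Bytes XML 1.0 forbids: C0 controls except tab (0x09), LF (0x0A) and CR (0x0D).
# Assumption: the file is in an ASCII-compatible encoding such as UTF-8, where
# these byte values never occur inside a multi-byte character.
ILLEGAL_BYTES = bytes(range(0x00, 0x09)) + b"\x0b\x0c" + bytes(range(0x0E, 0x20))


def strip_control_bytes(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path without the illegal bytes; return how many were dropped."""
    removed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # bytes.translate with a None table only deletes the listed bytes.
            cleaned = chunk.translate(None, ILLEGAL_BYTES)
            removed += len(chunk) - len(cleaned)
            dst.write(cleaned)
    return removed
```

Because it works on fixed-size chunks, memory use stays flat regardless of file size. The original file is left untouched, so the result can be checked before replacing anything.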

My first problem was the invalid characters (control characters); the second problem was a very lengthy string that MySQL won’t accept (it exceeds the buffer size). I’m storing the data in MySQL for scale. At first, I tried increasing the buffer size for MySQL, which didn’t work. I’m not sure if this was because I didn’t set the correct variable (I’m using the MySQL administrator panel for OS X), but now I’m thinking it’s just a ridiculous amount of text: if you check out the bug report, you can see that the person with the problem posted 100,000 lines of empty stack trace. Thanks, buddy! I’m doing a horrible string concatenation thing to eliminate newlines, which is horribly inefficient, but it seems to work on small-scale bug reports.
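A cheaper way to eliminate newlines than repeated concatenation is to collect the text pieces in a list and join them once. Here is a minimal Python sketch of that idea. The function name and the choice to replace each run of newlines with a single space are assumptions for illustration, and this is not the parser’s actual code.

```python
import re

# One or more consecutive CR/LF characters.
NEWLINE_RUN = re.compile(r"[\r\n]+")


def flatten_text(chunks):
    """Join the text chunks once, then replace each run of newlines with one space."""
    joined = "".join(chunks)  # a single join avoids rebuilding the string for every piece
    return NEWLINE_RUN.sub(" ", joined).strip()
```

Joining once keeps the work roughly linear in the size of the text, which matters when a single comment contains 100,000 lines.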

This type of MySQL error won’t print the line number, but fortunately my parser prints the bug ids as it goes, so I could look for that id and retrieve the line number that way. My tool of choice was sed, and I just looked for:

sed -n '/[ ]360318/{=;p;}' <file_name>, where 360318 is the bug id. Here’s where I got the syntax. The -n option suppresses sed’s default output; the regular expression matches a space followed by the bug id; finally, the {=;p;} portion prints the line number and then the matching line. I used a space in the regex because you would be surprised how often that six-digit number occurs in a 3 gig file.
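The same lookup can be sketched in Python for anyone who would rather avoid sed syntax. This is only an illustration under my own assumptions: the function name is hypothetical, and the file is read as bytes so that stray control characters cannot cause decoding errors.

```python
def find_lines(path, needle):
    """Yield (line_number, line) for every line that contains needle.

    Roughly what sed -n '/needle/{=;p;}' prints; needle is a plain byte
    string here, not a regular expression.
    """
    with open(path, "rb") as fh:
        # Iterating over the file reads one line at a time, so a multi-gigabyte
        # file is never loaded whole.
        for line_no, line in enumerate(fh, start=1):
            if needle in line:
                yield line_no, line.rstrip(b"\r\n")
```

Calling find_lines(<file_name>, b" 360318") and looping over the result gives each matching line number with its line. The leading space in the needle does the same job as the [ ] in the sed expression.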

I think I could have done this with a complicated sed program: you have to store the bug_id and operate only on the content in that bug, and since sed is not aware of XML elements, it would involve searching for other regexes as needed. I didn’t particularly feel like becoming a sed hacker. And yes, I suppose Emacs can manage this as well, but I haven’t tried it. I actually considered ed, since it’s line-based, but I couldn’t figure out how to get it to go to the line I wanted. I’m thinking one of my programming maxims will be, “If you are considering a solution involving ed, think of another solution.”

Written by Neil

2009 February 5 at 12:08


2 Responses


  1. What if a bad character occurs in a file with no carriage returns or line feeds (common in xml messaging)?

    Will vim show a 3GB file in a tree?

    Bill Conniff

    2010 February 28 at 16:17

    • The files I had did not have CR/LF characters (which are just whitespace elements in XML). And the question is not “will VIM show it” but rather “how long do you want to wait”. I certainly wouldn’t want to edit that size file using XML mode.

      Neil

      2010 March 1 at 09:59

