Posts Tagged ‘mysql’
MSR Challenge: large files revisited
This is part of the MSR Challenge series.
Ignore my earlier advice with respect to handling large XML files. My new favorite tool is Vim. Although it takes a few minutes to do it, Vim can easily go to the correct line to fix problems. So if your XML is non-validating, just figure out the line number (should be printed in the SaxException), then open Vim with vim +<line_num> <file_name>. It thinks for a while, but opens to the correct line without trouble. Then you can delete the offending characters (highlighted for me as ^S) and save the file (wait a few more minutes).
My first problem was the non-valid characters (control characters); the second problem was a very lengthy string that MySQL won’t accept (exceeds the buffer size). I’m storing the data in MySql for scale. At first, I tried increasing the buffer size for MySQL, which didn’t work. I’m not sure if this was because I didn’t set the correct variable (I’m using the MySQL administrator panel for OSX), but now I’m thinking it’s just a ridiculous amount of text: if you check out the bug report you can see the person with the problem posted 100,000 lines of empty stack trace. Thanks buddy! I’m doing a horrible string concatenation thing to eliminate newlines, which is horribly inefficient, but seems to work on small-scale bug reports.
This type of MySQL error won’t print the line number, but fortunately my parser prints the bug ids as it goes, so I could look for that id and retrieve the line number that way. My tool of choice was sed, and I just looked for:
sed -n '/[ ]360318/{=;p;}' <file_name>, where 360318 is the bug id. Here’s where I got the syntax. The -n command suppresses output; then I look for a space followed by the bug id as the regular expression; finally the {=;p:} portion prints the match and the line number. I used a space in the regex because you would be surprised how often that six-digit number occurs in a 3 gig file.
I think I could have done this with a complicated sed program — you have to store the bug_id, and only operate on the content in that bug, and since sed is not aware of xml elements, it would involve searching for other regexs as needed. I didn’t feel like becoming a sed hacker, particularly. And yes, I suppose Emacs can manage this as well, but I haven’t tried it. I actually considered ed, since it’s line-based, but I couldn’t figure out how to get it to go the line I wanted. I’m thinking one of my programming maxims will be, “If you are considering a solution involving ed, think of another solution.”
MSR Challenge: data cleaning
This is part of the MSR Challenge series.
I’ve been spending the last few days writing my parsing code for the Gnome data sets. Friday and Sunday have been dedicated to parsing mailing list data. I got a dump of a sample mailing list, the Deskbar Applet list. It’s in MySQL format, so I loaded it into Mysql without any major trouble. I then imported the python MySQLdb (case matters!) library, and got to work.
I open a connection to the db with the ‘deskbar’ table, then run a query: "SELECT message_body, original_date, subject FROM messages". I then do cursor.fetchall() to retrieve the rows (1091 rows). Since I used a MySQLdb.cursors.DictCursor, the resultset is a dictionary with the column names as the keys. Now it’s a simple matter of passing those things into my GnomeDataObject schema (event, date, RSN) as appropriate. Again, simple, right?
Well, no. I usually try all this from the REPL environment in Python, and occasionally I would see this strange result for the message body: '[<email.Message.Message instance at 0xb76e324c>, <email.Message.Message instance at 0xb769a90c>]'. What puzzled me was that this is actually Python, using the email system library. At first I thought the mysql connector was doing type conversion, but sadly, this is really just a string.
So the message that is supposed to be like this is output into the db file as some sort of weird serialization result. From what I can tell so far, that’s because the person who wrote the dump script against the list mbox files made a mistake when parsing some special type of message, threaded or something like that.
From inspection, this problem affects around 25% of the messages in the list. This is definitely a serious problem for validity, although if it is random (and not, for example, just occuring to one mailing list contributor, like one who is really interested in some NFR), then no biggy.
My next step will be to try and reproduce this with another mailing list dump file. If it’s still a problem, I’ll need to contact the person who produced the data or else just extract the data from the mbox files myself.