Wednesday, October 31, 2007

An Inconvenient Truth

Lot friends of mines were telling me that I should watch that movie (also in wikipedia). I recently had the chance! And it really worth!

It's a documentary film about climate change, specifically global warming, presented by former United States Vice President Al Gore. There are lots of numbers and facts that can make you reconsider some aspects of your living.

So I urge you to watch this movie!


Hello again ;)
I ran into a tricky situation last couple of days. I have some big tables with varchar fields where I have to JOIN ON varchar fields. The problem does not concern only the JOIN - it is a general problem using varchar / text fields used to be searched one way or another.

So - the situation: We have one table with about 1M rows (relatively small amount data for each row ~90 bytes) on one hand. And another table with records to be matched - about 100k rows (relatively medium amount of data for each row ~ 500 bytes). We do a SELECT which JOINs the two tables based on VARCHAR columns. Do not ask "why?" - This is the business logic; it cannot be otherwise (at least in the beginning). If we index the columns we will got the results fast! What is fast? Less than 50 ms to get around 200 joined rows. Well, OK, you got me - it depends strongly on the hardware configuration and MySQL settings. Let's assume this is our limit. I was very happy to discover that it works that fast. The speed is critical, because we will need to join around 100k rows.

Where is the problem then? Well, it happened that the first table is populated with another several hundred thousand records and it became with about 1.9M rows. And? And, it appeared that the MySQL should copy the temp table on the hard drive. As you can guess - the result is: 204 joined rows (not 204k, but only two hundred and four) for ..... 417 seconds. What about 100k rows? Killing me.

What can we do about that? Well, I know (the business logic predicts) that the first table - that with about 2M rows would never have more than 4 million rows. I also noticed that partitioning is only partially supported in the beta (RC yet) of MySQL 5.1.x, but the recommended for production environment is 5.0.x which does not support partitioning. So, I discovered the great options MAX_ROWS and AVG_ROW_LENGTH. Which does what - it says to the MySQL engine, that this table would never have more than MAX_ROWS tows, and the average row length would be AVG_ROW_LENGTH. You can find the average row length by executing the command SHOW TABLE STATUS FROM [DB_NAME] LIKE 'TBL_NAME'. So, if you know, if your business logic predicts that you will have a large table, BUT this table will have relatively known content (like it will have not more than 4M rows), and you know the average row length, you can set the MAX_ROWS and AVG_ROW_LENGTH for that table. The AVG_ROW_LENGTH is required if you have blob/text/varchar fields.

That will cause the MySQL engine to optimize the space required for that table, also optimizing the myisam_data_pointer_size, which sets the number of bytes used for internal row pointers. Yes - it works only with MyISAM tables. But you will be surprised. After I set the MAX_ROWS and AVG_ROW_LENGTH on that table with the 2M records - I got a surprising results - 90k joined rows for ..... 47 seconds !!

I think I will use these options often in the future, what about you?