Anton Staykov's Blog

Thursday, December 20, 2007

SQL, Joins, Unions and more .. continued II

Last night I decided to let MySQL think about that query ... It seemed to have completed calculations in the morning. Not surprisingly it has returned the very same data like did the MS SQL - The multiplication of the results. But it has taken 6604156 milliseconds (~6604 seconds) to obtain the result. The same result that MS SQL EXPRESS returned in 274709 milliseconds (~275 seconds).

No further comments.
Case closed.

Wednesday, December 19, 2007

SQL, Joins, Unions and more .. continued

Hi again.

Today I made a test - transferred the data into MS SQL Server 2005 Express SP 2.0.

Guess What ? :)

The situation is almost the same ! Wondering why the result was messed, the execution time was arround '274709' - I guess this is the "ms" - "duration" column in the MS SQL Server Profiler. Which is arround 5 minutes.
The mess I am talking about is the following:
The situation is that in of the tables I have about 120k records, which should be displayed, on the other one I have about 5k records. Insted of showing "120000" for "di_count", and "5000" for "ei_count", in the result I got '594163480' (which is the number of rows in one of the left joined tables multiplied by the number of rows in the other one) for each of them - the same figure for the two columns. What is that ?

I just want to make a select from one table, than join second table, then join third table to the FIRST one, then join the last table AGAIN to THE FIRST one. Why is it joining the result alltogether?

Anyway .. using UNION, VIEW and work with the VIEW solves the problem, but I was just wondering why this happens ...

I j ust can't help showing you these two screens:

For the MSSQL Profiler:

And for the MySQL (note: the execution is still not finished!):

I have no comment!

Tuesday, December 18, 2007

SQL, Joins, Unions and more

Ok, this is will be about MySQL once again.

I had that problem. Let's assume we have 3 tables:
1. smetki
2. di_invoices
3. dk_invoices
4. ei_invoices
Each of the XX_invoices has a "foreign_key" to PK of "smetki". I am using MyISAM.
If I perform the following 3 queries separately:
---------------------------------------------------------
select
s.smetka_id, s.created_on,
count(di.invoice_id) as di_count
from
smetki as s
left outer join di_invoices as di on (s.smetka_id = di.smetka_id)
where
s.created_on between '2007-12-10' and '2007-12-15'
group by
s.smetka_id
-----------------------------------------------------------
select
s.smetka_id, s.created_on,
count(dk.invoice_id) as dk_count
from
smetki as s
left outer join dk_invoices as dk on (s.smetka_id = dk.smetka_id)
where
s.created_on between '2007-12-10' and '2007-12-15'
group by
s.smetka_id
------------------------------------------------------------
select
s.smetka_id, s.created_on,
count(ei.invoice_id) as ei_count
from
smetki as s
left outer join ei_invoices as ei on (s.smetka_id = ei.smetki_id)
where
s.created_on between '2007-12-10' and '2007-12-15'
group by
s.smetka_id
------------------------------------------------------------
I have no problem, I have proper indexes and I got the results for less then 200 ms from each query.
But I need the 3 results into 1 single resultset so I decided to make a join like that:
select
s.smetka_id, s.created_on,
count(di.invoice_id) as di_count,
count(dk.invoice_id) as dk_count,
count(ei.invoice_id) as ei_count
from
smetki as s
left outer join di_invoices as di on (s.smetka_id = di.smetka_id)
left outer join dk_cor_invoices as dk on (s.smetka_id = dk.smetka_id)
left outer join ei_invoices as ei on (s.smetka_id = ei.smetki_id)
where
s.created_on between '2007-12-10' and '2007-12-15'
group by
s.smetka_id
Now I have query which executes more than 15 minutes !! I couldn't wait to the end and killed it. There must be something wrong somewhere?
I sould notice that for the given period there are only 6 rows in the "smetki" table, and different number of rows in each xx_invoices table that correspond to the whole resultset from "smetki". I have to mention that there is one intersection between two of the XX_invoices tables where there are about 120k rows for one "smetka_id", and for the same "smetka_id" there are about 5k rows in the other.
Also - there are no other queries running by the time of measuring the performance.

What could be the solution. A friend of mines helped me. Make a UNION he said. So I did it like that:

(select
s.smetka_id, s.created_on,
count(di.invoice_id) as di_count, 0 as dk_count, 0 as ei_count
from
smetki as s
left outer join di_invoices as di on (s.smetka_id = di.smetka_id)
where
s.created_on between '2007-12-10' and '2007-12-15'
group by
s.smetka_id)
UNION
(select
s.smetka_id, s.created_on,
0 as di_count, count(dk.invoice_id) as dk_count, 0 as ei_count
from
smetki as s
left outer join dk_invoices as dk on (s.smetka_id = dk.smetka_id)
where
s.created_on between '2007-12-10' and '2007-12-15'
group by
s.smetka_id)
UNION
(select
s.smetka_id, s.created_on, 0 as di_count, 0 as ei_count,
count(ei.invoice_id) as ei_count
from
smetki as s
left outer join ei_invoices as ei on (s.smetka_id = ei.smetki_id)
where
s.created_on between '2007-12-10' and '2007-12-15'
group by
s.smetka_id )

OK, but now I have 3 rows for each single "smetka_id" - one row for the results from each table. And then we decided to make a VIEW over that UNION.

Gues what. Now I have the simple query:

select smetka_id, sum(di_count), sum(dk_count), sum(ei_count) from v_test group by smetka_id

Which returns me the desired result for less than a second !

I hope this would help your daily work with MySQL and SQL in general.

Wednesday, October 31, 2007

An Inconvenient Truth

Lot friends of mines were telling me that I should watch that movie (also in wikipedia). I recently had the chance! And it really worth!

It's a documentary film about climate change, specifically global warming, presented by former United States Vice President Al Gore. There are lots of numbers and facts that can make you reconsider some aspects of your living.

So I urge you to watch this movie!

MySQL Tips & Tricks /MAX_ROWS & AVG_ROW_LENGTH/

Hello again ;)
I ran into a tricky situation last couple of days. I have some big tables with varchar fields where I have to JOIN ON varchar fields. The problem does not concern only the JOIN - it is a general problem using varchar / text fields used to be searched one way or another.

So - the situation: We have one table with about 1M rows (relatively small amount data for each row ~90 bytes) on one hand. And another table with records to be matched - about 100k rows (relatively medium amount of data for each row ~ 500 bytes). We do a SELECT which JOINs the two tables based on VARCHAR columns. Do not ask "why?" - This is the business logic; it cannot be otherwise (at least in the beginning). If we index the columns we will got the results fast! What is fast? Less than 50 ms to get around 200 joined rows. Well, OK, you got me - it depends strongly on the hardware configuration and MySQL settings. Let's assume this is our limit. I was very happy to discover that it works that fast. The speed is critical, because we will need to join around 100k rows.

Where is the problem then? Well, it happened that the first table is populated with another several hundred thousand records and it became with about 1.9M rows. And? And, it appeared that the MySQL should copy the temp table on the hard drive. As you can guess - the result is: 204 joined rows (not 204k, but only two hundred and four) for ..... 417 seconds. What about 100k rows? Killing me.

What can we do about that? Well, I know (the business logic predicts) that the first table - that with about 2M rows would never have more than 4 million rows. I also noticed that partitioning is only partially supported in the beta (RC yet) of MySQL 5.1.x, but the recommended for production environment is 5.0.x which does not support partitioning. So, I discovered the great options MAX_ROWS and AVG_ROW_LENGTH. Which does what - it says to the MySQL engine, that this table would never have more than MAX_ROWS tows, and the average row length would be AVG_ROW_LENGTH. You can find the average row length by executing the command SHOW TABLE STATUS FROM [DB_NAME] LIKE 'TBL_NAME'. So, if you know, if your business logic predicts that you will have a large table, BUT this table will have relatively known content (like it will have not more than 4M rows), and you know the average row length, you can set the MAX_ROWS and AVG_ROW_LENGTH for that table. The AVG_ROW_LENGTH is required if you have blob/text/varchar fields.

That will cause the MySQL engine to optimize the space required for that table, also optimizing the myisam_data_pointer_size, which sets the number of bytes used for internal row pointers. Yes - it works only with MyISAM tables. But you will be surprised. After I set the MAX_ROWS and AVG_ROW_LENGTH on that table with the 2M records - I got a surprising results - 90k joined rows for ..... 47 seconds !!

I think I will use these options often in the future, what about you?