Page 1 of 1

Optimizing SQL

Posted: Mon Aug 18, 2014 9:40 am
by johanntagle
Hi. So far I like what I see in OpenEMM, and it's much better than a certain commercial product we bought last year. However, every now and then I see SQL's that badly need revising. Since OpenEMM only support MySQL, why not optimize for MySQL? Here's one - when you do a search for a recipient using the email address, this query is used (explain output below):

Code: Select all

mysql> explain  SELECT * FROM customer_1_tbl cust WHERE (( lower(cust.EMAIL) = lower('user@email.com')));
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows     | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
|  1 | SIMPLE      | cust  | ALL  | NULL          | NULL | NULL    | NULL | 28383190 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
You can see that it does a full table scan and in my very large table this is a killer. In MySQL string comparisons for varchar columns are actually case insensitive, so you can actually just do:

Code: Select all

mysql> explain  SELECT * FROM customer_1_tbl cust WHERE cust.EMAIL = 'user@email.com';
+----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------+
| id | select_type | table | type | possible_keys | key   | key_len | ref   | rows | Extra                 |
+----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------+
|  1 | SIMPLE      | cust  | ref  | email         | email | 303     | const |    1 | Using index condition |
+----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------+
You will see that removing the lower() functions enables MySQL to use the index on the email column.

I know very little Java so I'm not sure if I can do my own patches, but unless you point me to a better thread/forum to post to I intend to update this thread for every inefficient SQL I see (before I also saw what seems to be an inefficient way to search by date range, but I failed to note it down so will need to catch it again).

Thanks.

Re: Optimizing SQL

Posted: Mon Aug 18, 2014 7:08 pm
by maschoff
Thanks for your feedback! We are in the process of finalizing OpenEMM 2014 and will include your feedback here to improve the SQL queries in the Java code.

Re: Optimizing SQL

Posted: Tue Aug 19, 2014 3:33 am
by johanntagle
Okay maschoff I will. If there's any specific area you are aware the query is slow do let me know and I will look at that first. I intend to use OpenEMM in the long term so I'm very game to contribute.

Re: Optimizing SQL

Posted: Tue Aug 19, 2014 4:50 am
by johanntagle
So I found two notable entries in my slow query log. The first:

Code: Select all

  sql_text: DELETE cust, bind FROM customer_1_tbl cust, customer_1_binding_tbl bind WHERE cust.customer_id=bind.customer_id AND bind.user_status=5 AND bind.change_date < date_sub(curdate(), interval 30 day)
query_time: 00:01:19

Code: Select all

mysql> explain DELETE cust, bind FROM customer_1_tbl cust, customer_1_binding_tbl bind WHERE cust.customer_id=bind.customer_id AND bind.user_status=5 AND bind.change_date < date_sub(curdate(), interval 30 day);
+----+-------------+-------+--------+----------------------------+---------+---------+--------------------------+----------+-------------+
| id | select_type | table | type   | possible_keys              | key     | key_len | ref                      | rows     | Extra       |
+----+-------------+-------+--------+----------------------------+---------+---------+--------------------------+----------+-------------+
|  1 | SIMPLE      | bind  | ALL    | cust_1_bind_un,customer_id | NULL    | NULL    | NULL                     | 27766623 | Using where |
|  1 | SIMPLE      | cust  | eq_ref | PRIMARY                    | PRIMARY | 4       | openemm.bind.customer_id |        1 | Using where |
+----+-------------+-------+--------+----------------------------+---------+---------+--------------------------+----------+-------------+
I'm guessing this is run at least once a day, or maybe hourly? I decided to add an index in customer_1_binding_tbl. See the improvement:

Code: Select all

mysql> alter table customer_1_binding_tbl add index (user_status, change_date, customer_id);

mysql> explain  DELETE cust, bind FROM customer_1_tbl cust, customer_1_binding_tbl bind WHERE cust.customer_id=bind.customer_id AND bind.user_status=5 AND bind.change_date < date_sub(curdate(), interval 30 day);
+----+-------------+-------+--------+----------------------------------------+-------------+---------+--------------------------+------+-------------+
| id | select_type | table | type   | possible_keys                          | key         | key_len | ref                      | rows | Extra       |
+----+-------------+-------+--------+----------------------------------------+-------------+---------+--------------------------+------+-------------+
|  1 | SIMPLE      | bind  | range  | cust_1_bind_un,customer_id,user_status | user_status | 9       | NULL                     |    1 | Using where |
|  1 | SIMPLE      | cust  | eq_ref | PRIMARY                                | PRIMARY     | 4       | openemm.bind.customer_id |    1 | Using where |
+----+-------------+-------+--------+----------------------------------------+-------------+---------+--------------------------+------+-------------+
2 rows in set (0.00 sec)

mysql> DELETE cust, bind FROM customer_1_tbl cust, customer_1_binding_tbl bind WHERE cust.customer_id=bind.customer_id AND bind.user_status=5 AND bind.change_date < date_sub(curdate(), interval 30 day);
Query OK, 0 rows affected (0.03 sec)

The second happens when I click on the "Recipients" tab:

Code: Select all

  sql_text: SELECT COUNT(*) FROM customer_1_tbl cust
query_time: 00:01:45
I decided to choose a specific mailing list that has around 7M recipients and saw:

Code: Select all

  sql_text: SELECT COUNT(*) FROM customer_1_tbl cust WHERE (cust.customer_id in (SELECT customer_id FROM customer_1_binding_tbl bind WHERE (bind.mailinglist_id = 1)))
query_time: 00:00:53
This might be too late for the 2014 release because it will be a major change but I suggest you reevaluate if you really need an exact count every time the page is listed. In my opinion, making the user wait for a minute or so just to show that number on the page is not worth it. I would instead have a summary/statistics table that has the count as of a certain time (maybe updated daily) and display that number with a disclaimer and then maybe give the user the ability to refresh the summary counts when he needs an accurate number.

More to come :)

Re: Optimizing SQL

Posted: Wed Sep 24, 2014 6:57 am
by johanntagle
Me again =)

It would be nice if you can allow a date-based target group definition that doesn't make use of date_format. I understand the usefulness of matching a MMDD-formated date for birthdays and anniversaries but for use-cases such as "3 days ago" or newer than a certain date, it's more efficient to do a straightforward date comparison, especially if the date column is indexed. See the following:

For the original target group definition date_format(cust.INDEXED_DATE_COLUMN, '%Y%m%d') >= '20131101')

Code: Select all

mysql> explain SELECT cust.customer_id customer_id, bind.user_type user_type, bind.mediatype mediatype FROM customer_1_tbl cust, customer_1_binding_tbl bind WHERE cust.customer_id = bind.customer_id AND (bind.user_status = 1 AND bind.mailinglist_id = 1 AND  date_format(cust.INDEXED_DATE_COLUMN, '%Y%m%d') >= '20131101');
+----+-------------+-------+--------+----------------------------------------+-------------+---------+--------------------------+----------+-----------------------+
| id | select_type | table | type   | possible_keys                          | key         | key_len | ref                      | rows     | Extra                 |
+----+-------------+-------+--------+----------------------------------------+-------------+---------+--------------------------+----------+-----------------------+
|  1 | SIMPLE      | bind  | ref    | cust_1_bind_un,customer_id,user_status | user_status | 5       | const                    | 15029510 | Using index condition |
|  1 | SIMPLE      | cust  | eq_ref | PRIMARY                                | PRIMARY     | 4       | openemm.bind.customer_id |        1 | Using where           |
+----+-------------+-------+--------+----------------------------------------+-------------+---------+--------------------------+----------+-----------------------+
You can see the use of date_format prevents MySQL from using the index on INDEXED_DATE_COLUMN. Now if you change the target_sql to cust.INDEXED_DATE_COLUMN >= '2013-11-01'

Code: Select all

mysql> explain SELECT cust.customer_id customer_id, bind.user_type user_type, bind.mediatype mediatype FROM customer_1_tbl cust, customer_1_binding_tbl bind WHERE cust.customer_id = bind.customer_id AND (bind.user_status = 1 AND bind.mailinglist_id = 1 AND  cust.INDEXED_DATE_COLUMN >= '2013-11-01');
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+--------------------------------+---------+--------------------------+
| id | select_type | table | type   | possible_keys                          | key                 | key_len | ref                            | rows    | Extra                    |
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+--------------------------------+---------+--------------------------+
|  1 | SIMPLE      | cust  | range  | PRIMARY,indexed_date_column            | indexed_date_column | 4       | NULL                           | 1818952 | Using where; Using index |
|  1 | SIMPLE      | bind  | eq_ref | cust_1_bind_un,customer_id,user_status | cust_1_bind_un      | 8       | openemm.cust.customer_id,const |       1 | Using where              |
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+--------------------------------+---------+--------------------------+
You can see it used the index on that column, greatly reducing the number of rows read from 15M+ to 1.8M+!

Re: Optimizing SQL

Posted: Mon Oct 20, 2014 12:26 pm
by maschoff
Thank you for your feedback!