Optimizing SQL

Use this forum for all questions related to the source code of OpenEMM

Moderator: moderator

johanntagle
Posts: 15
Joined: Mon Jul 21, 2014 5:58 am

Optimizing SQL

Post by johanntagle »

Hi. So far I like what I see in OpenEMM, and it's much better than a certain commercial product we bought last year. However, every now and then I see SQL's that badly need revising. Since OpenEMM only support MySQL, why not optimize for MySQL? Here's one - when you do a search for a recipient using the email address, this query is used (explain output below):

Code: Select all

mysql> explain  SELECT * FROM customer_1_tbl cust WHERE (( lower(cust.EMAIL) = lower('user@email.com')));
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows     | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
|  1 | SIMPLE      | cust  | ALL  | NULL          | NULL | NULL    | NULL | 28383190 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+----------+-------------+
You can see that it does a full table scan and in my very large table this is a killer. In MySQL string comparisons for varchar columns are actually case insensitive, so you can actually just do:

Code: Select all

mysql> explain  SELECT * FROM customer_1_tbl cust WHERE cust.EMAIL = 'user@email.com';
+----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------+
| id | select_type | table | type | possible_keys | key   | key_len | ref   | rows | Extra                 |
+----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------+
|  1 | SIMPLE      | cust  | ref  | email         | email | 303     | const |    1 | Using index condition |
+----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------+
You will see that removing the lower() functions enables MySQL to use the index on the email column.

I know very little Java so I'm not sure if I can do my own patches, but unless you point me to a better thread/forum to post to I intend to update this thread for every inefficient SQL I see (before I also saw what seems to be an inefficient way to search by date range, but I failed to note it down so will need to catch it again).

Thanks.
maschoff
Site Admin
Posts: 2596
Joined: Thu Aug 03, 2006 10:20 am
Location: Munich, Germany
Contact:

Re: Optimizing SQL

Post by maschoff »

Thanks for your feedback! We are in the process of finalizing OpenEMM 2014 and will include your feedback here to improve the SQL queries in the Java code.
OpenEMM Maintainer
johanntagle
Posts: 15
Joined: Mon Jul 21, 2014 5:58 am

Re: Optimizing SQL

Post by johanntagle »

Okay maschoff I will. If there's any specific area you are aware the query is slow do let me know and I will look at that first. I intend to use OpenEMM in the long term so I'm very game to contribute.
johanntagle
Posts: 15
Joined: Mon Jul 21, 2014 5:58 am

Re: Optimizing SQL

Post by johanntagle »

So I found two notable entries in my slow query log. The first:

Code: Select all

  sql_text: DELETE cust, bind FROM customer_1_tbl cust, customer_1_binding_tbl bind WHERE cust.customer_id=bind.customer_id AND bind.user_status=5 AND bind.change_date < date_sub(curdate(), interval 30 day)
query_time: 00:01:19

Code: Select all

mysql> explain DELETE cust, bind FROM customer_1_tbl cust, customer_1_binding_tbl bind WHERE cust.customer_id=bind.customer_id AND bind.user_status=5 AND bind.change_date < date_sub(curdate(), interval 30 day);
+----+-------------+-------+--------+----------------------------+---------+---------+--------------------------+----------+-------------+
| id | select_type | table | type   | possible_keys              | key     | key_len | ref                      | rows     | Extra       |
+----+-------------+-------+--------+----------------------------+---------+---------+--------------------------+----------+-------------+
|  1 | SIMPLE      | bind  | ALL    | cust_1_bind_un,customer_id | NULL    | NULL    | NULL                     | 27766623 | Using where |
|  1 | SIMPLE      | cust  | eq_ref | PRIMARY                    | PRIMARY | 4       | openemm.bind.customer_id |        1 | Using where |
+----+-------------+-------+--------+----------------------------+---------+---------+--------------------------+----------+-------------+
I'm guessing this is run at least once a day, or maybe hourly? I decided to add an index in customer_1_binding_tbl. See the improvement:

Code: Select all

mysql> alter table customer_1_binding_tbl add index (user_status, change_date, customer_id);

mysql> explain  DELETE cust, bind FROM customer_1_tbl cust, customer_1_binding_tbl bind WHERE cust.customer_id=bind.customer_id AND bind.user_status=5 AND bind.change_date < date_sub(curdate(), interval 30 day);
+----+-------------+-------+--------+----------------------------------------+-------------+---------+--------------------------+------+-------------+
| id | select_type | table | type   | possible_keys                          | key         | key_len | ref                      | rows | Extra       |
+----+-------------+-------+--------+----------------------------------------+-------------+---------+--------------------------+------+-------------+
|  1 | SIMPLE      | bind  | range  | cust_1_bind_un,customer_id,user_status | user_status | 9       | NULL                     |    1 | Using where |
|  1 | SIMPLE      | cust  | eq_ref | PRIMARY                                | PRIMARY     | 4       | openemm.bind.customer_id |    1 | Using where |
+----+-------------+-------+--------+----------------------------------------+-------------+---------+--------------------------+------+-------------+
2 rows in set (0.00 sec)

mysql> DELETE cust, bind FROM customer_1_tbl cust, customer_1_binding_tbl bind WHERE cust.customer_id=bind.customer_id AND bind.user_status=5 AND bind.change_date < date_sub(curdate(), interval 30 day);
Query OK, 0 rows affected (0.03 sec)

The second happens when I click on the "Recipients" tab:

Code: Select all

  sql_text: SELECT COUNT(*) FROM customer_1_tbl cust
query_time: 00:01:45
I decided to choose a specific mailing list that has around 7M recipients and saw:

Code: Select all

  sql_text: SELECT COUNT(*) FROM customer_1_tbl cust WHERE (cust.customer_id in (SELECT customer_id FROM customer_1_binding_tbl bind WHERE (bind.mailinglist_id = 1)))
query_time: 00:00:53
This might be too late for the 2014 release because it will be a major change but I suggest you reevaluate if you really need an exact count every time the page is listed. In my opinion, making the user wait for a minute or so just to show that number on the page is not worth it. I would instead have a summary/statistics table that has the count as of a certain time (maybe updated daily) and display that number with a disclaimer and then maybe give the user the ability to refresh the summary counts when he needs an accurate number.

More to come :)
johanntagle
Posts: 15
Joined: Mon Jul 21, 2014 5:58 am

Re: Optimizing SQL

Post by johanntagle »

Me again =)

It would be nice if you can allow a date-based target group definition that doesn't make use of date_format. I understand the usefulness of matching a MMDD-formated date for birthdays and anniversaries but for use-cases such as "3 days ago" or newer than a certain date, it's more efficient to do a straightforward date comparison, especially if the date column is indexed. See the following:

For the original target group definition date_format(cust.INDEXED_DATE_COLUMN, '%Y%m%d') >= '20131101')

Code: Select all

mysql> explain SELECT cust.customer_id customer_id, bind.user_type user_type, bind.mediatype mediatype FROM customer_1_tbl cust, customer_1_binding_tbl bind WHERE cust.customer_id = bind.customer_id AND (bind.user_status = 1 AND bind.mailinglist_id = 1 AND  date_format(cust.INDEXED_DATE_COLUMN, '%Y%m%d') >= '20131101');
+----+-------------+-------+--------+----------------------------------------+-------------+---------+--------------------------+----------+-----------------------+
| id | select_type | table | type   | possible_keys                          | key         | key_len | ref                      | rows     | Extra                 |
+----+-------------+-------+--------+----------------------------------------+-------------+---------+--------------------------+----------+-----------------------+
|  1 | SIMPLE      | bind  | ref    | cust_1_bind_un,customer_id,user_status | user_status | 5       | const                    | 15029510 | Using index condition |
|  1 | SIMPLE      | cust  | eq_ref | PRIMARY                                | PRIMARY     | 4       | openemm.bind.customer_id |        1 | Using where           |
+----+-------------+-------+--------+----------------------------------------+-------------+---------+--------------------------+----------+-----------------------+
You can see the use of date_format prevents MySQL from using the index on INDEXED_DATE_COLUMN. Now if you change the target_sql to cust.INDEXED_DATE_COLUMN >= '2013-11-01'

Code: Select all

mysql> explain SELECT cust.customer_id customer_id, bind.user_type user_type, bind.mediatype mediatype FROM customer_1_tbl cust, customer_1_binding_tbl bind WHERE cust.customer_id = bind.customer_id AND (bind.user_status = 1 AND bind.mailinglist_id = 1 AND  cust.INDEXED_DATE_COLUMN >= '2013-11-01');
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+--------------------------------+---------+--------------------------+
| id | select_type | table | type   | possible_keys                          | key                 | key_len | ref                            | rows    | Extra                    |
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+--------------------------------+---------+--------------------------+
|  1 | SIMPLE      | cust  | range  | PRIMARY,indexed_date_column            | indexed_date_column | 4       | NULL                           | 1818952 | Using where; Using index |
|  1 | SIMPLE      | bind  | eq_ref | cust_1_bind_un,customer_id,user_status | cust_1_bind_un      | 8       | openemm.cust.customer_id,const |       1 | Using where              |
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+--------------------------------+---------+--------------------------+
You can see it used the index on that column, greatly reducing the number of rows read from 15M+ to 1.8M+!
maschoff
Site Admin
Posts: 2596
Joined: Thu Aug 03, 2006 10:20 am
Location: Munich, Germany
Contact:

Re: Optimizing SQL

Post by maschoff »

Thank you for your feedback!
OpenEMM Maintainer
Post Reply