I originally posted this at http://dtrace.org/blogs/brendan/2011/06/23/mysql-performance-schema-and-dtrace.
For a series of blog posts on file system latency, I used MySQL as an example application to examine. I was doing so using DTrace, but as it turns out, MySQL 5.5 GA and later has the performance schema, which can measure file system latency as well. Mark Leith recently posted how to do this. For more background on the performance schema, see the recent talk by Peter Zaitsev (and if you want to get really deep, see the 600+ line comment at the top of storage/perfschema/pfs.cc).
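As a quick example of the sort of data the performance schema provides here, a query like the following summarizes file I/O waits by event name (a sketch against the MySQL 5.5 performance_schema tables, assuming file instrumentation is enabled):

mysql> SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT
         FROM performance_schema.events_waits_summary_global_by_event_name
         WHERE EVENT_NAME LIKE 'wait/io/file/%'
         ORDER BY SUM_TIMER_WAIT DESC LIMIT 10;

Mark's post walks through this in much more detail.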
However, the road for DTrace doesn't quite end here.
If DTrace is available, you can use it in addition to the performance schema to investigate other areas, including kernel internals. But you can also use DTrace with the performance schema, by tracing the functions that were added to MySQL to implement it. This is instrumenting the instrumentation - taking the capabilities of the performance schema and building upon them. In this post I'll demonstrate this for two purposes: examining the overhead of the performance schema, and extending it a little.
Overhead
There have been many good articles about performance schema overhead: see DimitriK's and Vadim Tkachenko's posts. The overhead depends on the workload, what instrumentation is enabled, how much CPU headroom there is (idle), and the MySQL version. CPU-bound workloads with lock instrumentation enabled will cost the most to instrument, whereas I/O-bound workloads will cost the least (approaching 0%). Extra instrumentation does come at a cost, just as it does with DTrace, and that cost is relative to the frequency of events. Lock events can happen very frequently (100,000+ per second), especially if the workload is hot on-CPU. And if the workload is bound by CPU performance, then extra instrumentation may be taking cycles away from doing queries.
An example of a CPU-bound workload is a benchmark, such as sysbench, which I've been using to test the performance schema. Unless you spend time tuning benchmarks to match your environment, they will likely paint a worst-case picture of instrumentation overhead. It's easy to create a working set that caches entirely in DRAM, driven by multiple client threads calling queries as fast as possible, which makes the workload sensitive to extra instrumentation. The performance overhead may look bad - for both the performance schema and DTrace.
What's different here is that the performance schema adds overhead when it is enabled but not actually used (consumers OFF), which could cost 10% performance for heavy workloads. As DimitriK wrote, this should be lower, and it was on prototype code, which showed 2-3% overhead instead. This is still a degree of overhead, whereas DTrace has near-zero overhead when not in use (the mysql provider's additional NOPs), and literally zero when the pid provider is used (the original binary runs unmodified). Vadim wrote:
As I understand dtrace / systemtap probes can be disabled / enabled at run-time, and when they disabled – it is almost 0% overhead, why Performance Schema can’t do the same ?
If the performance-schema option wasn't enabled on MySQL startup, then the performance schema can't be used at all until mysqld is restarted with that option. Such a restart may be costly in terms of downtime, and it may also cause MySQL to behave differently for hours (cold cache) until the original performance issue reappears. Hence the desire to always have this enabled, but without costing too much overhead until it's needed and active.
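For reference, enabling it at startup is a one-line my.cnf change (a sketch; in 5.5 the option is performance_schema, and it defaults to OFF):

[mysqld]
performance_schema=ON

followed by a mysqld restart to take effect.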
Examining Overhead
I'll use DTrace to take a quick peek at the overhead caused by the performance schema on MySQL 5.5.31. I started by profiling stack traces (dtrace -n 'profile-997 /pid == PID/ { @[ustack()] = count(); }'), but had too much data to dig through quickly. Instead, I've picked a simple function to examine how it runs with and without the performance schema, by tracing every sub-function that it calls. This is set_query_id() from sql/sql_class.cc:
3505 void THD::set_query_id(query_id_t new_query_id)
3506 {
3507   mysql_mutex_lock(&LOCK_thd_data);
3508   query_id= new_query_id;
3509   mysql_mutex_unlock(&LOCK_thd_data);
3510 }
The performance schema defaults include instrumentation for locks, so this should operate differently when the performance schema is enabled.
I'll use the following DTrace script to trace the execution of the function:
# cat -n set_query_id.d
     1  #!/usr/sbin/dtrace -Fs
     2
     3  pid$target::*set_query_id6*:entry
     4  {
     5          self->ok = 1;
     6  }
     7
     8  pid$target:::entry,
     9  pid$target:::return
    10  /self->ok/
    11  { ; }
    12
    13  pid$target::*set_query_id6*:return
    14  /self->ok/
    15  {
    16          exit(0);
    17  }
Please do not run this in production, as it traces over 49,000 probes! Dimitri made this point very well as a danger of using DTrace: if you don't know what you are doing, you could cause significant performance loss - much more so than with the MySQL performance schema. The script above limits its overhead by exiting on line 16 as soon as it has traced a single flow of set_query_id(). I'm also not running this in production - I wrote this just to explore the execution of MySQL on a test server.
This script uses the DTrace flowindent option (-F) to indent the output based on the function depth. The target function was traced using "*set_query_id6*" to match the correct function: the wildcards match the extra C++ symbol data, and the '6' matches this function instead of another:
# dtrace -ln 'pid$target::*set_query_id*:entry' -p 20775
   ID   PROVIDER            MODULE                                   FUNCTION NAME
75845   pid20775            mysqld                __1cDTHDMset_query_id6Mx_v_ entry
76201   pid20775            mysqld  __1cEItemYreset_query_id_processor6MpC_b_ entry
Now for the results.
No Performance Schema
# ./set_query_id.d -p 8767
dtrace: script './set_query_id.d' matched 49744 probes
CPU FUNCTION
 14  -> __1cDTHDMset_query_id6Mx_v_
 14    -> mutex_lock
 14      -> mutex_lock_impl
 14        -> sigon
 14        <- sigon
 14      <- mutex_lock_impl
 14    <- mutex_lock
 14    -> mutex_unlock
 14      -> mutex_unlock_queue
 14        -> clear_lockbyte
 14        <- clear_lockbyte
 14        -> sigon
 14        <- sigon
 14      <- mutex_unlock_queue
 14    <- mutex_unlock
 14  | __1cDTHDMset_query_id6Mx_v_:return
 14  <- __1cDTHDMset_query_id6Mx_v_
Seven functions were called during set_query_id() (count the entry arrows, "->").
DTrace's flowindent option has made the output easy to follow. It's neat for short uses like this, where the thread stayed on one CPU the whole time. If the thread switches CPU, the output can be shuffled when DTrace prints the per-CPU buffers, in which case you need to include a timestamp with each line for post-sorting.
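For example, one variation on the script above replaces the empty clause on lines 8-11 with one that prints a timestamp, so the output can be re-sorted after tracing (a sketch; the extra column does make the output busier):

pid$target:::entry,
pid$target:::return
/self->ok/
{
        /* print a timestamp so the lines can be numerically sorted afterwards */
        printf("%d", timestamp);
}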
Performance Schema Enabled, consumers=YES, instruments=YES
Now I've restarted mysqld with the performance-schema option:
# ./set_query_id.d -p 20251
dtrace: script './set_query_id.d' matched 49744 probes
CPU FUNCTION
  6  -> __1cDTHDMset_query_id6Mx_v_
  6    -> get_thread_mutex_locker_v1
  6      -> pthread_getspecific
  6      <- pthread_getspecific
  6    <- get_thread_mutex_locker_v1
  6    -> start_mutex_wait_v1
  6      -> __1cPget_timer_value6FnPenum_timer_name__X_
  6        -> my_timer_cycles
  6        <- my_timer_cycles
  6      <- __1cPget_timer_value6FnPenum_timer_name__X_
  6    <- start_mutex_wait_v1
  6    -> mutex_lock
  6      -> mutex_lock_impl
  6        -> sigon
  6        <- sigon
  6      <- mutex_lock_impl
  6    <- mutex_lock
  6    -> end_mutex_wait_v1
  6      -> __1cPget_timer_value6FnPenum_timer_name__X_
  6        -> my_timer_cycles
  6        <- my_timer_cycles
  6      <- __1cPget_timer_value6FnPenum_timer_name__X_
  6      -> __1cbBinsert_events_waits_history6FpnKPFS_thread_pnQPFS_events_w...
  6      <- __1cbBinsert_events_waits_history6FpnKPFS_thread_pnQPFS_events_w...
  6      -> __1cbGinsert_events_waits_history_long6FpnQPFS_events_waits__v_
  6        -> atomic_add_32_nv
  6        <- atomic_add_32_nv
  6      <- __1cbGinsert_events_waits_history_long6FpnQPFS_events_waits__v_
  6      -> __1cbLfind_per_thread_mutex_class_wait_stat6FpnKPFS_thread_pnPPF...
  6      <- __1cbLfind_per_thread_mutex_class_wait_stat6FpnKPFS_thread_pnPPF...
  6    <- end_mutex_wait_v1
  6    -> unlock_mutex_v1
  6    <- unlock_mutex_v1
  6    -> mutex_unlock
  6      -> mutex_unlock_queue
  6        -> clear_lockbyte
  6        <- clear_lockbyte
  6        -> sigon
  6        <- sigon
  6      <- mutex_unlock_queue
  6    <- mutex_unlock
  6  | __1cDTHDMset_query_id6Mx_v_:return
  6  <- __1cDTHDMset_query_id6Mx_v_
Many more functions are executed: those ending in "_v1" are from the performance schema additions. The extra code included calling my_timer_cycles() twice, to fetch high resolution timestamps to time the wait event, and functions to insert the event into the waits history.
Performance Schema Enabled, consumers=NO, instruments=YES
I read a couple of articles that seemed to suggest that setting CONSUMERS=OFF was the way to dynamically disable the performance schema, even though INSTRUMENTS are still ON. i.e., this:
mysql> update setup_consumers set ENABLED='NO';
Query OK, 8 rows affected (0.00 sec)
Rows matched: 8  Changed: 8  Warnings: 0

mysql> select * from setup_consumers;
+-----------------------------------------------+---------+
| NAME                                          | ENABLED |
+-----------------------------------------------+---------+
| events_waits_current                          | NO      |
| events_waits_history                          | NO      |
| events_waits_history_long                     | NO      |
| events_waits_summary_by_thread_by_event_name  | NO      |
[...]
Here's what happens:
# ./set_query_id.d -p 20775
dtrace: script './set_query_id.d' matched 49744 probes
CPU FUNCTION
  1  -> __1cDTHDMset_query_id6Mx_v_
  1    -> get_thread_mutex_locker_v1
  1    <- get_thread_mutex_locker_v1
  1    -> mutex_lock
  1      -> mutex_lock_impl
  1        -> sigon
  1        <- sigon
  1      <- mutex_lock_impl
  1    <- mutex_lock
  1    -> unlock_mutex_v1
  1    <- unlock_mutex_v1
  1    -> mutex_unlock
  1      -> mutex_unlock_queue
  1        -> clear_lockbyte
  1        <- clear_lockbyte
  1        -> sigon
  1        <- sigon
  1      <- mutex_unlock_queue
  1    <- mutex_unlock
  1  | __1cDTHDMset_query_id6Mx_v_:return
  1  <- __1cDTHDMset_query_id6Mx_v_
This is much better, but it's still calling more functions: 9 instead of 7. The additional functions are get_thread_mutex_locker_v1() and unlock_mutex_v1(). Turning off all consumers has not stopped all the performance schema code from firing.
Performance Schema Enabled, consumers=NO, instruments=NO
Now I'll turn off instruments as well, using:
mysql> update setup_instruments set ENABLED='NO';
Query OK, 214 rows affected (0.01 sec)
Rows matched: 214  Changed: 214  Warnings: 0

mysql> select * from setup_instruments;
+------------------------------------------------------------+---------+-------+
| NAME                                                        | ENABLED | TIMED |
+------------------------------------------------------------+---------+-------+
| wait/synch/mutex/sql/PAGE::lock                             | NO      | YES   |
| wait/synch/mutex/sql/TC_LOG_MMAP::LOCK_sync                 | NO      | YES   |
| wait/synch/mutex/sql/TC_LOG_MMAP::LOCK_active               | NO      | YES   |
[...]
Resulting in:
# ./set_query_id.d -p 20775
dtrace: script './set_query_id.d' matched 49744 probes
CPU FUNCTION
  2  -> __1cDTHDMset_query_id6Mx_v_
  2    -> mutex_lock
  2      -> mutex_lock_impl
  2        -> sigon
  2        <- sigon
  2      <- mutex_lock_impl
  2    <- mutex_lock
  2    -> mutex_unlock
  2      -> mutex_unlock_queue
  2        -> clear_lockbyte
  2        <- clear_lockbyte
  2        -> sigon
  2        <- sigon
  2      <- mutex_unlock_queue
  2    <- mutex_unlock
  2  | __1cDTHDMset_query_id6Mx_v_:return
  2  <- __1cDTHDMset_query_id6Mx_v_
That's better - we are back to 7 function calls. This would seem like a better way to have the performance schema ready but dynamically disabled. I don't know if there is a downside to disabling the instruments as well as the consumers in this way.
Not so fast...
Unfortunately, not all functions work like set_query_id(). Many others still call into the performance schema when consumers and instruments are OFF:
# dtrace -n 'pid$target::*_v1:entry { @[probefunc] = count(); } tick-10s { exit(0); }' -p 20775
dtrace: description 'pid$target::*_v1:entry ' matched 57 probes
CPU     ID                    FUNCTION:NAME
 10  75844                        :tick-10s

  get_thread_file_descriptor_locker_v1                             39
  destroy_rwlock_v1                                                577
  init_mutex_v1                                                   4782
  init_rwlock_v1                                                 17476
  signal_cond_v1                                                219790
  get_thread_cond_locker_v1                                     220468
  get_thread_rwlock_locker_v1                                   500894
  unlock_rwlock_v1                                              520604
  get_thread_mutex_locker_v1                                   1269483
  unlock_mutex_v1                                              1269485
Hundreds of thousands of these calls are still occurring every second, despite trying to disable them, which goes a long way toward explaining the disabled overhead that was reaching 10%.
DTrace can identify why some of these are still being called by examining their stack traces:
# dtrace -n 'pid$target::unlock_mutex_v1:entry { @[ustack(5)] = count(); }' -p 20775
dtrace: description 'pid$target::unlock_mutex_v1:entry ' matched 1 probe
^C
[...]
              mysqld`unlock_mutex_v1
              mysqld`__1cLQdDuery_cacheItry_lock6Mb_b_+0x30f
              mysqld`__1cLQdDuery_cacheVsend_result_to_client6MpnDTHD_pcI_i_+0xe6
              mysqld`__1cSPrepared_statementHexecute6MpnGString_b_b_+0x216
              mysqld`__1cSPrepared_statementMexecute_loop6MpnGString_bpC3_b_+0x8c
            24009

              mysqld`unlock_mutex_v1
              mysqld`thr_alarm+0x209
              mysqld`__1cMmy_real_read6FpnGst_net_pL_L_+0xeb
              mysqld`my_net_read+0x201
              mysqld`__1cKdo_command6FpnDTHD__b_+0x72
            24694

              mysqld`unlock_mutex_v1
              mysqld`thr_end_alarm+0x13d
              mysqld`__1cMmy_real_read6FpnGst_net_pL_L_+0x31a
              mysqld`my_net_read+0x201
              mysqld`__1cKdo_command6FpnDTHD__b_+0x72
            24694
Given that the set_query_id() function was returned to its original state, I would hope that it is technically possible to stop calling all or most of these additional performance schema functions, allowing the disabled overhead to approach 0%. Stack back traces like those above would be essential for this effort, as they let the engineer immediately see which code paths still need work.
I'll stop here; I just wanted to take a peek into performance schema overhead using DTrace, and found some interesting things that could help reduce this overhead. And that's a key difference between the technologies:
The performance schema can't trace DTrace.
;-)
Extending
The performance schema adds functions (many dynamically) to gather event data. There is a list of these in the performance schema source, storage/perfschema/pfs.cc:
2042 PSI_v1 PFS_v1=
2043 {
2044   register_mutex_v1,
2045   register_rwlock_v1,
2046   register_cond_v1,
2047   register_thread_v1,
2048   register_file_v1,
2049   init_mutex_v1,
2050   destroy_mutex_v1,
2051   init_rwlock_v1,
2052   destroy_rwlock_v1,
2053   init_cond_v1,
2054   destroy_cond_v1,
2055   get_table_share_v1,
2056   release_table_share_v1,
2057   open_table_v1,
2058   close_table_v1,
2059   create_file_v1,
2060   spawn_thread_v1,
2061   new_thread_v1,
[...]
For a DTrace guy like me, this is the next best thing to having a USDT provider (like the mysql provider). These additions create traceable functions at logical points, with meaningful names. Instead of having to wade through all the source looking for things to trace, I can start with these, as they've already been added at interesting places.
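Listing them is one probe-listing invocation away (PID standing in for the mysqld process ID):

# dtrace -ln 'pid$target::*_v1:entry' -p PID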
I'll trace one as an example; here's the prototype for start_mutex_wait_v1():
1596 static void start_mutex_wait_v1(PSI_mutex_locker* locker,
1597                                 const char *src_file, uint src_line)
These functions provide meaningful arguments as well, in this case a char * for the source file and a uint for the line number. This makes it very easy to write powerful DTrace scripts, or one-liners:
# dtrace -n 'pid$target::start_mutex_wait_v1:entry {
    self->loc = strjoin(strjoin(copyinstr(arg1), ":"), lltostr(arg2)); self->s = timestamp; }
    pid$target::start_mutex_wait_v1:return /self->s/ {
    @[self->loc, "(ns)"] = quantize(timestamp - self->s); self->s = 0; }' -p 22147
dtrace: description 'pid$target::start_mutex_wait_v1:entry ' matched 2 probes
^C
[...]
  /export/home2/pb2/build/sb_0-3314961-1305135650.29/mysql-5.5.13/sql/sql_class.cc:3486  (ns)
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@                                        405
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        9377
            8192 |@@@@                                     1191
           16384 |@                                        216
           32768 |                                         114
           65536 |                                         63
          131072 |                                         33
          262144 |                                         5
          524288 |                                         2
         1048576 |                                         0

  /export/home2/pb2/build/sb_0-3314961-1305135650.29/mysql-5.5.13/storage/innobase/buf/buf0buf.c:2766  (ns)
           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             27843
            8192 |@@@@@@@@@@                               10287
           16384 |@                                        955
           32768 |@                                        532
           65536 |                                         272
          131072 |                                         154
          262144 |                                         49
          524288 |                                         19
         1048576 |                                         1
         2097152 |                                         0
There I'm tracing mutex wait time in nanoseconds as a distribution plot, along with the filename and line number for the mutex lock. (And some pathname stuff that looks like it's from the build server for this MySQL distribution.) Here is line 2766 from buf0buf.c:
2766 buf_pool_mutex_enter(buf_pool);
According to the distribution, most of the time threads waited on this for between 4 and 16 microseconds, with one instance reaching over 1 millisecond.
Stack Traces
One of the easiest extensions is to add stack traces to show the code path that led to the event. Here I've added ustack() as a key to the aggregation in the previous one-liner (a sketch of the modified one-liner follows the output), resulting in:
  /export/home2/pb2/build/sb_0-3314961-1305135650.29/mysql-5.5.13/sql/sql_class.cc:3486  (ns)
              mysqld`start_mutex_wait_v1+0x48
              mysqld`__1cQdispatch_command6FnTenum_server_command_pnDTHD_pcI_b_+0x1c5a
              mysqld`__1cKdo_command6FpnDTHD__b_+0xcb
              mysqld`__1cYdo_handle_one_connection6FpnDTHD__v_+0x2f0
              mysqld`handle_one_connection+0x51
              mysqld`pfs_spawn_thread+0x5d
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start

           value  ------------- Distribution ------------- count
            2048 |                                         0
            4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           3261
            8192 |@@@@@@@@                                 838
           16384 |@                                        82
           32768 |@                                        70
           65536 |                                         16
          131072 |                                         12
          262144 |                                         1
          524288 |                                         5
         1048576 |                                         0
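The modification is just one more aggregation key; the modified one-liner might have looked like this (a sketch, with the key order chosen to match the output above):

# dtrace -n 'pid$target::start_mutex_wait_v1:entry {
    self->loc = strjoin(strjoin(copyinstr(arg1), ":"), lltostr(arg2)); self->s = timestamp; }
    pid$target::start_mutex_wait_v1:return /self->s/ {
    @[self->loc, "(ns)", ustack()] = quantize(timestamp - self->s); self->s = 0; }' -p 22147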
sql_class.cc line 3486 is in set_query():
3484 void THD::set_query(const CSET_STRING &string_arg)
3485 {
3486   mysql_mutex_lock(&LOCK_thd_data);
3487   set_query_inner(string_arg);
3488   mysql_mutex_unlock(&LOCK_thd_data);
3489 }
But the stack trace above is showing dispatch_command(), not set_query(). I couldn't find documentation on this (other than the source code), but I think the start_mutex_wait_v1() function is showing us the locker - who has held the lock, whereas the DTrace ustack() is showing who is waiting on the lock. set_query() would be holding onto LOCK_thd_data and blocking dispatch_command().
But...
You do need the performance schema enabled and running for DTrace to trace these functions. If the performance schema is off, DTrace still works, but you'll need to trace the original MySQL functions.
So far this looks useful for providing more context for lock, file, and thread events. Here are all the PFS functions that fired during my test:
# dtrace -n 'pid$target::*_v1:entry { @[probefunc] = count(); }' -p 23285
dtrace: description 'pid$target::*_v1:entry ' matched 56 probes
^C

  delete_current_thread_v1                                           1
  destroy_cond_v1                                                    1
  init_cond_v1                                                      17
  new_thread_v1                                                     17
  set_thread_v1                                                     17
  destroy_rwlock_v1                                                195
  init_rwlock_v1                                                   212
  end_cond_wait_v1                                                 685
  start_cond_wait_v1                                               685
  get_thread_cond_locker_v1                                        686
  broadcast_cond_v1                                               2366
  destroy_mutex_v1                                                4751
  init_mutex_v1                                                   4803
  end_file_wait_v1                                               11375
  start_file_wait_v1                                             11376
  get_thread_file_descriptor_locker_v1                           11377
  signal_cond_v1                                                 14182
  end_rwlock_wrwait_v1                                           85181
  start_rwlock_wrwait_v1                                         85181
  end_rwlock_rdwait_v1                                          224720
  start_rwlock_rdwait_v1                                        224721
  get_thread_rwlock_locker_v1                                   309902
  unlock_rwlock_v1                                              379472
  end_mutex_wait_v1                                            1176697
  start_mutex_wait_v1                                          1176698
  get_thread_mutex_locker_v1                                   1176700
  unlock_mutex_v1                                              1176711
Most of these are lock-related functions, which I'm glad to have: lock issues in MySQL can get difficult, and any additional context to help diagnose them is welcome.
To investigate other behaviour, such as that of the query cache and indexing, the raw mysqld functions can be traced. Or better, use the mysql provider if it's available.
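For example, if mysqld was built with DTrace probe support, the mysql provider's query probes can time whole queries (a sketch; it assumes the query-start and query-done probes from the MySQL 5.5 USDT provider):

# dtrace -n 'mysql*:::query-start { self->s = timestamp; }
    mysql*:::query-done /self->s/ { @["query latency (ns)"] = quantize(timestamp - self->s); self->s = 0; }'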
I'd also expect more functions to be added to the performance schema over time; it's been designed with new versions in mind (see the block comment in pfs.cc).
The Rest of the World
The other time to reach for a tool such as DTrace is whenever you'd like to see behaviour outside MySQL and relate it to MySQL context. For example, showing what code paths are causing page faults to grow memory (not just the malloc() reservations, which we can DTrace too if we like):
# dtrace -n 'vminfo:::as_fault /pid == $target/ { @[ustack()] = count(); }'
dtrace: description 'vminfo:::as_fault ' matched 1 probe
^C
[...]
              libmtmalloc.so.1`malloc_internal+0x22a
              libmtmalloc.so.1`malloc+0x48
              mysqld`my_malloc+0x26
              mysqld`init_alloc_root+0x62
              mysqld`__1cOinit_sql_alloc6FpnLst_mem_root_II_v_+0x19
              mysqld`__1cSPrepared_statement2t5B6MpnDTHD__v_+0xb0
              mysqld`__1cTmysqld_stmt_prepare6FpnDTHD_pkcI_v_+0x4c
              mysqld`__1cQdispatch_command6FnTenum_server_command_pnDTHD_pcI_b_+0x1085
              mysqld`__1cKdo_command6FpnDTHD__b_+0xcb
              mysqld`__1cYdo_handle_one_connection6FpnDTHD__v_+0x2f0
              mysqld`handle_one_connection+0x51
              mysqld`pfs_spawn_thread+0x5d
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
              126

              mysqld`__1cDLEX2t5B6M_v_+0x50f
              mysqld`__1cSPrepared_statementHprepare6MpkcI_b_+0xa4
              mysqld`__1cTmysqld_stmt_prepare6FpnDTHD_pkcI_v_+0x81
              mysqld`__1cQdispatch_command6FnTenum_server_command_pnDTHD_pcI_b_+0x1085
              mysqld`__1cKdo_command6FpnDTHD__b_+0xcb
              mysqld`__1cYdo_handle_one_connection6FpnDTHD__v_+0x2f0
              mysqld`handle_one_connection+0x51
              mysqld`pfs_spawn_thread+0x5d
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
              159
I sometimes use this approach to track down the reasons why applications are growing or leaking memory. The vminfo:::as_fault probe is in the kernel, in the virtual memory code.
Other system areas that can be useful to trace with MySQL include the TCP/IP stack, to examine connection behaviour including dropped connections and packets, and file system internals, to understand the origin of slow I/O. The DTrace book is full of ideas.
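As a starting point, a one-liner like this shows mysqld read latency at the syscall layer (a sketch for Solaris-based systems; note that at this layer the reads include socket I/O from clients, not just file system I/O):

# dtrace -n 'syscall::read:entry /execname == "mysqld"/ { self->s = timestamp; }
    syscall::read:return /self->s/ { @["read latency (ns)"] = quantize(timestamp - self->s); self->s = 0; }'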
Conclusion
In this post I showed how DTrace can examine the performance of the performance schema itself, and how it can trace and extend the performance schema functions a little. DTrace does a lot more, given its visibility into the rest of the system, which it can express in MySQL context as needed. I included an example of tracing address space faults by MySQL stack trace, which shows the code paths causing memory growth.
That's not to say that DTrace is better than the performance schema - these are different tools. Application-level tools, like the performance schema, should be able to run at a lower enabled-overhead than DTrace (avoiding context switching for instrumentation), and should be safer for beginners to use (no accidentally tracing everything on the system and slowing the target). They can also be customized to keep data in a way that best suits the application. Try to use these tools first if they are available, and then DTrace to dig deeper.
Right now, I'm still analysing a lot of systems that don't have the performance schema (older MySQL) or don't have it enabled. Since DTrace is available, I'm able to use it to investigate issues immediately. I posted examples of this for on-CPU vs off-CPU time and file system latency.