RMOUG’s First WIT Scholarship Winner

I’ve been running the Women in Technology (WIT) initiative at RMOUG, which I first started planning in 2011 and which has grown to include other user groups and even other countries. For the last couple of years, I’ve worked to add a WIT scholarship to RMOUG’s offerings, but it wasn’t always easy to convince others that we should provide one when doing so means choosing one group over another. There’s only so much money to go around and so many initiatives we need to address each year for the Oracle community.

For 2015, we had a unique candidate submitted for our continuing education scholarship, referred to as the Stan Yellott scholarship after the wonderful RMOUG board member and much-loved mentor who passed away in 2006. The candidate, Natalie Kalin, is from Pine Creek High School in Colorado Springs, whose students attend our yearly conference, Training Days, each February. She stood out: not only had she chosen to major in robotics in college, but she was already taking AP Robotics in high school and was using that education to provide STEM initiatives to middle school students in her district.


Due to her technical skills, initiative, stellar transcripts, and scholarship submission, it was easy for the board of directors at RMOUG to recognize her and award a second scholarship for WIT. I want to share the announcement from her local school district, recognize her accomplishment, and thank her for making it so easy for me to create the WIT scholarship this year!

Thank you and congratulations, Natalie!

 





Copyright © DBA Kevlar [RMOUG's First WIT Scholarship Winner], All Rights Reserved. 2015.

What happened to “when the application is fast enough to meet users’ requirements?”

On January 5, I received an email called “Video” from my friend and former employee Guðmundur Jósepsson from Iceland. His friends call him Gummi (rhymes with “who-me”). Gummi is the guy whose name is set in the ridiculous monospace font on page xxiv of Optimizing Oracle Performance, apparently because O’Reilly’s Linotype Birka font didn’t have the letter eth (ð) in it. Gummi once modestly teased me that this is what he is best known for. But I digress...

His email looked like this:


It’s a screen shot of frame 3:12 from my November 2014 video called “Why you need a profiler for Oracle.” At frame 3:12, I am answering the question of how you can know when you’re finished optimizing a given application function. Gummi’s question is, «Oi! What happened to “when the application is fast enough to meet users’ requirements?”»

Gummi noticed (the good ones will do that) that the video says something different than the thing he had heard me say for years. It’s a fair question. Why, in the video, have I said this new thing? It was not an accident.

When are you finished optimizing?

The question in focus is, “When are you finished optimizing?” Since 2003, I have actually used three different answers:
When are you finished optimizing?
  A. When the cost of call reduction and latency reduction exceeds the cost of the performance you’re getting today.
    Source: Optimizing Oracle Performance (2003) pages 302–304.
  B. When the application is fast enough to meet your users’ requirements.
    Source: I have taught this in various courses, conferences, and consulting calls since 1999 or so.
  C. When there are no unnecessary calls, and the calls that remain run at hardware speed.
    Source: “Why you need a profiler for Oracle” (2014) frames 2:51–3:20.
My motive behind answers A and B was the idea that optimizing beyond what your business needs can be wasteful. I created these answers to deter people from misdirecting time and money toward perfecting something when those resources might be better invested improving something else. This idea was important, and it still is.

So, then, where did C come from? I’ll begin with a picture. The following figure allows you to plot the response time for a single application function, whatever “given function” you’re looking at. You could draw a similar figure for every application function on your system (although I wouldn’t suggest it).


Somewhere on this response time axis for your given function is the function’s actual response time. I haven’t marked that response time’s location specifically, but I know it’s in the blue zone, because at the bottom of the blue zone is the special response time RT. This value RT is the function’s top speed on the hardware you own today. Your function can’t go faster than this without upgrading something.

It so happens that this top speed is the speed at which your function will run if and only if (i) it contains no unnecessary calls and (ii) the calls that remain run at hardware speed. ...Which, of course, is the idea behind this new answer C.
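To make the idea concrete, here is a toy Python sketch that estimates a function’s top speed from a response time profile. All of the call names, counts, and latencies below are hypothetical, and classifying a call as “unnecessary” is exactly the judgment a profiler helps you make:

```python
# Toy sketch: estimate a function's "top speed" from a response time profile.
# Every number and call name here is made up for illustration.

profile = [
    # (call type, count, seconds per call, necessary?)
    ("db file sequential read", 1200, 0.0005, True),
    ("SQL*Net round-trip",      3214, 0.0004, False),  # mostly avoidable via array fetch
    ("CPU (parse)",              800, 0.0010, False),  # avoidable via cursor reuse
    ("CPU (execute)",              1, 0.2000, True),
]

# Actual response time is the sum over all calls; top speed keeps only
# the necessary calls, already running at hardware speed.
actual = sum(count * latency for _, count, latency, _ in profile)
top_speed = sum(count * latency for _, count, latency, necessary in profile if necessary)

print(f"actual response time: {actual:.4f}s, top speed: {top_speed:.4f}s")
```

In this made-up profile, most of the response time is avoidable work, so the gap between the actual time and the top speed is the optimization opportunity.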

Where, exactly, is your “requirement”?

Answer B (“When the application is fast enough to meet your users’ requirements”) requires that you know the users’ response time requirement for your function, so, next, let’s locate that value on our response time axis.

This is where the trouble begins. Most DBAs don’t know what their users’ response time requirements really are. Don’t despair, though; most users don’t either.

At banks, airlines, hospitals, telcos, and nuclear plants, you need strict service level agreements, so those businesses invest in quantifying them. But realize: quantifying all your functions’ response time requirements isn’t about a bunch of users sitting in a room arguing over which subjective speed limits sound the best. It’s about knowing your technological speed limits and understanding how close to those values your business needs to pay to be. It’s an expensive process. At some companies, it’s worth the effort; at most companies, it’s just not.

How about using, “well, nobody complains about it,” as all the evidence you need that a given function is meeting your users’ requirement? It’s how a lot of people do it. You might get away with doing it this way if your systems weren’t growing. But systems do grow. More data, more users, more application functions: these are all forms of growth, and you can probably measure every one of them happening where you’re sitting right now. All these forms of growth put you on a collision course with failing to meet your users’ response time requirements, whether you and your users know exactly what they are, or not.

In any event, if you don’t know exactly what your users’ response time requirements are, then you won’t be able to use “meets your users’ requirement” as your finish line that tells you when to stop optimizing. This very practical problem is the demise of answer B for most people.

Knowing your top speed

Even if you do know exactly what your users’ requirements are, it’s not enough. You need to know something more.

Imagine for a minute that you do know your users’ response time requirement for a given function, and let’s say that it’s this: “95% of executions of this function must complete within 5 seconds.” Now imagine that this morning when you started looking at the function, it would typically run for 10 seconds in your Oracle SQL Developer worksheet, but now after spending an hour or so with it, you have it down to where it runs pretty much every time in just 4 seconds. So, you’ve eliminated 60% of the function’s response time. That’s a pretty good day’s work, right? The question is, are you done? Or do you keep going?
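If you do have a stated requirement like that, checking it against measured timings is simple arithmetic. A minimal Python sketch, with made-up timings:

```python
# Checking a requirement like "95% of executions must complete within 5 seconds"
# against measured response times. The timings below are hypothetical.

timings = [3.8, 4.1, 3.9, 4.0, 4.2, 3.7, 4.0, 4.1, 3.9, 6.3]  # seconds

def meets_requirement(times, threshold, fraction=0.95):
    """True if at least `fraction` of executions finish within `threshold` seconds."""
    within = sum(1 for t in times if t <= threshold)
    return within / len(times) >= fraction

print(meets_requirement(timings, 5.0))  # one slow outlier drops us below 95%
```

The check itself is trivial; the hard part, as the surrounding discussion argues, is knowing whether meeting it means you’re actually done.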

Here is the reason that answer C is so important. You cannot responsibly answer whether you’re done without knowing that function’s top speed. Even if you know how fast people want it to run, you can’t know whether you’re finished without knowing how fast it can run.

Why? Imagine that 85% of those 4 seconds are consumed by Oracle enqueue, or latch, or log file sync calls, or by hundreds of parse calls, or 3,214 network round-trips to return 3,214 rows. If any of these things is the case, then no, you’re absolutely not done yet. If you were to allow some ridiculous code path like that to survive on a production system, you’d be diminishing the whole system’s effectiveness for everybody (even people who are running functions other than the one you’re fixing).
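The arithmetic behind that 85% figure is worth making concrete. Here is a toy profile breakdown for a hypothetical 4-second execution; the component names and timings are illustrative, not from a real trace:

```python
# A hypothetical profile for a 4-second execution: where did the time go?
# Component names and durations are illustrative only.

components = {
    "enq: TX - row lock contention": 2.1,
    "log file sync": 1.3,
    "CPU": 0.4,
    "db file sequential read": 0.2,
}

total = sum(components.values())
for name, secs in sorted(components.items(), key=lambda kv: -kv[1]):
    print(f"{name:32s} {secs:5.2f}s  {100 * secs / total:5.1f}%")
```

When contention and log-writer waits dominate a profile like this, the function is nowhere near its top speed, no matter how acceptable four seconds sounds.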

Now, sure, if there’s something else on the system that has a higher priority than finishing the fix on this function, then you should jump to it. But you should at least leave this function on your to-do list. Your analysis of the higher priority function might even reveal that this function’s inefficiencies are causing the higher-priority functions problems. Such can be the nature of inefficient code under conditions of high load.

On the other hand, if your function is running in 4 seconds and (i) its profile shows no unnecessary calls, and (ii) the calls that remain are running at hardware speeds, then you’ve reached a milestone:
  1. if your code meets your users’ requirement, then you’re done;
  2. otherwise, either you’ll have to reimagine how to implement the function, or you’ll have to upgrade your hardware (or both).
There’s that “users’ requirement” thing again. You see why it has to be there, right?

Well, here’s what most people do. They get their functions’ response times reasonably close to their top speeds (which, with good people, isn’t usually as expensive as it sounds), and then they worry about requirements only if those requirements are so important that it’s worth a project to quantify them. A requirement is usually considered really important if it’s close to your top speed or if violating it is really expensive.

This strategy works reasonably well.

It is interesting to note here that knowing a function’s top speed is actually more important than knowing your users’ requirements for that function. A lot of companies can work just fine not knowing their users’ requirements, but without knowing your top speeds, you really are in the dark. A second observation that I find particularly amusing is this: not only is your top speed more important to know, your top speed is actually easier to compute than your users’ requirement (…if you have a profiler, which was my point in the video).

Better and easier is a good combination.

Tomorrow is important, too

When are you finished optimizing?
  A. When the cost of call reduction and latency reduction exceeds the cost of the performance you’re getting today.
  B. When the application is fast enough to meet your users’ requirements.
  C. When there are no unnecessary calls, and the calls that remain run at hardware speed.
Answer A is still a pretty strong answer. Notice that it actually maps closely to answer C. Answer C’s prescription for “no unnecessary calls” yields answer A’s goal of call reduction, and answer C’s prescription for “calls that remain run at hardware speed” yields answer A’s goal of latency reduction. So, in a way, C is a more action-oriented version of A, but A goes further to combat the perfectionism trap with its emphasis on the cost of action versus the cost of inaction.

One thing I’ve grown to dislike about answer A, though, is its emphasis on today in “…exceeds the cost of the performance you’re getting today.” After years of experience with the question of when optimization is complete, I think that answer A under-emphasizes the importance of tomorrow. Unplanned tomorrows can quickly become ugly todays, and as important as tomorrow is to businesses and the people who run them, it’s even more important to another community: database application developers.

Subjective goals are treacherous for developers

Many developers have no way to test, today, the true production response time behavior of their code, which they won’t learn until tomorrow. ...And perhaps not until some remote, distant tomorrow.

Imagine you’re a developer using 100-row tables on your desktop to test code that will access 100,000,000,000-row tables on your production server. Or maybe you’re testing your code’s performance only in isolation from other workload. Both of these are problems; they’re procedural mistakes, but they are everyday real-life for many developers. When this is how you develop, telling you that “your users’ response time requirement is n seconds” accidentally implies that you are finished optimizing when your query finishes in less than n seconds on your no-load system of 100-row test tables.

If you are a developer writing high-risk code—and any code that will touch huge database segments in production is high-risk code—then of course you must aim for the “no unnecessary calls” part of the top speed target. And you must aim for the “and the calls that remain run at hardware speed” part, too, but you won’t be able to measure your progress against that goal until you have access to full data volumes and full user workloads.

Notice that to do both of these things, you must have access to full data volumes and full user workloads in your development environment. To build high-performance applications, you must do full data volume testing and full user workload testing in each of your functional development iterations.

This is where agile development methods yield a huge advantage: agile methods provide a project structure that encourages full performance testing for each new product function as it is developed. Contrast this with the terrible project planning approach of putting all your performance testing at the end of your project, when it’s too late to actually fix anything (if there’s even enough budget left over by then to do any testing at all). If you want a high-performance application with great performance diagnostics, then performance instrumentation should be an important part of your feedback for each development iteration of each new function you create.

My answer

So, when are you finished optimizing?
  A. When the cost of call reduction and latency reduction exceeds the cost of the performance you’re getting today.
  B. When the application is fast enough to meet your users’ requirements.
  C. When there are no unnecessary calls and the calls that remain run at hardware speed.
There is some merit in all three answers, but as Dave Ensor taught me inside Oracle many years ago, the correct answer is C. Answer A specifically restricts your scope of concern to today, which is especially dangerous for developers. Answer B permits you to promote horrifically bad code, unhindered, into production, where it can hurt the performance of every function on the system. Answers A and B both presume that you know information that you probably don’t know and that you may not need to know. Answer C is my favorite answer because it tells you exactly when you’re done, using units you can measure and that you should be measuring.

Answer C is usually a tougher standard than answer A or B, and when it’s not, it is the best possible standard you can meet without upgrading or redesigning something. In light of this “tougher standard” kind of talk, it is still important to understand that what is optimal from a software engineering perspective is not always optimal from a business perspective. The term optimized must ultimately be judged within the constraints of what the business chooses to pay for. In the spirit of answer A, you can still make the decision not to optimize all your code to the last picosecond of its potential. How perfect you make your code should be a business decision. That decision should be informed by facts, and these facts should include knowledge of your code’s top speed.

Thank you, Guðmundur Jósepsson, of Iceland, for your question. Thank you for waiting patiently for several weeks while I struggled to put these thoughts into words.

Log Buffer #412, A Carnival of the Vanities for DBAs

This Log Buffer Edition makes its way through the realms of Oracle, SQL Server, and MySQL and brings you a selection of blog posts.

Oracle:

Introducing Oracle Big Data Discovery Part 3: Data Exploration and Visualization

FULL and NO_INDEX Hints

Base64 Encode / Decode with Python (or WebLogic Scripting Tool) by Frank Munz

Why I’m Excited About Oracle Integration Cloud Service – New Video

Reminder: Upgrade Database 12.1.0.1 to 12.1.0.2 by July 2015

SQL Server:

An article about how we underestimate the power of joins and degrade our query performance by not using proper joins

Most large organizations have implemented one or more big data applications. As more data accumulates, internal users and analysts execute more reports and forecasts, which leads to additional queries, analysis, and reporting.

How do you develop and deploy your database?

A database must be able to maintain and enforce the business rules and relationships in data in order to maintain the data model.

Error handling with try-catch-finally in PowerShell for SQL Server

MySQL:

MySQL Enterprise Monitor 3.0.20 has been released

MySQL Cluster 7.4 is GA!

Connector/Python 2.1.1 Alpha released with C Extension

Worrying about the ‘InnoDB: detected cycle in LRU for buffer pool (…)’ message?

MySQL Cluster 7.4 GA: 200 Million QPS, Active-Active Geographic Replication and more

SQL Developer: Viewing Trace Files

Just a quick plug for looking at raw sql trace files via SQL Developer.

There is a nice Tree View:
sqldev_trace

Which can be expanded:
sqldev_trace_expand

Also summary view of statistics, filterable:
sqldev_trace_stats

And a list view, filterable and orderable:

sqldev_trace_list

Some sort of right click summary for binds/waits might be a nice addition.


Exadata Documentation Available

Please join me in welcoming the Exadata product documentation to the internet. It’s been a long time coming, but I’m glad it’s finally made an appearance!

Everything I Needed to Know About Enterprise Manager I Learned at Collaborate 2015

Collaborate 2015 at the Mandalay in Las Vegas is just around the corner, and the sheer amount of Enterprise Manager focused content is phenomenal!  Oracle partners and power users from around the world come together each year to provide the lucky attendees the best in use cases, tips, and technical know-how regarding the best infrastructure management tool in the industry.

If you are one of the lucky masses who attend this incredible conference from IOUG, Quest and OAUG, you’ll be searching the massive schedule of sessions for the presenters that will get you to the level of expertise you desire in Enterprise Manager features and products.  So let’s save you a lot of time and just tell you about all the incredible sessions and offer you some links!

Sunday, April 12th

Our adventure starts out on Sunday with the IOUG Pre-Conference Workshops.  Join Werner De Gruyter, Courtney Llamas, and me for a great hands-on lab titled “Everything I Ever Needed to Know About OEM, I Learned at Collaborate!” from 9am-noon.   After you’ve refreshed with some lunch and networked with some of the other speakers and attendees, you can then head over to Brent Sloterbeek, from Gentex Corporation, to sit in on his user experience session, where he speaks on his great success expanding the use of EM12c from just the DBA staff to the entire IT department.

Monday, April 13th

Monday begins the impressive list of presenters on Enterprise Manager topics, starting with my peer, the ever impressive Courtney Llamas, explaining how you can take EM12c from “Zero to Manageability”.  The schedule has Werner and me as her co-presenters, but she knows this topic like the back of her hand, so if she does have us speak, it will mostly be both of us saying, “What she said!” :)

The first session-choosing challenge of the conference starts right here, folks!  The wonderful Rene Antunez, from Pythian, is also presenting at the same time on Database as a Service in the Cloud. I pity those who have to choose between these two sessions, and please don’t ask me to choose between coworkers and Rene, whom I mentored.  It’s just too difficult a choice for me to make!

After Courtney and Rene, you can head over and see Rich Niemiec speak on the Best Database 12c Tuning Features, followed by another person I’ve mentored for years and am proud to be the mentor of, GoldenGate expert Bobby Curtis, who’ll take some time off from GG to tell you how best to implement Exachk for Exadata with EM12c.

At 3:15pm, we have another competition of sessions on EM12c that you need to choose between, and I’m in the middle of it, too!  Ravi Madabhushanam, from App Associates, will be discussing how easy E-Business Suite patching can be with Enterprise Manager while I discuss one of the newest symbiotic product launches from the EM12c performance team, the Power of the AWR Warehouse.  I’m not finished yet-  3:15 is a popular session time!  Erik Benner, Steve Lemme, Seth Miller, Rene Antunez, Charles Kim and Michael Timpanaro-Perrotta, (OK, so someone has a longer name than me! :) ) will be on a panel discussing how IT can benefit from and should consider Database as a Service! But wait-  that’s still not all at 3:15pm, (yeah, it’s going to be a tough decision...)  Michael Nelson from Northrop Grumman will be talking about the importance of the Automatic Diagnostic Repository and how to implement it.

Monday finishes up with three great sessions at 4:15, (sorry folks, you have to make another difficult decision again... :) )  One of my fantastic co-authors from the Expert Enterprise Manager 12c book is up to bat: Leighton Nelson, presenting Introducing Enterprise Manager.  It’s an excellent complement to Courtney’s session earlier in the day, so if you are new to the product, please consider this session.  For those of you looking for advanced topics, we then have Kant Mrityunjay from AST Corporation discussing Top Automation of WebLogic Tasks, and for the database performance folks, we have Alfredo Krieg Villa from Sherwin Williams, discussing how to Stabilize Performance with SQL Plan Management in DB12c.

Monday evening offers you the chance to relax and network with fellow Enterprise Manager experts  at 5:30pm in the Exhibit Hall at the Oracle Demogrounds where you can view demos on Total Cloud Control and Applications Management, both powered by Oracle Enterprise Manager! It’s a great chance to speak with product managers, some of the Strategic Customer Program team members and Oracle community power users.

Tuesday, April 14th

Tuesday kicks off EM12c sessions at 9:45am with Ken Ramey from Centroid Systems focusing on security with Protect Your Identities, though he will offer some bonus tuning tips, too!  At this same time, if you are looking for the best answers on how to upgrade to EM12c High Availability, head on over and see Bill Petro from American Express, who will go over a real user experience on how it’s done right!

Bobby Curtis is back up to bat, (he’ll appreciate the baseball reference... :) )  speaking on his topic of specialty, GoldenGate monitoring with EM12c.   Erick Mader and Jon Gilmore from Zirous, Inc. will be up at 11am to tell you all about Oracle WebLogic Performance Tuning.   You also have the chance to learn about the power of Oracle Real User Experience Insight, (aka RUEI) from Frank Jordan of ERP Suites during this time on Tuesday.  Rich Niemiec will be speaking at 11am, too, mostly on Database 12c features, but he does promise, since it is a session on the BEST Database 12c features, that he will be covering a few Enterprise Manager features, too!

Alfredo Krieg Villa is back on Tuesday at 2pm, to discuss how to save your day with Enterprise Manager 12c Administration.  You lucky devils get to learn more about the glories of Automated Patching with EM12c from Fernando de Souza from General Dynamics IT at 4:30 to finish out your Tuesday OEM sessions.

Tuesday concludes with the Exhibitor Showcase Happy Hour in the Exhibit Hall at the Oracle Demogrounds where you get another chance at Cloud Control and Applications Management demos, along with great Las Vegas Mandalay venue drinks and food!

Wednesday, April 15th

We start out Wednesday at 8am with Gleb Otochin from Pythian, (ask me about my first interaction with Gleb when we worked across from each other while I trained at the Ottawa office-  great story! :) )  He’s a brilliant guy and he’s going to tell you all about how to build Your Own Private Cloud.  Also at 8am, Collaborate has Shawn Ruff from Mythics discussing how to Manage and Monitor Fusion Middleware while one of my co-authors on the Enterprise Manager Command Line Interface book, Ray Smith, is going to show you how to Flaunt it if You’ve Got it with EM12c Extensibility.  You also have Frank Pound from the Bank of Canada, who’s going to show you how easy it is to Comply with Audits using EM12c.

You get a small break from EM12c content, so take a breath, network, or if you want, you can attend other sessions not on our EM12c list.  I’ll forgive you, really... :)  At 10:45, Krishna Kapa from UBS is going to explain Oracle Database 12c Multitenant with Enterprise Manager 12c.

At 12:30, we then have the great IOUG Oracle Enterprise Manager 12c SIG.  If you are passionate about EM12c and want to get involved in the community, this is the place to be!  We are planning out this event, even as I write this and I can tell you, it’s going to be great!

At 2:45pm, the always wonderful Kai Yu, from Dell, will show you how to Design and Implement Your Own Private Cloud.  During this same time, Claudia Naciff and Vladimir Lugo, of Loyola Marymount University, will be making my day by promoting Migrating Your Cron Jobs to Oracle EM12c.  I love seeing folks simplify management of their environments with Enterprise Manager, and it’s great to see Claudia and Vladimir presenting on this topic.  Last, but not least, in this time slot is Raj Garrepally from Emory University, presenting how to use Oracle Enterprise Manager, along with other tools, to ease the management of PeopleSoft environments.

Erik Benner is up at 4pm to talk about Servers and Systems and Storage, Oh My!  Don’t miss out on learning about infrastructure management with EM12c.  It’s going to be another tough decision on sessions in this 4pm time slot-  Leighton Nelson is also speaking, with Sean Brown from World Wide Technology, on Database as a Service with RAC using DB12c, along with a panel from Keith Baxter on Oracle Best Practices for Managing Oracle Applications.

Thursday, April 16th

The last day of Collaborate 2015 is made sunnier by starting the morning with Krishna Kapa from UBS as he gives another great session, this time on how to successfully combine Database 12c Multitenant Architecture with EM12c.   If Las Vegas isn’t sunny enough with the addition of that first session on the list, you also have Angelo Rosado from Oracle giving another session of great content on Managing Oracle E-Business Suite 12.2 with EM12c.

At 9:45, James Lui and Erik Benner are going to host the OAUG Oracle Enterprise Manager Applications SIG.  This is a great way to get involved with Fusion Middleware passionate folks in our community, provide feedback, and be part of the best in the middleware world, (not to be confused with Tolkien’s Middle-earth, but we’ll know who’s who if anyone shows up dressed as an elf, right? :) )  Another great session on DBaaS is up during this time for all you EM12c fans from someone I’ve enjoyed presenting with at Oracle, GP Gongloor.  He’ll show you how the future is now with Advanced Database Management with Database as a Service and EM12c.

Mark Saltsman and Kumar Anthireyan from Bias are presenting at 11am, explaining the 12 Things to Consider for Migrating EBS onto Exadata, which will include some great tips with EM12c.  Mark Scardina from Oracle is going to present on Database Clouds with Oracle RAC 12c, and as we know, cloud is the almighty world we live in now!  My previous boss and always mentor, Alex Gorbachev, is on at this time, too, presenting on Anomaly Detection for Database Monitoring.  His summary description describes it as a novel approach, but with Alex, should we expect anything less? :)  All of these great presenters are up against one of my favorite peers at Oracle, Pete Sharman, who is the master of masters when it comes to Database as a Service.  He will discuss Data Cloning and Refreshes Made Easy with EM12c Snap Clone.  If you want to learn from the master on DBaaS, this is the guy!  To make things even more difficult, my team members from SCP, Courtney Llamas and Werner De Gruyter, are presenting in this same time slot to take you Under the Hood of the Enterprise Manager.  This session covers a lot of what we do as part of the SCP team, and I know how much I learned from both of these wonderful teammates when I joined Oracle.  If you are considering implementing EM12c, this is an invaluable session.

To close out the conference’s EM12c sessions, Courtney Llamas will easily convince everyone of the power of Smarter Monitoring with Adaptive Thresholds and Time Based Metrics.  For anyone looking to “silence the white noise” in their environment, this is a not-to-be-missed session.  Anthony Noriega will also be presenting during this time slot on a great case study revolving around Moving ASM Files that includes EM12c, so let’s give these great presenters on the last day our support and not miss out on all the great content that is going on till the very end!

AND…

For those of you who are interested in Social Media, I’ll also be on the Social Media panel on Monday at 4pm, (as soon as I finish my AWR Warehouse session, so if you see me running down the hall, you know where I’m headed!) and if you are a WIT, (Women in Technology), you’ll soon see the announcement that I’m your speaker for the Monday luncheon.  I won’t ruin the surprise, and I’ll leave it to IOUG to let you in on the special topic I’ll be presenting!

If you want a “just the facts” list of all the EM12c sessions for the conference, go to the following link, which will provide everything you need to make it easy to add your choices to My Show Planner.  See you at IOUG Collaborate 2015 in April at the Mandalay!





Copyright © DBA Kevlar [Everything I Needed to Know About Enterprise Manager I Learned at Collaborate 2015], All Rights Reserved. 2015.

Real time sql monitoring – bulk bound statement

Moving through some observations of an ongoing investigation, tackling the easiest ones first.

Previously, I observed slow execution of a recursive delete of statistics history coming not from MMON but from a number of application processes that should have only been calculating stats for their own partitions.

The statement was SQL ID 9v9n97qj8z1dg:

delete /*+ dynamic_sampling(4) */
from sys.wri$_optstat_histhead_history
where savtime < :1
and rownum <= NVL(:2, rownum);

From the Real-Time SQL Monitoring output of a single execution (the DBMS_SQLTUNE.REPORT_SQL_MONITOR SQL_EXEC_ID parameter), I observed that EXECS was greater than 1, and so A-Rows was much larger than I expected for a statement which was deliberately restricting the number of rows affected by a ROWNUM predicate.

Reminder:

SQL Plan Monitoring Details (Plan Hash Value=2348801730)
=======================================================================================================================================================================================================
| Id   |            Operation             |             Name              |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  | Activity |            Activity Detail            |
|      |                                  |                               | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes |   (%)    |              (# samples)              |
=======================================================================================================================================================================================================
|    0 | DELETE STATEMENT                 |                               |         |       |     50262 |   +101 |  1423 |        1 |      |       |          |                                       |
|    1 |   DELETE                         | WRI$_OPTSTAT_HISTHEAD_HISTORY |         |       |     50361 |     +2 |  1423 |        1 |      |       |    29.74 | enq: TX - row lock contention (14630) |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | buffer busy waits (231)               |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | Cpu (82)                              |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | resmgr:cpu quantum (5)                |
|    2 |    COUNT                         |                               |         |       |     50262 |   +101 |  1423 |      14M |      |       |     0.00 | Cpu (1)                               |
|    3 |     FILTER                       |                               |         |       |     50266 |    +97 |  1423 |      14M |      |       |    30.14 | Cpu (15146)                           |
|    4 |      TABLE ACCESS BY INDEX ROWID | WRI$_OPTSTAT_HISTHEAD_HISTORY |      6M | 23218 |     50253 |   +100 |  1423 |       4G |   22 | 176KB |    23.17 | buffer busy waits (18)                |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | Cpu (11627)                           |
| -> 5 |       INDEX RANGE SCAN           | I_WRI$_OPTSTAT_HH_ST          |    990K |  5827 |     50264 |   +101 |  1423 |       4G |   41 | 328KB |    16.94 | buffer busy waits (3)                 |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | Cpu (8332)                            |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | resmgr:cpu quantum (180)              |
=======================================================================================================================================================================================================
   3 - filter(ROWNUM<=NVL(:2,ROWNUM))
   5 - access("WRI$_OPTSTAT_HISTHEAD_HISTORY"."SYS_NC00024$"<SYS_EXTRACT_UTC(:1))

Initially I was stumped by the fact that A-Rows in steps 2-3 (14M) was greater than the ROWNUM limit of 10,000, but that was before I noticed the number of Execs: 1423 Execs * 10,000 = ~14M. So was this the report of a bulk/batched statement?
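For what it’s worth, the arithmetic behind that hunch checks out exactly (a trivial sketch using the figures from the report above):

```python
# Rough arithmetic only: if each reported execution is one bulk-bound batch
# deleting up to the :2 limit of 10,000 rows, the ~14M "Rows (Actual)" figure
# at plan steps 2-3 falls out directly.
execs = 1423          # Execs column in the monitoring report
batch_limit = 10_000  # value of bind :2 (the ROWNUM limit per execution)

rows_actual = execs * batch_limit
print(rows_actual)    # 14230000, i.e. the ~14M shown as Rows (Actual)
```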

So, this is just an illustration that such a bulk-bound statement will produce output as per the above.

Create a table with enough rows to slow things down:

drop table t3;

create table t3
as
with x as
(select rownum col1
 from   dual
 connect by level <= 1000)
select x1.col1
,      rpad('X',100,'X') col2
from   x x1, x x2;

Then run an update with FORALL to bulk bind:

declare
 v1 sys.odcinumberlist := sys.odcinumberlist();
begin
 for i in 1 .. 1000
 loop
     v1.extend();
     v1(i) := i;
 end loop;
 forall i in 1 .. v1.count
   update /*+ monitor domtest */ t3
   set    col2 = rpad('Y',100,'Y')
   where  col1 = v1(i);
 commit;
end;
/

Then hunt down the SQL execution and get the RTSM output for the statement.

V$SQL reports this as one execution updating 1M rows.

select executions, rows_processed, elapsed_time/1000/1000, sql_text from v$sql where sql_id = '3tkqtzjyaa02g';

EXECUTIONS ROWS_PROCESSED ELAPSED_TIME/1000/1000 SQL_TEXT
---------- -------------- ---------------------- ------------------------------------------------------------------------------
         1        1000000             109.596429 UPDATE /*+ monitor domtest */ T3 SET COL2 = RPAD('Y',100,'Y') WHERE COL1 = :B1

RTSM reports it as 1044 (!?) executions and doesn’t seem to report the number of rows updated.
We only get the value from the initial bind – not unexpected.

select dbms_sqltune.report_sql_monitor('3tkqtzjyaa02g',sql_exec_id => '16777216') from dual;
SQL Monitoring Report

SQL Text
------------------------------
UPDATE /*+ monitor domtest */ T3 SET COL2 = RPAD('Y',100,'Y') WHERE COL1 = :B1

Global Information
------------------------------
 Status              :  DONE                 
 Instance ID         :  1                    
 SQL ID              :  3tkqtzjyaa02g        
 SQL Execution ID    :  16777216             
 Execution Started   :  02/26/2015 13:25:29  
 First Refresh Time  :  02/26/2015 13:25:29  
 Last Refresh Time   :  02/26/2015 13:27:19  
 Duration            :  110s                 
 Module/Action       :  SQL Developer/-      
 Service             :  SYS$USERS            
 Program             :  SQL Developer        

Binds
========================================================================================================================
| Name | Position |  Type  |                                           Value                                           |
========================================================================================================================
| :B1  |        1 | NUMBER | 1                                                                                         |
========================================================================================================================

Global Stats
=========================================
| Elapsed |   Cpu   |  Other   | Buffer |
| Time(s) | Time(s) | Waits(s) |  Gets  |
=========================================
|     110 |     109 |     0.16 |    16M |
=========================================

SQL Plan Monitoring Details (Plan Hash Value=669554690)
==========================================================================================================================
| Id |      Operation       | Name |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Activity | Activity Detail |
|    |                      |      | (Estim) |      | Active(s) | Active |       | (Actual) |   (%)    |   (# samples)   |
==========================================================================================================================
|  0 | UPDATE STATEMENT     |      |         |      |           |        |  1044 |          |          |                 |
|  1 |   UPDATE             | T3   |         |      |       109 |     +2 |  1044 |        0 |     0.92 | Cpu (1)         |
|  2 |    TABLE ACCESS FULL | T3   |     171 | 3409 |       110 |     +1 |  1044 |       1M |    99.08 | Cpu (108)       |
==========================================================================================================================

By contrast a non-bulk bound FOR LOOP statement would register as 1000 executions in V$SQL and each a separate SQL_EXEC_ID.
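To make that bookkeeping difference concrete, here is a small Oracle-free model (hypothetical counters, not real V$SQL mechanics) of why the FORALL shows up as one execution updating 1M rows while a row-by-row loop would show up as 1000 executions:

```python
# Hedged sketch: model how execution counts accumulate for a bulk-bound
# FORALL versus a row-by-row FOR loop. Each bind value of col1 matches
# 1000 rows in t3 (the 1000 x 1000 cross join above), hence rows_per_bind.
class StatementStats:
    def __init__(self):
        self.executions = 0
        self.rows_processed = 0

    def execute(self, binds, rows_per_bind=1000):
        # One call = one execution, however many binds are batched in it.
        self.executions += 1
        self.rows_processed += len(binds) * rows_per_bind

bulk = StatementStats()
bulk.execute(list(range(1000)))   # FORALL: one call with an array of 1000 binds

loop = StatementStats()
for v in range(1000):             # FOR loop: one call per element
    loop.execute([v])

print(bulk.executions, bulk.rows_processed)  # 1 1000000
print(loop.executions, loop.rows_processed)  # 1000 1000000
```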

However, this does nothing to address the concern about what is going on such that application processes executing gather_table_stats for a particular partition each end up running this indiscriminate bulk cleanup statement… heading that way next.


COUNT STOPKEY – NVL

Yesterday I mentioned issues with a recursive delete operation on statistics history.

This is a quick illustration of the last points I made on that post regarding the lack of a COUNT STOPKEY optimisation because of the use of NVL.

COUNT STOPKEY is an optimisation which allows processing to stop once the target number of rows has been reached.

For example:

create table t1 as select * from dba_objects;

alter session set statistics_level = all;

var rn number
exec :rn := 10;

select *
from   t1
where  rownum <= :rn
and    mod(object_id,5) = 0;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

Plan hash value: 3836375644

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |     10 |00:00:00.01 |       4 | 
|*  1 |  COUNT STOPKEY     |      |      1 |        |     10 |00:00:00.01 |       4 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |  26148 |     10 |00:00:00.01 |       4 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=:RN)
   2 - filter(MOD("OBJECT_ID",5)=0) 

COUNT STOPKEY kicks in once we have fetched the relevant rows and stops any unnecessary further execution – note in particular the A-Rows and Buffers figures for step 2.

However, if we use NVL around our ROWNUM limitation, then this optimisation is prevented.

var rn number
exec :rn := 10;

select *
from   t1
where  rownum <= nvl(:rn,rownum)
and    mod(object_id,5) = 0;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

Plan hash value: 624922415

--------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |     10 |00:00:00.09 |    2310 |
|   1 |  COUNT              |      |      1 |        |     10 |00:00:00.09 |    2310 |
|*  2 |   FILTER            |      |      1 |        |     10 |00:00:00.09 |    2310 |
|*  3 |    TABLE ACCESS FULL| T1   |      1 |  26148 |  29630 |00:00:00.08 |    2310 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(ROWNUM<=NVL(:RN,ROWNUM))
   3 - filter(MOD("OBJECT_ID",5)=0)
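As an aside, the behavioural difference can be sketched outside the database (a hedged, Oracle-free simulation; the function names are made up):

```python
# Sketch of why "ROWNUM <= :rn" can stop early (COUNT STOPKEY) while
# "ROWNUM <= NVL(:rn, ROWNUM)" is applied as an ordinary FILTER, forcing
# the scan underneath to run to completion.

def stopkey_scan(rows, rn):
    """ROWNUM <= :rn -- halt the scan once rn rows have been kept."""
    kept, scanned = [], 0
    for r in rows:
        scanned += 1
        if r % 5 == 0:            # mod(object_id, 5) = 0
            kept.append(r)
            if len(kept) >= rn:   # COUNT STOPKEY: no need to read further
                break
    return kept, scanned

def filter_scan(rows, rn):
    """ROWNUM <= NVL(:rn, ROWNUM) -- per-row FILTER, full scan underneath."""
    kept, scanned = [], 0
    for r in rows:
        scanned += 1
        if r % 5 == 0:
            rownum = len(kept) + 1                     # ROWNUM advances only on keep
            if rownum <= (rn if rn is not None else rownum):
                kept.append(r)
    return kept, scanned

print(stopkey_scan(range(1, 1001), 10))  # 10 rows kept, only 50 rows scanned
print(filter_scan(range(1, 1001), 10))   # same 10 rows, all 1000 rows scanned
```

The second function mirrors the plan above: the same 10 rows come back, but the table access underneath does all the work, just as A-Rows and Buffers showed.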


Understanding vs Resolution – Statistics History Cleanup

Today I helped resolve a problem quickly but to the detriment of my understanding of exactly what was going on and why.

And as part of the quicker resolution, I had to destroy the evidence which would have helped me understand it better.

So… now I need to go back and figure it out, if I have time to model it properly, etc.

Here’s what little I know so far.

What happened was that there were a number of application sessions experiencing slowness and contention when executing a recursive stats history cleanup statement.

Version is 11.2.0.3

The statement was SQL ID 9v9n97qj8z1dg:

delete /*+ dynamic_sampling(4) */ from sys.wri$_optstat_histhead_history where savtime < :1 and rownum <= NVL(:2, rownum);

First up, the resolution was quick and easy and, by a quirk of coincidence, covered by my last post:

https://orastory.wordpress.com/2015/02/16/dbms_stats-purge_stats/

We just got rid of all the statistics history using the magic PURGE_ALL truncate flag.
It is of limited usefulness day-to-day anyway.

The slowness was holding up ETL jobs which were just trying to calculate stats for their own partition.

I was brought in towards the end of the situation, but here’s an illustration of the slowness:

SQL Text
------------------------------
delete /*+ dynamic_sampling(4) */ from sys.wri$_optstat_histhead_history where savtime < :1 and rownum <= NVL(:2, rownum)

Global Information
------------------------------
 Status              :  EXECUTING               
 Instance ID         :  1                       
 SQL ID              :  9v9n97qj8z1dg           
 SQL Execution ID    :  16777249                
 Execution Started   :  02/24/2015 19:11:25     
 First Refresh Time  :  02/24/2015 19:13:06     
 Last Refresh Time   :  02/25/2015 09:10:35     
 Duration            :  50351s                  
 Module/Action       :  JDBC Thin Client/-      
 Program             :  JDBC Thin Client        

Binds
========================================================================================================================
| Name | Position |  Type  |                                           Value                                           |
========================================================================================================================
| :2   |        2 | NUMBER | 10000                                                                                     |
========================================================================================================================

Global Stats
===============================================================================================
| Elapsed |   Cpu   |    IO    | Application | Concurrency |  Other   | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   |  Waits(s)   | Waits(s) |  Gets  | Reqs | Bytes |
===============================================================================================
|   50359 |   35199 |     0.16 |       14669 |         254 |      237 |     1G |   63 | 504KB |
===============================================================================================

SQL Plan Monitoring Details (Plan Hash Value=2348801730)
=======================================================================================================================================================================================================
| Id   |            Operation             |             Name              |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  | Activity |            Activity Detail            |
|      |                                  |                               | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes |   (%)    |              (# samples)              |
=======================================================================================================================================================================================================
|    0 | DELETE STATEMENT                 |                               |         |       |     50262 |   +101 |  1423 |        1 |      |       |          |                                       |
|    1 |   DELETE                         | WRI$_OPTSTAT_HISTHEAD_HISTORY |         |       |     50361 |     +2 |  1423 |        1 |      |       |    29.74 | enq: TX - row lock contention (14630) |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | buffer busy waits (231)               |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | Cpu (82)                              |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | resmgr:cpu quantum (5)                |
|    2 |    COUNT                         |                               |         |       |     50262 |   +101 |  1423 |      14M |      |       |     0.00 | Cpu (1)                               |
|    3 |     FILTER                       |                               |         |       |     50266 |    +97 |  1423 |      14M |      |       |    30.14 | Cpu (15146)                           |
|    4 |      TABLE ACCESS BY INDEX ROWID | WRI$_OPTSTAT_HISTHEAD_HISTORY |      6M | 23218 |     50253 |   +100 |  1423 |       4G |   22 | 176KB |    23.17 | buffer busy waits (18)                |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | Cpu (11627)                           |
| -> 5 |       INDEX RANGE SCAN           | I_WRI$_OPTSTAT_HH_ST          |    990K |  5827 |     50264 |   +101 |  1423 |       4G |   41 | 328KB |    16.94 | buffer busy waits (3)                 |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | Cpu (8332)                            |
|      |                                  |                               |         |       |           |        |       |          |      |       |          | resmgr:cpu quantum (180)              |
=======================================================================================================================================================================================================
   3 - filter(ROWNUM<=NVL(:2,ROWNUM))
   5 - access("WRI$_OPTSTAT_HISTHEAD_HISTORY"."SYS_NC00024$"<SYS_EXTRACT_UTC(:1))

So, my first thoughts were:

1. I thought that MMON was responsible for purging old stats… clearly not in this case, as there were multiple concurrent application connections purging old data as part of their ETL process.

2. The DELETE removes any history older than a parameterised date – the first 10,000 rows thereof. There is no connection to the objects whose statistics the application ETL jobs were working on.

3. I would expect a COUNT STOPKEY operation but then the (careless?) NVL predicate avoids this optimisation.

4. As a consequence (?), wouldn’t we normally expect to see that STOPKEY optimisation stop at 10,000 rows and show that in “Rows (Actual)”? But it is reported as 14M…

More questions than answers…


Virtual CPUs with Google Compute Engine

Continuing my series on virtual CPU behavior in Amazon Web Services, Amazon Web Services HVM instances, and Microsoft Azure, I’m taking a look at Google Compute Engine (GCE). GCE is a relative newcomer to the public cloud world, having become generally available in December 2013. It does have some interesting features, including transparent maintenance through live migration, and automatic sustained-use discounts without upfront commitments.

Unlike Amazon or Microsoft, Google is very upfront about their vCPU definition.

For the n1 series of machine types, a virtual CPU is implemented as a single hyperthread on a 2.6GHz Intel Sandy Bridge Xeon or Intel Ivy Bridge Xeon (or newer) processor. This means that the n1-standard-2 machine type will see a whole physical core.

I still believe calling such a hyperthread a “virtual CPU” is misleading. When creating a virtual machine in a non-cloud VM platform, 1 virtual CPU = 1 physical core. Plain and simple. But when using a cloud platform, I need 2 virtual CPUs to get that same physical core.

[Image: cpu-comparison]

Anyway, off to run some CPU tests. n1-standard-4 is a close match to the m3.xlarge instances previously tested, so I’ll try that.

Getting set up on Google Compute Engine

I already signed up for Google Compute Engine’s free trial and created a project I’m calling marc-cputest. After installing the gcloud command-line tools:

[marc@quartz ~]$ gcloud auth login --no-launch-browser
Go to the following link in your browser:

https://accounts.google.com/o/oauth2/auth?redirect_uri=urn%3Aietf%(redacted)&access_type=offline

Enter verification code: (redacted)
Saved Application Default Credentials.

You are now logged in as [fielding@pythian.com].
Your current project is [None].  You can change this setting by running:
  $ gcloud config set project PROJECT
[marc@quartz ~]$ gcloud config set project marc-cputest
[marc@quartz ~]$ gcloud config set compute/zone us-central1-a
[marc@quartz ~]$ gcloud compute instances create cpu-test-n4 --image centos-6 --machine-type "n1-standard-4" --zone us-central1-a
Created [https://www.googleapis.com/compute/v1/projects/marc-cputest/zones/us-central1-a/instances/cpu-test-n4].
NAME        ZONE          MACHINE_TYPE  INTERNAL_IP    EXTERNAL_IP   STATUS
cpu-test-n4 us-central1-a n1-standard-4 10.240.222.194 104.154.75.96 RUNNING
[marc@quartz ~]$ gcloud compute ssh cpu-test-n4
WARNING: You do not have an SSH key for Google Compute Engine.
WARNING: [/usr/bin/ssh-keygen] will be executed to generate a key.
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/marc/.ssh/google_compute_engine.
Your public key has been saved in /home/marc/.ssh/google_compute_engine.pub.
The key fingerprint is:
(redacted)
Updated [https://www.googleapis.com/compute/v1/projects/marc-cputest].
Warning: Permanently added '104.154.75.96' (RSA) to the list of known hosts.
Warning: Permanently added '104.154.75.96' (RSA) to the list of known hosts.

OK, instance all set and connected. As a CentOS 6 image it doesn’t allow SSH root logins by default, so attempting to set up a gcloud environment as a root user will get you “permission denied” errors on SSH. Serves me right for trying to run these tools as root in the first place :-).

Looking around

Checking what we’ve got:

[marc@cpu-test-n4 ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU @ 2.60GHz
cpu MHz         : 2600.000
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
processor       : 1
model name      : Intel(R) Xeon(R) CPU @ 2.60GHz
cpu MHz         : 2600.000
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
processor       : 2
model name      : Intel(R) Xeon(R) CPU @ 2.60GHz
cpu MHz         : 2600.000
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
processor       : 3
model name      : Intel(R) Xeon(R) CPU @ 2.60GHz
cpu MHz         : 2600.000
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2

Google has redacted the exact CPU model numbers, but has clearly marked this as a 2-core system with core IDs 0 and 1.
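To double-check that reading, the relevant /proc/cpuinfo fields can be parsed programmatically (a small sketch; the sample text is the output above trimmed to the interesting fields):

```python
# Parse (processor, physical id, core id) triples from cpuinfo-style text to
# count hardware threads versus distinct physical cores.
sample = """\
processor : 0
physical id : 0
core id : 0
processor : 1
physical id : 0
core id : 0
processor : 2
physical id : 0
core id : 1
processor : 3
physical id : 0
core id : 1
"""

def topology(cpuinfo_text):
    threads, cores, phys = 0, set(), None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            threads += 1
        elif key == "physical id":
            phys = value
        elif key == "core id":
            cores.add((phys, value))   # a core is unique per (socket, core id)
    return threads, len(cores)

print(topology(sample))  # (4, 2): four vCPUs backed by two physical cores
```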

The single-CPU case

[marc@cpu-test-n4 ~]$ taskset -pc 0 $$
pid 1558's current affinity list: 0-3
pid 1558's new affinity list: 0
[marc@cpu-test-n4 ~]$  dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 14.3427 s, 151 MB/s
[marc@cpu-test-n4 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 29.3081 s, 74.1 MB/s
2170552320 bytes (2.2 GB) copied, 29.3065 s, 74.1 MB/s

We get a nice boost in raw CPU numbers compared to the 120 MB/s I saw in AWS. With two processes sharing this CPU, we see a bit less than half the throughput each.

Sharing the cores (or trying to)

[marc@cpu-test-n4 ~]$ taskset -pc 0,1 $$
pid 1558's current affinity list: 0
pid 1558's new affinity list: 0,1
[marc@cpu-test-n4 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
[1] 1803
[2] 1805
[marc@cpu-test-n4 ~]$ 2170552320 bytes (2.2 GB) copied, 14.6959 s, 148 MB/s
2170552320 bytes (2.2 GB) copied, 14.7139 s, 148 MB/s

This is interesting; we see almost full-speed throughput in spite of processors 0 and 1 being said to share a core. With processors 0 and 2, the situation is the same.

[marc@cpu-test-n4 ~]$ taskset -pc 0,2 $$
pid 1558's current affinity list: 0,1
pid 1558's new affinity list: 0,2
[marc@cpu-test-n4 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
[1] 1830
[2] 1833
[marc@cpu-test-n4 ~]$ 2170552320 bytes (2.2 GB) copied, 14.6683 s, 148 MB/s
2170552320 bytes (2.2 GB) copied, 14.6692 s, 148 MB/s

Is the CPU scheduler ignoring my taskset commands? I ran mpstat with 2-second samples during the test to see actual CPU usage:

[marc@cpu-test-n4 ~]$ mpstat -P ALL 2
...
06:08:44 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
06:08:46 PM  all   46.31    0.00    3.75    0.00    0.00    0.00    0.00    0.00   49.94
06:08:46 PM    0   93.50    0.00    6.50    0.00    0.00    0.00    0.00    0.00    0.00
06:08:46 PM    1   92.00    0.00    8.00    0.00    0.00    0.00    0.00    0.00    0.00
06:08:46 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:08:46 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
...
06:08:52 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
06:08:54 PM  all   46.75    0.00    3.25    0.00    0.00    0.00    0.00    0.00   50.00
06:08:54 PM    0   93.47    0.00    6.53    0.00    0.00    0.00    0.00    0.00    0.00
06:08:54 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:08:54 PM    2   93.50    0.00    6.50    0.00    0.00    0.00    0.00    0.00    0.00
06:08:54 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

So the taskset commands are working: when we ask for CPUs 0 and 1, we get them, but the throughput shows that cores aren’t actually being shared. It means that the CPUs in the virtual machine are not statically bound to hardware threads as seen under AWS. I’d call it a win, as it gives more consistent performance even if the guest operating system makes poor CPU scheduling decisions, as in this case.

[marc@cpu-test-n4 ~]$ taskset -pc 0-3 $$
pid 1558's current affinity list: 0,2
pid 1558's new affinity list: 0-3
[marc@cpu-test-n4 ~]$ for i in {1..4}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 22.9823 s, 94.4 MB/s
2170552320 bytes (2.2 GB) copied, 22.9914 s, 94.4 MB/s
2170552320 bytes (2.2 GB) copied, 22.9915 s, 94.4 MB/s
2170552320 bytes (2.2 GB) copied, 23.1333 s, 93.8 MB/s

This is more like the throughput we would expect with two shared cores.
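Pulling the measured numbers together suggests how much the second hyperthread per core buys (back-of-envelope arithmetic only, using the figures above):

```python
# Rough arithmetic: with all four hyperthreads busy, aggregate throughput
# exceeds two independent single-thread runs, so the second hyperthread of
# each core adds roughly 25%.
single_thread = 151.0   # MB/s, one process alone on one vCPU
four_way_each = 94.4    # MB/s per process with four processes running

aggregate = 4 * four_way_each    # total across both physical cores
per_core = aggregate / 2         # both hyperthreads of one core busy
gain = per_core / single_thread  # throughput gain from hyperthreading
print(round(gain, 2))            # ~1.25
```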

Lessons learned

Over the course of these tests, I’ve discovered a few things:

  • Like its competitors, Google implements virtual CPUs as shared hyperthreads, but it is very upfront about this behavior.
  • Actual throughput for a simple gzip workload is excellent.
  • Google Compute Engine has an abstraction layer in front of CPUs that dynamically schedules tasks between CPU threads, in addition to the regular scheduler in the virtual machine. In my testing, it allocates tasks efficiently across CPU cores, even when the OS scheduler is configured suboptimally.