Why metrics?

Since I joined 7digital I've seen the API grow from a brand new feature, sitting alongside the (then abundant) websites, to become the main focus of the company. The traffic grew and grew and keeps on growing at an accelerating pace, and that brings us new challenges. We've brought an agile perspective into play, which has made us adapt faster and make fewer errors, but:

  • We can write unit tests, but they don't bring out the behaviour.
  • We can write integration tests, but they won't show the whole flow.
  • We can run smoke tests, but they won't show us realistic usage.
  • We can run load tests, but they won't have realistic weighting.

Even when we write acceptance criteria we are really being driven by assumptions; even an experienced developer is just sampling all of their previous work, and as we move to a larger number of servers and applications it is no longer humanly possible to take every variable into consideration. It is common to hear statements like 'keep an eye on the error log/server log/payments log when releasing this new feature', but when something breaks it's all about 'what was released? when was it released? is it a specific server?'. As the data grows it becomes harder to sample it and deduce anything from it quickly enough to give feedback without causing issues, especially since agile tends to produce intermediary solutions whose behaviour can differ from the final solution and has not been studied. The truth is that nothing replaces real-life data and statistics, developers' opinions included; if the issue is a black swan we need to churn out usable information fast!



Taken from @gregyoung

This has been seen before by other companies; for example, Flickr in their Counting and Timing blog post. See also Building Scalable Websites by Flickr's Cal Henderson. The same advice has been followed by companies like Etsy in their Measure Anything, Measure Everything blog post, and Shopify in their StatsD blog post.

How to do it?

Having decided to start with a winning horse, I picked up the tools used by these companies. StatsD is described as “a network daemon for aggregating statistics (counters and timers), rolling them up, then sending them to graphite”. Graphite is described as “a highly scalable real-time graphing system. As a user, you write an application that collects numeric time-series data that you are interested i[...]. The data can then be visualized through graphite's web interfaces.”

There are several tutorials on how to set these up; I used StatsD's own example C# client to poll our API request log for API users, endpoints used, caching and errors. In the future it would be ideal for the application to talk to StatsD directly instead of running a polling daemon.

Graphite has a lot of useful features. The ones I've used so far include Moving Average, which smooths out spikes in the graphs and makes it easier to see behaviour trends over a short time range, and Sort by Maxima. There are even tools to forecast future behaviour and growth using Holt-Winters forecasting; companies use this to understand future scalability and performance requirements based on data from previous weeks, months or years (as seen in this Etsy presentation on metrics).

How it looks and some findings

Right away I got some usable results. One API client had a bug in their implementation which meant they requested a specific endpoint far more often than they actually used it; this kind of data can help with debugging and also prevent abuse.

Sampled and smoothed usage per endpoint per API user...

Another useful graph is error rates, which might be linked with abuse, deploying new features or other causes.

Error chart, smoothed, with a few spikes, but even those are at the 0.001% rate

Here is some useful per-endpoint caching information, handy for tuning TTLs or looking for stampede behaviour.

Sampled and smoothed Cache Miss per Endpoint
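
For context, everything in the graphs above boils down to tiny plain-text packets pushed to StatsD over UDP. Here is a minimal sketch of what a polling client could emit; the host name and metric names are placeholders, not our real ones:

```csharp
using System.Net.Sockets;
using System.Text;

// StatsD's wire format is a plain-text "name:value|type" line sent over UDP.
// The host and metric names below are placeholders, not our real ones.
class StatsDWireFormatSketch
{
    static void Main()
    {
        using (var udp = new UdpClient("statsd-host", 8125)) // StatsD's default port
        {
            Send(udp, "api.endpoints.tracks.hits:1|c");       // counter increment
            Send(udp, "api.endpoints.tracks.cache_miss:1|c"); // another counter
            Send(udp, "api.endpoints.tracks.time:42|ms");     // timer, in milliseconds
        }
    }

    static void Send(UdpClient udp, string metric)
    {
        var payload = Encoding.UTF8.GetBytes(metric);
        udp.Send(payload, payload.Length);
    }
}
```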

Opinion

After you start using live data to provide feedback on your work there is no going back. In my opinion, analysing the short- and long-term live results of any piece of work should be mandatory as we move out of an environment that is small enough to be maintained exclusively by a team's knowledge.

sharri.morris@7digital.com
Thursday, September 20, 2012 - 16:14

Over the last month we've started using ServiceStack for a couple of our API endpoints. We're hosting these projects on a Debian Squeeze VM using nginx and Mono. We ran into various problems along the way. Here's a breakdown of what we found and how we solved the issues we ran into. Hopefully you'll find this useful. (We'll cover deployment/infrastructure details in a second post.)

Overriding the defaults

Some of the defaults for ServiceStack are, in my opinion, not well suited to writing an API. This is probably down to the framework's desire to be a complete web framework. Here's our current default implementation of AppHost:
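
In outline it looks something like the sketch below, written against ServiceStack v3's AppHostBase and EndpointHostConfig; MyService and the StatsDFeature plugin are placeholder names (the latter is sketched in the next section):

```csharp
using Funq;
using ServiceStack.WebHost.Endpoints;

public class AppHost : AppHostBase
{
    // MyService is a placeholder for one of our service classes; its assembly
    // tells ServiceStack where to scan for services.
    public AppHost() : base("7digital API", typeof(MyService).Assembly) { }

    public override void Configure(Container container)
    {
        SetConfig(new EndpointHostConfig
        {
            // Respond with JSON by default instead of the HTML report pages.
            DefaultContentType = "application/json",
            DebugMode = false
        });

        // Request timing (see the StatsD section below); host/port are placeholders.
        Plugins.Add(new StatsDFeature("statsd-host", 8125));
    }
}
```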

 

For me, the biggest annoyance was tracking down the DefaultContentType setting. Some of the settings are unintuitive to find, but it's not like you have to change them very often!

Timing requests with StatsD

As you can see, we've added a StatsD feature, which was very easy to do. It basically times how long each request takes and logs it to StatsD. Here's how we did it:
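
Roughly speaking, it comes down to something like the following sketch, using ServiceStack v3's request and response filters plus a raw UDP send; the metric prefix and StatsD host are placeholders:

```csharp
using System.Diagnostics;
using System.Net.Sockets;
using System.Text;
using ServiceStack.ServiceHost;
using ServiceStack.WebHost.Endpoints;

// A sketch of a request-timing plugin: start a stopwatch in a request filter,
// stop it in a response filter and push the elapsed time to StatsD as a timer.
public class StatsDFeature : IPlugin
{
    private const string StopwatchKey = "StatsDFeature.Stopwatch";
    private readonly UdpClient udp;

    public StatsDFeature(string host, int port)
    {
        udp = new UdpClient(host, port); // StatsD listens on UDP (8125 by default)
    }

    public void Register(IAppHost appHost)
    {
        // The "begin" message: stash a running stopwatch on the request.
        appHost.RequestFilters.Add((req, res, dto) =>
            req.Items[StopwatchKey] = Stopwatch.StartNew());

        // The "end" message: read the stopwatch back and emit a timer metric.
        appHost.ResponseFilters.Add((req, res, dto) =>
        {
            object value;
            if (!req.Items.TryGetValue(StopwatchKey, out value)) return;

            var elapsedMs = ((Stopwatch)value).ElapsedMilliseconds;
            var metricName = "api." + req.PathInfo.Trim('/').Replace('/', '.');
            var payload = Encoding.UTF8.GetBytes(metricName + ":" + elapsedMs + "|ms");
            udp.Send(payload, payload.Length);
        });
    }
}
```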

 

It would have been nicer if we could have wrapped the request handler, but that kind of pipeline is foreign to the framework, so you need to subscribe to the begin and end messages instead. There's probably a better way of recording the time spent, but hey ho, it works for us.

sharri.morris@7digital.com
Sunday, September 16, 2012 - 11:31

At 7digital we use Ajax to update our basket without needing to refresh the page. This provides a smoother experience for the user, but it makes it a little more effort to automate our acceptance tests with [Watir](http://wtr.rubyforge.org/).

Using timeouts is one way to wait for the basket to render, but it has two issues. If the timeout is too high, it forces all your tests to run slowly even when the underlying callback responds quickly. If the timeout is too low, you risk intermittent failures any time the callback responds slowly. To avoid this you can use the [Watir `wait_until` method](http://wtr.rubyforge.org/rdoc/classes/Watir/Waiter.html#M000343) to poll for a situation where you know the callback has succeeded. This is more in line with how a real user behaves.

### Example

sharri.morris@7digital.com
Friday, September 14, 2012 - 13:21

At 7digital we use [Cucumber](http://cukes.info/) and [Watir](http://wtr.rubyforge.org/) for running acceptance tests on some of our websites. These tests can help greatly in spotting problems with configuration, databases, load balancing, etc. that unit testing misses. But because the tests exercise the whole system, from the browser all the way through to the database, they tend to be flakier than unit tests. They can fail one minute and work the next, which can make debugging them a nightmare. So, to make the task of spotting the cause of failing acceptance tests easier, how about we set up Cucumber to take a screenshot of the desktop (and therefore the browser) any time a scenario fails?

## Install Screenshot Software

The first thing we need to do is install something that can take screenshots. The simplest solution I found is a tiny little Windows app called [SnapIt](http://90kts.com/blog/2008/capturing-screenshots-in-watir/). It takes a single screenshot of the primary screen and saves it to a location of your choice. No more, no less.

* [Download SnapIt](http://90kts.com/blog/wp-content/uploads/2008/06/snapit.exe) and save it to a known location (e.g.

sharri.morris@7digital.com
Monday, September 3, 2012 - 11:51

[TeamCity](http://www.jetbrains.com/teamcity/) is a great continuous integration server, and has brilliant built-in support for running [NUnit](http://www.nunit.org/) tests. The web interface updates automatically as each test is run, and gives immediate feedback on which tests have failed without waiting for the entire suite to finish. It also keeps track of tests over multiple builds, showing you exactly when each test first failed, how often it fails, and so on. If, like me, you are using [Cucumber](http://cukes.info/) to run your acceptance tests, wouldn't it be great to get the same level of TeamCity integration for every Cucumber test? Well now you can, using the `TeamCity::Cucumber::Formatter` from the TeamCity 5.0 EAP release. JetBrains, the makers of TeamCity, released a [blog post demonstrating the Cucumber test integration](http://blogs.jetbrains.com/ruby/2009/08/testing-rubymine-with-cucumber/), but without any details on how to set it up yourself. So I'll take you through it here.