Why metrics?

Since I joined 7digital I've seen the API grow from a brand new feature, sitting side by side with the (then abundant) websites, into the main focus of the company. Traffic has grown and keeps growing at an accelerating pace, and that brings new challenges. We've brought an agile perspective into play, which has helped us adapt faster and make fewer errors, but:

  • We can write unit tests, but they don't show real behaviour.
  • We can write integration tests, but they won't show the whole flow.
  • We can run smoke tests, but they won't show us realistic usage.
  • We can run load tests, but they won't have realistic weighting.

Even when we write acceptance criteria we are really being driven by assumptions: an experienced developer is still just sampling all of their previous work, and as we move to a larger number of servers and applications it is no longer humanly possible to take every variable into consideration. It is common to hear statements like 'keep an eye on the error log/server log/payments log when releasing this new feature', but when something breaks it's all about 'what was released? when was it released? is it a specific server?'. As the data grows it becomes harder to sample it and draw conclusions quickly enough to give feedback without causing issues, especially since agile tends to introduce intermediary solutions whose behaviour may differ from the final solution and has not been studied. The truth is that nothing – including developers' opinions – replaces real-life data and statistics, and if the issue is a black swan we need to churn out usable information fast!

Taken from @gregyoung

This has been seen before by other companies; for example, Flickr in their Counting and Timing blog post. See also Building Scalable Websites by Flickr's Cal Henderson. The same advice has been followed by companies like Etsy in their Measure Anything, Measure Everything blog post and Shopify in their StatsD blog post.

How to do it?

Having decided to start with a winning horse, I picked up the tools used by these companies. StatsD is described as “a network daemon for aggregating statistics (counters and timers), rolling them up, then sending them to graphite”. Graphite is described as “a highly scalable real-time graphing system. As a user, you write an application that collects numeric time-series data that you are interested in [...]. The data can then be visualized through graphite's web interfaces.”

There are several tutorials on how to implement these, and I used StatsD's own example C# client to poll our API request log for API users, endpoints used, caching and errors (a sketch of what the daemon sends over the wire is included after the graphs below). In the future it would be ideal for the application to talk to StatsD directly instead of running a polling daemon.

Graphite has a lot of useful features. The ones I've used so far include Moving Average, which smooths out spikes in the graphs and makes it easier to see behaviour trends over a short time range, and Sort by Maxima. There are even tools to forecast future behaviour and growth using Holt-Winters forecasting, which companies use to understand future scalability and performance requirements based on data from previous weeks, months or years (seen in this Etsy presentation on metrics).

How it looks and some findings

Right away I got some useful results. One API client had a bug in their implementation which meant they requested a specific endpoint far more often than they actually used it – data like this helps with debugging and can also prevent abuse.

Sampled and smoothed usage per endpoint per API user...

Another useful graph shows error rates, which might be linked to abuse, newly deployed features or other causes.

Error chart, smoothed, with a few spikes – but even those are in the 0.001% range

Here is some useful per-endpoint caching information, which helps with tuning TTLs and spotting stampede behaviour.

Sampled and smoothed Cache Miss per Endpoint
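
All of the graphs above come from nothing more exotic than counters and timers. As a rough sketch of what the polling daemon pushes at StatsD – the protocol is just plain text over UDP – here is a stripped-down client; the host, port and metric names are illustrative only, and this is a simplification rather than the example client we actually used.

    // Minimal StatsD-style client: counters are "name:value|c", timers are "name:value|ms",
    // sent fire-and-forget over UDP. StatsD aggregates them and flushes to Graphite.
    using System.Net.Sockets;
    using System.Text;

    public class MiniStatsd
    {
        private readonly UdpClient _udp;

        public MiniStatsd(string host = "localhost", int port = 8125)
        {
            _udp = new UdpClient(host, port);
        }

        public void Increment(string metric)       { Send(metric + ":1|c"); }
        public void Timing(string metric, long ms) { Send(metric + ":" + ms + "|ms"); }

        private void Send(string payload)
        {
            var bytes = Encoding.ASCII.GetBytes(payload);
            _udp.Send(bytes, bytes.Length);
        }
    }

    // For each request-log line the daemon parses, it might emit (illustrative names):
    //   statsd.Increment("api.endpoints.track_details");
    //   statsd.Increment("api.users.some_client_key");
    //   statsd.Timing("api.response_time.track_details", 42);

On the Graphite side, the smoothing and sorting mentioned earlier correspond to render targets such as movingAverage(stats.api.endpoints.*, 10), sortByMaxima(stats.api.endpoints.*) and holtWintersForecast(...) – again with illustrative metric paths.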

Opinion

Once you start using live data to provide feedback on your work there is no going back. In my opinion, analysing the short- and long-term live results of any piece of work should be mandatory as we move out of an environment that is small enough to be maintained exclusively by a team's knowledge.

Tag: Agile Development
sharri.morris@7digital.com
Thursday, May 8, 2014 - 17:28

Astro Malaysia held its annual GoInnovate Challenge Hackathon on the 10th-12th October at the Malaysian Global Innovation & Creativity Centre (MaGIC).

Hopefuls from all over Malaysia massed together for an exciting challenge set by Astro - to build a radio streaming demo. The demo product was meant to redefine the way we watch, read, listen and play with content, in two unique hacks to be completed within a 48-hour deadline. Astro offered substantial rewards to those whose ideas came out on top!

Day 0: Demo - Friday evening

Attendees ranged from junior developers to start-up teams; as long as you were at least 18 years old, you could take part!

To begin the Hackathon, entrants were fully briefed and given access to the APIs of both 7digital and music metadata company, Gracenote.

7digital’s lead API developer, Marco Bettiolo, flew in to act as Tech Support for the hackathon.

This photo shows Marco presenting a demo of a radio style streaming service he had previously built.

Day 1: Get Building!

According to the brief, hackers had to choose one of two innovative challenges:

sharri.morris@7digital.com
Tuesday, May 6, 2014 - 17:43

Managing session lifecycle is reasonably simple in a web application, with a myriad of ways to implement session-per-request. But when it comes to desktop apps, or Windows services, things are a lot less clear cut.

Our first attempt used NHibernate's "contextual sessions": when we needed a session we opened a new one, bound it to the current thread, did some work, and unbound the session.

We accomplished this with some PostSharp (an AOP framework) magic. A TransactionAttribute would open the session and start a transaction before the method was called, commit the transaction (or roll it back if an exception had occurred), and dispose of the session after the method had completed.
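
As a rough sketch of the shape this took – assuming NHibernate's thread-static contextual sessions (current_session_context_class) and PostSharp's OnMethodBoundaryAspect, with an illustrative attribute body and static session-factory holder rather than our production code – it looked something like this:

    using System;
    using NHibernate;
    using NHibernate.Context;
    using PostSharp.Aspects;

    [Serializable]
    public class TransactionAttribute : OnMethodBoundaryAspect
    {
        // Illustrative application-wide holder for the session factory.
        public static ISessionFactory SessionFactory { get; set; }

        public override void OnEntry(MethodExecutionArgs args)
        {
            // Open a session, bind it to the current context and start a transaction.
            var session = SessionFactory.OpenSession();
            CurrentSessionContext.Bind(session);
            session.BeginTransaction();
        }

        public override void OnSuccess(MethodExecutionArgs args)
        {
            // The decorated method returned normally: commit.
            SessionFactory.GetCurrentSession().Transaction.Commit();
        }

        public override void OnException(MethodExecutionArgs args)
        {
            // The decorated method threw: roll back.
            SessionFactory.GetCurrentSession().Transaction.Rollback();
        }

        public override void OnExit(MethodExecutionArgs args)
        {
            // Success or failure, unbind and dispose the session.
            var session = CurrentSessionContext.Unbind(SessionFactory);
            if (session != null) session.Dispose();
        }
    }

Using it was then just a matter of decorating a method with [Transaction].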

It was a neat solution, and it was very easy to slap the attribute on a method and hey presto - instant session! On the other hand it was difficult to test and to comprehend (if you hadn't been involved in building it), and to avoid long transactions we found ourselves re-attaching objects to new sessions.

These concerns made us feel there was a better solution out there, and the next couple of projects provided some inspiration.

sharri.morris@7digital.com
Thursday, August 8, 2013 - 16:04

Last year we published data on the productivity of our development team at 7digital, which you can read about here.

We've completed the productivity report for this year and would again like to share this with you. We've now been collecting data from teams for over 4 years with just under 4,000 data points collected over that time. This report is from April 2012 to April 2013.

New to this year is data on the historical team size (from January 2010), which has allowed us to look at the ratio of items completed to the size of the team and how the team size compares to productivity. There's also some analysis of long term trends over the entire 4 years.

In general the statistics are very positive and show significant improvements in all measurements against the last reported period:

sharri.morris@7digital.com
Friday, July 19, 2013 - 14:55

Blue and green servers. What?

As part of the 7digital web team's automated deployment process, we now have “blue-green servers”. It took a while to set up, but it's great for continuously delivering software.

This system is also known as “red/black deployments” but we preferred the blue-green name as “red” might suggest an error or fault state. You could pick any two colours that you like.

How it works is that we have two banks of web servers – the green servers and the blue servers. Other than the server names, they’re the same. Only one of these banks is live at any one time, but we could put both live if extraordinary load called for it. A new version of the site is deployed to the non-live bank, and then “going live” with the new version consists of flipping a setting on the load balancer to make the non-live bank live and vice versa.
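
Purely as an illustration of that flip (the load-balancer interface below is a hypothetical stand-in for the setting we actually change, not real code from our pipeline), a release boils down to something like this:

    using System;

    public enum Bank { Blue, Green }

    // Hypothetical stand-in for the load balancer setting we flip.
    public interface ILoadBalancer
    {
        Bank LiveBank { get; }
        void SetLiveBank(Bank bank);
    }

    public class BlueGreenRelease
    {
        private readonly ILoadBalancer _loadBalancer;

        public BlueGreenRelease(ILoadBalancer loadBalancer)
        {
            _loadBalancer = loadBalancer;
        }

        public void Release(Action<Bank> deployNewVersionTo)
        {
            // Deploy to whichever bank is currently dark...
            var dark = _loadBalancer.LiveBank == Bank.Blue ? Bank.Green : Bank.Blue;
            deployNewVersionTo(dark);          // upload, unpack, configure, warm up

            // ...then going live is a single switch on the load balancer.
            _loadBalancer.SetLiveBank(dark);
        }
    }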

Why?

Why did we do this? Mostly for speed. The previous process of deploying a new site version was getting longer and longer. The deployment script would start with one server, upload the new version of the site to it, unpack the new website files, stop the existing website, configure and start the new one, then move on to the next server and do the same.