Why metrics?

Since I joined 7digital I've seen the API grow from a brand new feature, sitting alongside the (then abundant) websites, into the main focus of the company. Traffic has grown and keeps growing at an accelerating pace, and that brings new challenges. We've brought an agile perspective into play, which has helped us adapt faster and make fewer errors, but:

  • We can do unit tests, but they don't show real behaviour.
  • We can do integration tests, but they won't show the whole flow.
  • We can do smoke tests, but they won't show us realistic usage.
  • We can do load tests, but they won't have realistic weighting.

Even when we write acceptance criteria we are really being driven by assumptions: an experienced developer is just sampling all of their previous work, and as we move to a larger number of servers and applications it's no longer humanly possible to take every variable into consideration. It is common to hear statements like 'keep an eye on the error log/server log/payments log when releasing this new feature', but when something breaks it's all about 'what was released? when was it released? is it a specific server?'. As the data grows it becomes harder to sample it and deduce from it quickly enough to give feedback without causing issues, especially as agile tends to produce intermediate solutions whose behaviour may differ from the final, well-studied design. The truth is that nothing replaces real-life data and statistics, not even developers' opinions, and if the issue is a black swan we need to churn out usable information fast!



Taken from @gregyoung

This has been seen before by other companies; Flickr, for example, in their Counting and Timing blog post (see also Building Scalable Websites by Flickr's Cal Henderson). The same advice has been followed by companies like Etsy in their Measure Anything, Measure Everything blog post and Shopify in their StatsD blog post.

How to do it?

Deciding to start with a winning horse, I picked up the tools used by these companies. StatsD is described as “a network daemon for aggregating statistics (counters and timers), rolling them up, then sending them to graphite”. Graphite is described as “a highly scalable real-time graphing system. As a user, you write an application that collects numeric time-series data that you are interested in [...]. The data can then be visualized through graphite's web interfaces.” There are several tutorials on setting these up, and I used StatsD's own example C# client to poll our API request log for API users, endpoints used, caching and errors. In the future it would be ideal for the application to talk to StatsD directly instead of relying on a polling daemon.

Graphite has a lot of useful features. The ones I've used so far include the moving average, which smooths out spikes in the graphs and makes it easier to see behaviour trends over a short time range, and sorting series by their maxima. There are even tools to forecast future behaviour and growth using Holt-Winters forecasting, which companies use to understand future scalability and performance requirements based on data from previous weeks, months or years (seen in this Etsy presentation on metrics).

How it looks and some findings

Right away I got some usable results. An API client had a bug in their implementation which meant they requested a specific endpoint far more often than they needed to; this kind of data can help with debugging and also prevent abuse.

Sampled and smoothed usage per endpoint per API user...

Another useful graph is error rates, which might be linked with abuse, deploying new features or other causes.

Error chart, smoothed, with a few spikes; even those are at the 0.001% rate

Here is some useful caching information per endpoint, handy for tuning TTLs or spotting stampede behaviour.

Sampled and smoothed Cache Miss per Endpoint
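
For reference, the polling daemon that feeds these graphs only needs to emit StatsD's simple line protocol over UDP. Here is a minimal sketch of the idea; the host, port and metric names are illustrative assumptions, not our actual setup:

using System.Net.Sockets;
using System.Text;

public class SimpleStatsdClient
{
    private readonly UdpClient _udp;

    public SimpleStatsdClient(string host = "localhost", int port = 8125)
    {
        // StatsD listens for plain-text metrics over UDP (8125 by default)
        _udp = new UdpClient(host, port);
    }

    // "metric:1|c" increments a counter
    public void Increment(string metric)
    {
        Send(metric + ":1|c");
    }

    // "metric:value|ms" records a timing in milliseconds
    public void Timing(string metric, long milliseconds)
    {
        Send(metric + ":" + milliseconds + "|ms");
    }

    private void Send(string payload)
    {
        var bytes = Encoding.ASCII.GetBytes(payload);
        _udp.Send(bytes, bytes.Length);
    }
}

So for each request line read from the log the daemon might call something like statsd.Increment("api.endpoints.tracks.requests") (a made-up metric name). Whether the metrics come from a polling daemon or from the application itself, the packets look the same, which should make a later move to direct instrumentation straightforward.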

Opinion

After you start using live data to provide feedback on your work there is no going back. In my opinion, analysing the short- and long-term live results of any piece of work should be mandatory once we move out of an environment small enough to be maintained exclusively by a team's knowledge.

Tag: Agile Development
sharri.morris@7digital.com
Saturday, July 7, 2012 - 13:03

We have recently been working on an incremental indexer for our Solr-based search implementation, which was being updated only sporadically because of the time it took to perform a complete re-index; it was taking about 5 days to create the 13GB of XML, zip it, upload it to the server, unzip it and then re-index. We have created a Windows service which queries a denormalised data structure using NHibernate. We then use SolrNet to create our Solr documents and push them to the server in batches.
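
As a rough illustration of the SolrNet side, here is a minimal sketch; the document type, field names, batch size and server URL are assumptions for the example, not our actual schema:

using System.Collections.Generic;
using System.Linq;
using Microsoft.Practices.ServiceLocation;
using SolrNet;
using SolrNet.Attributes;

// Hypothetical document type mapped to Solr fields
public class TrackDocument
{
    [SolrUniqueKey("id")] public int Id { get; set; }
    [SolrField("title")]  public string Title { get; set; }
    [SolrField("artist")] public string Artist { get; set; }
}

public class SolrIndexer
{
    private readonly ISolrOperations<TrackDocument> _solr;

    public SolrIndexer()
    {
        // Register the mapping and server URL (normally done once at application start-up)
        Startup.Init<TrackDocument>("http://localhost:8983/solr");
        _solr = ServiceLocator.Current.GetInstance<ISolrOperations<TrackDocument>>();
    }

    public void Index(IEnumerable<TrackDocument> documents)
    {
        // Push documents in fixed-size batches rather than one huge update
        foreach (var batch in documents
            .Select((doc, i) => new { doc, i })
            .GroupBy(x => x.i / 500, x => x.doc))
        {
            _solr.AddRange(batch);
            _solr.Commit();
        }
    }
}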

Solr Update Process

sharri.morris@7digital.com
Friday, March 2, 2012 - 11:47

After having read the O'Reilly book "REST in Practice", I set myself the challenge of using OpenRasta to create a basic RESTful web service. For the first day I decided to concentrate on getting a basic CRUD app working, as outlined in chapter 4. This involved the ability to create, read, update and delete physical XML file representations of Artists. It is described in the book as a Level 2 application on Richardson's maturity model, as it doesn't make use of hypermedia yet. One reason why OpenRasta is such a good framework for implementing a RESTful service is that it deals with "resources" and their representations. As outlined in "REST in Practice", a resource is anything addressable via a URI, and OpenRasta handles this model perfectly as it was built around it from the ground up.
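
A minimal sketch of how such a resource might be wired up in OpenRasta follows; the Artist type, handler and URI are assumptions for illustration, not the exact code from the exercise:

using OpenRasta.Configuration;
using OpenRasta.Web;

public class Configuration : IConfigurationSource
{
    public void Configure()
    {
        using (OpenRastaConfiguration.Manual)
        {
            // Map the Artist resource to a URI, a handler and an XML representation
            ResourceSpace.Has.ResourcesOfType<Artist>()
                .AtUri("/artists/{id}")
                .HandledBy<ArtistHandler>()
                .AsXmlDataContract();
        }
    }
}

public class Artist
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public class ArtistHandler
{
    // OpenRasta maps HTTP verbs onto handler methods by naming convention
    public Artist Get(string id)
    {
        // stub: would load the artist's XML file from disk
        return new Artist { Id = id };
    }

    public OperationResult Post(Artist artist)
    {
        // stub: would write a new XML file for the artist
        return new OperationResult.Created();
    }

    public OperationResult Put(string id, Artist artist)
    {
        // stub: would overwrite the existing XML file
        return new OperationResult.OK();
    }

    public OperationResult Delete(string id)
    {
        // stub: would delete the XML file
        return new OperationResult.NoContent();
    }
}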

The Basic Web Service

sharri.morris@7digital.com
Thursday, February 2, 2012 - 17:05

When bootstrapping a StructureMap registry, you are able to set the "lifecycle" of each instance using StructureMap's fluent interface. For example, when using NHibernate it is essential to set up ISessionFactory as a singleton and ISession on a per-HTTP-request basis (achievable with StructureMap's HybridHttpOrThreadLocalScoped directive). Example:

For<ISessionFactory>()
    .Singleton()
    .Use(SessionFactoryBuilder.BuildFor("MY.DSN.NAME", typeof(TokenMap).Assembly))
    .Named("MyInstanceName");

For<ISession>()
    .HybridHttpOrThreadLocalScoped()
    .Use(context => context.GetInstance<ISessionFactory>("MyInstanceName").OpenSession())
    .Named("MyInstanceName");
It's nice and easy to test that the singleton was created with a unit test like so:

[TestFixtureSetUp]
public void FixtureSetup()
{
    ObjectFactory.Initialize(ctx => ctx.AddRegistry(new NHibernateRegistry()));
}

[Test]
public void SessionBuilder_should_be_singleton()
{
    var sessionBuilder1 = ObjectFactory.GetInstance<ISessionFactory>();
    var sessionBuilder2 = ObjectFactory.GetInstance<ISessionFactory>();
    Assert.That(sessionBuilder1, Is.SameAs(sessionBuilder2));
}

sharri.morris@7digital.com
Wednesday, February 1, 2012 - 15:42

Introduction

We have been using Solr for search for a while now. Solr is fantastic, but the way we get our data into it is not so good: the DB is checked for new/updated/removed content, changes are written into a jobs table, and that table is then polled for pending jobs. There are numerous issues with using a DB table as a queue; some, for MySQL, are listed at:

http://www.engineyard.com/blog/2011/5-subtle-ways-youre-using-mysql-as-a...

To stop using our DB as a queue, I decided to test out setting up and using an AMQP-based message queue. AMQP is an open standard for passing messages via queues. The final goal would be to allow other teams to push high-priority updates or new content directly to the queue rather than having to go through the DB, which can add considerable latency to the system.

For this test RabbitMQ was used, as it has a .NET library, runs on virtually all OSs, has good language support and has good documentation, all of which can be found on the RabbitMQ site: http://www.rabbitmq.com/
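
To give a flavour of what pushing an update onto a queue looks like with the .NET client, here is a minimal sketch; the queue name and message body are made-up examples:

using System.Text;
using RabbitMQ.Client;

public class ContentUpdatePublisher
{
    public void Publish(string messageJson)
    {
        var factory = new ConnectionFactory { HostName = "localhost" };

        using (var connection = factory.CreateConnection())
        using (var channel = connection.CreateModel())
        {
            // Declare a durable queue so messages survive a broker restart
            channel.QueueDeclare("content.updates", durable: true,
                                 exclusive: false, autoDelete: false, arguments: null);

            var body = Encoding.UTF8.GetBytes(messageJson);

            // Publish to the default exchange, routed by queue name
            channel.BasicPublish(exchange: "", routingKey: "content.updates",
                                 basicProperties: null, body: body);
        }
    }
}

On the consuming side, the indexer would subscribe to the same queue and apply each update to Solr as it arrives.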

Getting Started

I strongly advise reading these before you start:
http://www.rabbitmq.com/install-windows.html
and