We have recently been working on an incremental indexer for our Solr-based search implementation, which was previously updated only sporadically because of the time a complete re-index took: around five days to create the 13GB of XML, zip it, upload it to the server, unzip it and then re-index. We have created a Windows service which queries a denormalised data structure using NHibernate; we then use SolrNet to create our Solr documents and push them to the server in batches.
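For context, a SolrNet document is just a plain C# class decorated with mapping attributes. A minimal sketch of the idea (the class and field names here are illustrative, not our actual schema):

using SolrNet.Attributes;

public class TrackDocument
{
    // Maps to the uniqueKey field defined in schema.xml
    [SolrUniqueKey("id")]
    public string Id { get; set; }

    [SolrField("title")]
    public string Title { get; set; }

    [SolrField("artist")]
    public string Artist { get; set; }
}

The service maps the denormalised rows returned by NHibernate onto documents like this and pushes them to Solr through ISolrOperations&lt;TrackDocument&gt;.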

Solr Update Process

When updating a Solr index you post XML documents to your index using the /update URL. However, these documents are not immediately searchable: they have to be made visible by using the commit command, which flushes the documents to disk and then starts up and registers a new searcher that can see the changes. Each commit writes the new documents into fresh segments of the index, and over time the performance of the index will degrade as new segments are created by Lucene, so the index should be optimised after a large amount of data has been added or changed, using the optimize command.
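For illustration, the body of a typical update request looks something like this (the field names are made up for the example):

<add>
  <doc>
    <field name="id">track-12345</field>
    <field name="title">Example Track</field>
  </doc>
</add>

followed by a separate post to make the changes searchable:

<commit/>

and, after a large run of changes:

<optimize/>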

Commit Strategies

We did some tweaking during development to find the sweet spot for pushing out updates. In our test environments we found the optimum to be committing batches of 20k documents and calling optimize after every 5 batches. We made the mistake of not initially testing against the full 13GB index in our test environments, so when the service was deployed to live we found that commits were taking around 4 minutes, causing the service to time out and mark the updates as failed in our database. Even though the call from SolrNet timed out, the Solr instance still processed the commit successfully, so we used the optional waitFlush and waitSearcher attributes so that our service wouldn't have to wait for a response from Solr. This caused its own problems: we were then firing off numerous commit commands, which blocked any further adds from being processed. So we needed to reduce the time taken to perform a commit.
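To make the batching concrete, here is a sketch of the shape of our update loop. GetPendingUpdateBatches and MapToDocument are hypothetical stand-ins for our NHibernate query and mapping code, and this assumes a SolrNet version whose Commit and Optimize methods accept a CommitOptions; older versions expose the same waitFlush/waitSearcher flags slightly differently:

using System.Linq;
using Microsoft.Practices.ServiceLocation;
using SolrNet;
using SolrNet.Commands.Parameters;

public void PushPendingUpdates()
{
    var solr = ServiceLocator.Current.GetInstance<ISolrOperations<TrackDocument>>();
    // Don't wait for the flush or the new searcher; the service moves straight on.
    var noWait = new CommitOptions { WaitFlush = false, WaitSearcher = false };
    var batchesSinceOptimize = 0;

    foreach (var batch in GetPendingUpdateBatches(20000)) // hypothetical helper, 20k rows per batch
    {
        solr.Add(batch.Select(MapToDocument)); // push one batch of documents
        solr.Commit(noWait);

        if (++batchesSinceOptimize == 5) // optimise after every 5 batches
        {
            solr.Optimize(noWait);
            batchesSinceOptimize = 0;
        }
    }
}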

Server Configuration

While "Googling" the answers to our problems I came across Jay Hill's post which details some of the most common pitfalls around using Solr. Some of these stem from using the "out-of-the-box" configuration which we were certainly guilty of doing. We started by reducing the autoWarmCount on our caches to zero. This specifies how many objects from the currently running searchers cache should be copied to the new searcher.

<filterCache class="solr.LRUCache" size="256" initialSize="128" autowarmCount="0"/>
We took this decision as we maintain our own cache of search results using memcached. This reduced our commit time, but optimising was still causing us problems, taking somewhere around 14 minutes. So we decided to forgo some query performance and increase the maximum number of segments that our index could have, using the optional maxSegments attribute when calling optimize.

<optimize maxSegments="5" />
We also removed the DisMax search and partitioned search request handlers, as they were never used, and removed the listeners from the newSearcher and firstSearcher events, as we no longer needed to warm the caches of those searchers. The next step is to upgrade from Solr 1.3 to 1.4 in a master/slave configuration, so that we can have a dedicated write instance and allow ourselves to scale horizontally as we move further towards using Solr rather than our RDBMS.
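For reference, the warming hooks we deleted are the QuerySenderListener entries that ship in the example solrconfig.xml, which look roughly like this in the Solr 1.3 sample config:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">solr</str><str name="start">0</str><str name="rows">10</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">fast_warm</str><str name="start">0</str><str name="rows">10</str></lst>
  </arr>
</listener>

With autowarming off and our results cached in memcached, these queries were doing work on every commit for no benefit.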

Tag: Solr
anna.siegel@7digital.com
Wednesday, May 11, 2016 - 04:20

Today marks the beginning of the Technical Academy Tour, as Academy Coordinator Miles Pool, VP of Technology Paul Shannon and, later, former apprentice Mia Filisch head out across the UK to talk about our Technical Academy.

Continuous learning has always been part of the culture at 7digital and the Technical Academy allowed us to focus those ideas and start hiring apprentices. Changing the team entry requirements and providing a defined period of training allowed us to attract people from more diverse backgrounds and has increased the proportion of female developers in our team; it’s also strengthened the culture of learning and knowledge sharing at every level.

Emma-Ashley Liles
Monday, April 4, 2016 - 13:48

Since I started at 7digital I’ve loved our belief in continuous improvement. Throughout our history as a company we have had a number of influential women working in various parts of the organisation, yet I knew there was more we could do to improve the diversity of our tech team.

Anonymous
Tuesday, February 16, 2016 - 18:30

Here at 7digital, we see the relationship between the customer and the developer as one of the most important aspects of software development. We treat software development as more of a craft than an engineering discipline. Craftsmen back in the day were in constant communication with their customers, receiving regular visits from them to discuss progress and alterations as the item took shape.

Over the last twenty years, the agile software movement, and extreme programming in particular, has championed this with its short iterations, customer showcases and active customer participation in the creation of features.

Mia.Filisch
Tuesday, December 1, 2015 - 20:10

7digital software developer Mia Filisch attended the Velocity conference in Amsterdam on October 28th, and was kind enough to share her account of the core takeaways with us here. A recurring theme around security inspired some internal knowledge-sharing sessions she has already started scheming on, and the diversity of insights made for a productive and informative conference. See below for her notes.

Key takeaways from specific sessions: