We have recently been working on an incremental indexer for our Solr based search implementation, which was being updated sporadically due to the time it took to perform a complete re-index; it was taking about 5 days to create the 13GB of XML, zip, upload to the server, unzip and then re-index. We have created a Windows service which queries a denormalised data structure using NHibernate. We then use SolrNet to create our Solr documents and push them to the server in batches.
Solr Update Process
Comparing Solr Response Sizes
After seeing some relative success in our Solr implementations xml response times by switching on Tomcats http gzip compression, I've been doing some comparisons between the other formats solr can return. We use Solrnet, an excellent open source .NET Solr client. At the moment, it only supports xml responses, but every request sends the "Accept-encoding:gzip" header as standard, so all you have to do is switch it on on your server and you've got some nicely compressed responses. There is talk of supporting javabin de-serialisation, but it's not there yet. I've decided to compare the following using curl with 1000 rows and 10000 rows in json, javabin, json/gzip compressed and javabin/gzip compressed.
My test setup is a solr 1.4 instance with around 11000 records in sitting behind an nginx reverse proxy handling the gzip compression. As I said, this could easily be achieved by switching on gzip compression in Apache Tomcat. The same 10000 records, returned using the q=*:* directive with wt=json when http gzip compressed is the smallest, but only marginally, compared to wt=javabin. It would seem that json compresses very well indeed. You can also see the massive drop just switching on gzip compression gives to xml.
My conclusion to this would be that because json is a widely accepted content-type, with many well known and fast de-serialising libraries, it would probably be worth implementing that rather than trying to de-serialise javabin. But this was only a quick test and does't take into account how quickly solr handles serialisation of the documents server-side.