Netric is an Enterprise Productivity and Collaboration suite that includes advanced tools like CRM, CMS, Project Management and more in a single unified platform. We store a large amount of data, and it grows rapidly every day. Only three months after our beta release, our dataset of object records had exceeded 100M records. It did not take long before we started bumping up against major scale limitations with relational databases.
Initially our biggest stress points all centered around full-text search. As long as our SQL queries used carefully designed and partitioned (mostly numeric) indexes, our main database, PostgreSQL 9.1 (which we absolutely love, by the way), worked fine; but it began to choke hard as our full-text indexes grew beyond the amount of resident memory on the servers. Searches against a large index with 10M+ documents caused severe lag, and updates and deletions became an even greater bottleneck. We optimized, partitioned tables, sharded datasets and more in an attempt to preserve our target sub-second response time, but each step helped only for a while, and days later we were back at the drawing board.
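For context, the kind of setup we are describing looks roughly like the sketch below: a tsvector column kept current by a trigger, with a GIN index on top. The table and column names here are illustrative only, not our actual schema.

```sql
-- Hypothetical table; names are illustrative only.
CREATE TABLE objects (
    id         bigserial PRIMARY KEY,
    account_id integer NOT NULL,
    body       text,
    tsv        tsvector
);

-- GIN index over the precomputed tsvector column.
CREATE INDEX objects_tsv_idx ON objects USING gin (tsv);

-- Keep tsv current on insert/update (PostgreSQL ships this trigger helper).
CREATE TRIGGER objects_tsv_update
    BEFORE INSERT OR UPDATE ON objects
    FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(tsv, 'pg_catalog.english', body);

-- A typical search:
SELECT id FROM objects
WHERE account_id = 42
  AND tsv @@ plainto_tsquery('english', 'quarterly report');
```

This works beautifully while the GIN index fits in memory; our trouble began once it no longer did.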
We decided it was time to integrate a separate service specifically designed to index and search millions of documents in split-second time. The solution had to be very flexible (because our schema can be user-defined), ridiculously fast, and scalable to billions or even trillions of documents. We researched and evaluated many solutions, including Sphinx (what Craigslist uses) and Apache Solr (what pretty much everyone else uses). After much testing we deployed an update that routed full-text queries to Apache Solr. This worked really well except for one major limitation: the index updates were nowhere near real-time. We thought we could fix this by forcing a commit with every document update, but that pretty much killed the Solr instance, which choked to the point of being almost unresponsive. This was a major problem because documents needed to be searchable immediately as users updated objects.
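For anyone hitting the same wall: the knob we were fighting is Solr's commit model, where a hard commit per document flushes to disk and reopens the searcher every time. Newer Solr releases (4.0 and later, which were not an option when we did this testing) address this with soft commits in solrconfig.xml, configured roughly like the fragment below; the interval values are illustrative, not a recommendation.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: durability. Flush to disk periodically, not per document. -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: visibility. Makes new docs searchable without a full flush. -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```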
We dropped back to PostgreSQL for a while as we searched and tested further. Then we found ElasticSearch and immediately fell in love with the design. The way it partitions indices was perfect for our multi-tenant application; with Solr it had been a pain to handle our incredibly dynamic schema and multi-tenant architecture. Initially we were wary of trying ElasticSearch in a production environment because it was a newer product, but with a few tests we quickly discovered that while search performance was on par with Solr, the near-real-time indexing was vastly superior. Our stress test indexed over 10M documents on a local development machine without breaking a sweat. More importantly, documents were immediately available after commit (or close enough for our needs) without a major performance hit to the index.
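To make the "near-real-time" idea concrete, here is a toy model of the pattern (a simplified illustration, not ElasticSearch's actual implementation): writes land in a cheap in-memory buffer and become searchable on a lightweight periodic refresh, rather than each document forcing an expensive flush and searcher reopen.

```python
from collections import defaultdict

class ToyNRTIndex:
    """Toy model of near-real-time search: writes go to an in-memory
    buffer and become visible to searches only after a cheap refresh,
    instead of a full commit per document."""

    def __init__(self):
        self.searchable = defaultdict(set)  # term -> doc ids visible to search
        self.buffer = []                    # (doc_id, text) not yet visible

    def index(self, doc_id, text):
        # Cheap per-document cost: just append to the buffer.
        self.buffer.append((doc_id, text))

    def refresh(self):
        # Periodic and lightweight: merge buffered docs into the searchable view.
        for doc_id, text in self.buffer:
            for term in text.lower().split():
                self.searchable[term].add(doc_id)
        self.buffer.clear()

    def search(self, term):
        return self.searchable.get(term.lower(), set())

idx = ToyNRTIndex()
idx.index(1, "quarterly sales report")
print(idx.search("report"))   # set() - buffered, not yet visible
idx.refresh()
print(idx.search("report"))   # {1}
```

With a refresh every second or so, documents appear in search results almost immediately, which was close enough for our needs.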
To demonstrate the kind of results we got, here is our test: we inserted 10M documents, then ran a series of full-text searches while simultaneously inserting one document per second. We felt this would more accurately represent our use case.
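The shape of that benchmark can be sketched as below. The `search_fn` and `insert_fn` callables here are stand-ins for whatever backend is under test; in our real test they issued queries and inserts against PostgreSQL, Solr, or ElasticSearch.

```python
import threading
import time

def run_benchmark(search_fn, insert_fn, n_searches, insert_interval=1.0):
    """Measure per-query search latency while a background writer
    inserts one document per interval, mimicking live traffic."""
    stop = threading.Event()

    def writer():
        # Insert a document, then sleep until the next interval or shutdown.
        while not stop.is_set():
            insert_fn()
            stop.wait(insert_interval)

    t = threading.Thread(target=writer)
    t.start()
    latencies = []
    try:
        for _ in range(n_searches):
            start = time.perf_counter()
            search_fn()
            latencies.append(time.perf_counter() - start)
    finally:
        stop.set()
        t.join()
    return latencies

# Example run with no-op stand-ins:
lat = run_benchmark(lambda: None, lambda: None, n_searches=5, insert_interval=0.01)
print(len(lat))  # 5
```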
Solr's poor performance here is entirely due to the simultaneous inserts; if we stopped the insert service, Solr beat both ElasticSearch and PostgreSQL hands down.
We are not the only ones who concluded that Solr is not designed for real-time indexing. Our friends over at socialcast.com reported almost exactly the same thing.
Of all the innovation taking place in the "big data" world right now, I believe ElasticSearch is one of the most useful and impressive tools. We are even considering extending its use far beyond full-text search and making it the index for our entire application, including all lists and queries. We have also started testing possible uses in our Business Intelligence module to render complex analytics on massive amounts of data.
Thanks to the simplicity of deployment and configuration (why can't more products work like that?), we highly recommend giving ElasticSearch a try, no matter how big or small your project might be. We've even considered using it as a replacement for all persistent storage.