For one reason or the other, I have often had a need to find matches in large data sets, usually containing significant quantities of text. One example was a 100k chapters of stories (totaling about 1.5Gb, unformatted) that I wanted to be able to search. While this is quite small when compared to even the content of a single notable site, it still forms the starting point for searches of this type.
Full-text searching is provided to varying degrees by a number of platforms:
- MySQL – does support queries against text columns, if they are indexed as full text, but the speed of searches is quite slow, the resource usage is high, and performance decreases as the size of the database increases. Given these shortcomings, I have never really pursued the finer details of this platform.
- Lucene – Good scalability and performance, however, when I started the above mentioned project, I hadn’t heard of it. My later, cursory, examination of it, did not suggest any reasons to switch platforms.
- Sphinx – this is a daemon written in C++ that provides excellent speed and search options. Results are usually quite accurate, and it scales well. It comes with a API for PHP (my language of choice) and has, so far, met all my needs.
Using Sphinx, one first builds an index (as specified by their configuration file) – it is possible to use a stopwords file (either manually created, or the indexer will list high frequency words for you). It took a fairly short time (< 2-3 min) to index 1.5Gb (I might suggest that the bottleneck was actually pulling the results from the database, but I never bothered to verify that). Searches on this index usually took about 1-2ms to complete.
The above is a fairly typical example of full text searching – however, due to its speed, and search features, I have used this on shorter strings such as filenames on occasion. Today, I had a list of about 2000 titles that I wanted to check against filenames on my computer – a quick script to insert these into a database (which I already had on hand), and a modification to a sphinx config file and I could run the indexer. That took about 1 second to run.
My quick little script then let me input as many search terms as wanted, at one per line, and would run each against the index – 2000 searches took about 30 seconds to run. With all the results displayed on a single page, it was a simple matter to browse through and see which ones I didn’t have (not all were exact matches, otherwise, a more effective method could have just produced the final list).
While I have used this on custom built forums and for searching sites in the past, the real-time aspect was previously constructed by using a main+delta scheme, where the bulk of the index would be created, and small, incremental updates would be added periodically in a second (delta) index. The latest version of sphinx (1.1.0 beta) supports a new index type for real-time search capabilities.
Given that Sphinx can also be integrated into MySQL as a database type, as well as its ease of use, I definitely am looking forward to making some more use of it in the future.