Archive for the ‘ Sphinx ’ Category

Sphinx Search Installation

Sphinx search introduction


If you have read my introduction to full-text search, or an article elsewhere, and decided to use full-text search in your next project, you may still be unsure which engine to choose. One full-text search engine is Sphinx, and this post is a short course on installing it.

Sphinx is a full-text search engine, distributed under GPL version 2. It is fast not only at searching but also at indexing your data. Currently, the Sphinx API has bindings for PHP, Python, Perl, Ruby and Java.

Sphinx features

  • high indexing speed (up to 10 MB/sec on modern CPUs);
  • high search speed (avg query is under 0.1 sec on 2-4 GB text collections);
  • high scalability (up to 100 GB of text, up to 100 M documents on a single CPU);
  • provides good relevance ranking through combination of phrase proximity ranking and statistical (BM25) ranking;
  • provides distributed searching capabilities;
  • provides document excerpts generation;
  • provides searching from within MySQL through pluggable storage engine;
  • supports boolean, phrase, and word proximity queries;
  • supports multiple full-text fields per document (up to 32 by default);
  • supports multiple additional attributes per document (e.g. groups, timestamps, etc);
  • supports stopwords;
  • supports both single-byte encodings and UTF-8;
  • supports English stemming, Russian stemming, and Soundex for morphology;
  • supports MySQL natively (MyISAM and InnoDB tables are both supported);
  • supports PostgreSQL natively.

There you go, so fire up your terminal or console, and let’s get things done.

Installing sphinxsearch

  1. Download Sphinx from sphinxsearch.com. This tutorial uses Sphinx 0.9.8.1
    $wget http://sphinxsearch.com/downloads/sphinx-0.9.8.1.tar.gz
  2. Open your terminal and extract the archive
    $tar -xvf sphinx-0.9.8.1.tar.gz
  3. Sphinx needs the MySQL development package. On Ubuntu Linux, install it with
    $sudo apt-get install libmysqlclient15-dev
  4. Build and install Sphinx on your system
    $cd sphinx-0.9.8.1/
    $./configure
    $make
    $sudo make install

    Note: if you want to use Sphinx with PostgreSQL, pass the --with-pgsql argument to configure

    $./configure --with-pgsql
  5. Test your installation

    $search

    This should come up in your terminal

    Sphinx 0.9.8.1-release (r1533)
    Copyright (c) 2001-2008, Andrew Aksyonoff
    
    Usage: search [OPTIONS] <word1 [word2 [word3 [...]]]>
    
    Options are:
    -c, --config 	use given config file instead of defaults
    -i, --index 	search given index only (default: all indexes)
    -a, --any		match any query word (default: match all words)
    -b, --boolean		match in boolean mode
    -p, --phrase		match exact phrase
    -e, --extended		match in extended mode
    -f, --filter  	only match if attribute attr value is v
    -s, --sortby 	sort matches by 'CLAUSE' in sort_extended mode
    -S, --sortexpr 	sort matches by 'EXPR' DESC in sort_expr mode
    -o, --offset 	print matches starting from this offset (default: 0)
    -l, --limit 	print this many matches (default: 20)
    -q, --noinfo		dont print document info from SQL database
    -g, --group 	group by attribute named attr
    -gs,--groupsort 	sort groups by
    --sort=date		sort by date, descending
    --rsort=date		sort by date, ascending
    --sort=ts		sort by time segments
    --stdin			read query from stdin
    
    This program (CLI search) is for testing and debugging purposes only;
    it is NOT intended for production use.

Well done. You have Sphinx at your service. But before you can play with the full-text search engine you have just installed, you need to understand how Sphinx works.

Sphinx installs four programs in your environment (indexer, searchd, search and spelldump), but most of the time we will only use indexer, search and searchd. To begin, we have to create an index for our data source. Create a file named sphinx.conf; here is a sample of what it looks like.

source book
{
    type            = mysql
    sql_host        = localhost
    sql_user        = root
    sql_pass        = root
    sql_db          = library
    sql_port        = 3306 # optional, default is 3306
    sql_query       = SELECT id, title, summary, author FROM library_book
    sql_query_info  = SELECT * FROM library_book WHERE id=$id
}

index book
{
    source          = book
    path            = data/book
    docinfo         = extern
    charset_type    = sbcs
}

indexer
{
    mem_limit       = 32M
}

searchd
{
    port            = 3312
    log             = log/searchd.log
    query_log       = log/query.log
    read_timeout    = 5
    max_children    = 30
    pid_file        = log/searchd.pid
    max_matches     = 1000
}

For more information about Sphinx configuration, see the Sphinx documentation.

Create a log folder for the searchd log files and a data folder for the index data, then run indexer to index the database.
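Before running the indexer, it can help to check that every index block points at a source block that actually exists. The following is a minimal Python sketch under stated assumptions: it only understands the simple "kind name { key = value }" layout shown above, and the names parse_blocks, missing_sources and SAMPLE_CONF are illustrative, not part of Sphinx.

```python
import re

# A trimmed-down copy of the sample configuration, used only for illustration.
SAMPLE_CONF = """
source book
{
    type      = mysql
    sql_db    = library
    sql_port  = 3306 # optional
}

index book
{
    source    = book
    path      = data/book
}

indexer
{
    mem_limit = 32M
}
"""

def parse_blocks(text):
    """Return {(kind, name): {key: value}} for each 'kind name { ... }' block."""
    blocks = {}
    pattern = re.compile(r"(\w+)[ \t]*(\w*)\s*\{(.*?)\}", re.S)
    for kind, name, body in pattern.findall(text):
        settings = {}
        for line in body.splitlines():
            line = line.split("#", 1)[0].strip()  # drop trailing comments
            if "=" in line:
                key, value = line.split("=", 1)
                settings[key.strip()] = value.strip()
        blocks[(kind, name)] = settings
    return blocks

def missing_sources(blocks):
    """List (index name, source name) pairs whose source block is not defined."""
    sources = {name for kind, name in blocks if kind == "source"}
    return [
        (name, settings.get("source"))
        for (kind, name), settings in blocks.items()
        if kind == "index" and settings.get("source") not in sources
    ]

if __name__ == "__main__":
    print(missing_sources(parse_blocks(SAMPLE_CONF)))  # prints []
```

This sketch ignores features such as block inheritance and values that contain braces, so treat it as a sanity check rather than a real parser.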

$mkdir log
$mkdir data
$indexer --all
Sphinx 0.9.8.1-release(r1533)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file './sphinx.conf'...
indexing index 'book'...
collected 12 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 12 docs, 10319 bytes
total 0.018 sec, 571436.48 bytes/sec, 664.53 docs/sec

You can use the search program to test the index you have just created. Assuming your database has a book whose title contains PHP, running search PHP will return some results.

$search PHP

Sphinx Search Engine Performance

The following is a summary of some real-world data collected from the Sphinx query logs on a cluster of 15 servers. Each server runs its own copy of Sphinx, Apache, a busy web application, MySQL and miscellaneous services.

The dataset contains 453 million query log instances from 180 Sphinx indexes, collected over several months, using Sphinx version 0.9.8 on Linux kernel 2.6.18. The servers are all Dell PowerEdge 1950 with Quad Core Intel® Xeon® E5335, 2×4MB Cache, 2.0GHz, 1333MHz FSB, SATA drives, 7200rpm.

Keep in mind, though, that this is real world data and not a controlled test. This is how Sphinx performed in our environment, for the particular way we use Sphinx.

The graph below displays the response time distribution for all servers and all indexes, and shows, for example, that 60% of queries complete within 0.01 secs, 80% within 0.1 secs and 99% within 0.5 secs. Response times tend to occur in 3 bands (corresponding to the peaks in the frequency graph) – <0.001 sec, 0.03 sec and 0.3secs, which partly relates to the number of disk accesses required to fulfil a request. At 0.001 sec, all data is in memory, while at 0.3 secs, several disk accesses are occurring. Whilst the middle peak is not so obvious in this graph, the per-server or per-index graphs often have different distributions but still tend to have peaks at one or more of these three bands.
Sphinx Query Response Times Total for all servers, all indexes
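The cumulative figures quoted above (the share of queries that complete within a given time) can be computed from any list of response times. A minimal Python sketch, assuming the times have already been extracted from the query log; the sample values are invented for illustration and are not taken from the dataset:

```python
def share_within(times, thresholds):
    """Return {threshold: percent of queries that finished within it}."""
    total = len(times)
    return {
        limit: 100.0 * sum(1 for t in times if t <= limit) / total
        for limit in thresholds
    }

# Hypothetical response times in seconds.
sample = [0.001, 0.001, 0.002, 0.004, 0.008, 0.03, 0.03, 0.09, 0.3, 0.6]

if __name__ == "__main__":
    print(share_within(sample, [0.01, 0.1, 0.5]))
    # {0.01: 50.0, 0.1: 80.0, 0.5: 90.0}
```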

The next observation is that query word count affects performance, but not necessarily in proportion to the number of query words, as shown in the graph below. 1-4 word queries consistently offer best performance. The 6-50 words range is consistently the slowest, most likely because the chance of finding documents with multiple matches is high so there is extra ranking effort involved. Above 50, there is presumably a higher chance of having words with few matches, which speeds up the ranking process.
Sphinx Query Response Time by Query Word Count
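A breakdown like the one above can be approximated by bucketing queries on their word count and averaging the response time per bucket. This is a sketch under assumptions: words are split on whitespace, the bucket boundaries simply mirror the ranges mentioned in the text, and the sample queries are made up.

```python
from collections import defaultdict
from statistics import mean

def bucket(word_count):
    """Map a query word count to one of the ranges discussed in the text."""
    if word_count <= 4:
        return "1-4"
    if word_count == 5:
        return "5"
    if word_count <= 50:
        return "6-50"
    return ">50"

def mean_time_by_bucket(queries):
    """queries: iterable of (query_text, seconds) pairs."""
    grouped = defaultdict(list)
    for text, seconds in queries:
        grouped[bucket(len(text.split()))].append(seconds)
    return {name: mean(values) for name, values in grouped.items()}

if __name__ == "__main__":
    # Hypothetical (query, response time) pairs.
    sample = [("php", 0.002), ("full text search", 0.004), ("a b c d e f g", 0.2)]
    print(mean_time_by_bucket(sample))
```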

Finally, we see that the size of the inverted index (.spd files) also affects performance. The three graphs below show how the response time distribution tends to move to the right as the index size increases. The larger the index, the higher the chance that data will need to be re-read from disk (rather than from Sphinx-internal or system buffers/cache), hence this is not unexpected.
Sphinx Query Response Times for Index Sizes 1MB - 3MB
Sphinx Query Response Times for Index Sizes 3MB - 30MB
Sphinx Query Response Times for Index Sizes >30MB

Here is a PDF summary of Sphinx performance for this dataset, including many additional graphs of the data by server and by index.

Install Sphinx on Ubuntu

apt-get update
apt-get install libmysql++-dev make gcc g++

Go to a directory of your choosing

Download the Sphinx source tarball (latest as of this writing)
http://sphinxsearch.com/downloads/sphinx-0.9.8.1.tar.gz

tar zvxf sphinx-0.9.8.1.tar.gz

cd sphinx-0.9.8.1

./configure

make

make install (you will need superuser rights for this last step)

The binaries will be installed in /usr/local/bin

/usr/bin/install -c 'indexer' '/usr/local/bin/indexer'
/usr/bin/install -c 'searchd' '/usr/local/bin/searchd'
/usr/bin/install -c 'search' '/usr/local/bin/search'
/usr/bin/install -c 'spelldump' '/usr/local/bin/spelldump'

Some sample config files will be placed in /usr/local/etc

/usr/bin/install -c -m 644 'sphinx.conf.dist' '/usr/local/etc/sphinx.conf.dist'
/usr/bin/install -c -m 644 'sphinx-min.conf.dist' '/usr/local/etc/sphinx-min.conf.dist'

Or, to tell configure explicitly where the MySQL headers and libraries are:

tar xvzf sphinx-0.9.8.1.tar.gz
cd sphinx-0.9.8.1/
./configure --with-mysql-includes=/usr/include/mysql --with-mysql-libs=/usr/lib/mysql

Make and Install Sphinx

Run the standard Linux commands to install Sphinx.

make
sudo make install

MVA (multi-valued attributes)

MVAs, or multi-valued attributes, are an important special type of per-document attributes in Sphinx. MVAs make it possible to attach lists of values to every document. They are useful for article tags, product categories, etc. Filtering and group-by (but not sorting) on MVA attributes is supported.

Currently, MVA list entries are limited to unsigned 32-bit integers. The list length is not limited, you can have an arbitrary number of values attached to each document as long as RAM permits (.spm file that contains the MVA values will be precached in RAM by searchd). The source data can be taken either from a separate query, or from a document field; see source type in sql_attr_multi. In the first case the query will have to return pairs of document ID and MVA values, in the second one the field will be parsed for integer values. There are absolutely no requirements as to incoming data order; the values will be automatically grouped by document ID (and internally sorted within the same ID) during indexing anyway.

When filtering, a document will match the filter on MVA attribute if any of the values satisfy the filtering condition. (Therefore, documents that pass through exclude filters will not contain any of the forbidden values.) When grouping by MVA attribute, a document will contribute to as many groups as there are different MVA values associated with that document. For instance, if the collection contains exactly 1 document having a ‘tag’ MVA with values 5, 7, and 11, grouping on ‘tag’ will produce 3 groups with ‘@count’ equal to 1 and ‘@groupby’ key values of 5, 7, and 11 respectively. Also note that grouping by MVA might lead to duplicate documents in the result set: because each document can participate in many groups, it can be chosen as the best one in more than one group, leading to duplicate IDs. The PHP API historically uses an ordered hash on the document ID for the resulting rows, so you’ll also need to use SetArrayResult() in order to employ group-by on MVA with the PHP API.
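The filtering and grouping rules above can be modelled in a few lines of Python. This is only an in-memory illustration of the semantics, not how Sphinx implements them; the function names and the docs mapping are hypothetical.

```python
from collections import Counter

# The single document from the example above, with its 'tag' MVA values.
docs = {1: [5, 7, 11]}

def matches_filter(values, allowed):
    """An include filter matches if ANY of the document's values is allowed."""
    return any(v in allowed for v in values)

def passes_exclude(values, forbidden):
    """An exclude filter passes only if NONE of the values is forbidden."""
    return not any(v in forbidden for v in values)

def group_counts(docs):
    """Each document contributes once to every distinct MVA value it carries."""
    counts = Counter()
    for values in docs.values():
        for value in set(values):
            counts[value] += 1
    return dict(counts)

if __name__ == "__main__":
    print(matches_filter(docs[1], {7, 99}))   # True: 7 is among the tags
    print(passes_exclude(docs[1], {7, 99}))   # False: 7 is forbidden
    print(group_counts(docs))                 # {5: 1, 7: 1, 11: 1} in some order
```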

sql_attr_multi

Multi-valued attribute (MVA) declaration. Multi-value (ie. there may be more than one such attribute declared), optional. Applies to SQL source types (mysql, pgsql, mssql) only.

Plain attributes allow only one value to be attached to each document. However, there are cases (such as tags or categories) where it is desirable to attach multiple values of the same attribute and to apply filtering or grouping to the value lists.

The declaration format is as follows (backslashes are for clarity only; everything can be declared in a single line as well):

sql_attr_multi = ATTR-TYPE ATTR-NAME 'from' SOURCE-TYPE \
	[;QUERY] \
	[;RANGE-QUERY]

where

  • ATTR-TYPE is ‘uint’ or ‘timestamp’
  • SOURCE-TYPE is ‘field’, ‘query’, or ‘ranged-query’
  • QUERY is SQL query used to fetch all ( docid, attrvalue ) pairs
  • RANGE-QUERY is SQL query used to fetch min and max ID values, similar to ‘sql_query_range’

 

Example:
sql_attr_multi = uint tag from query; SELECT id, tag FROM tags
sql_attr_multi = uint tag from ranged-query; \
	SELECT id, tag FROM tags WHERE id>=$start AND id<=$end; \
	SELECT MIN(id), MAX(id) FROM tags
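To make the ‘query’ source type more concrete: the query returns ( docid, attrvalue ) pairs in any order, and indexing groups them by document ID and sorts the values within each ID. A rough Python model of that step follows; collect_mva and the sample rows are hypothetical and stand in for the rows a query such as SELECT id, tag FROM tags might return.

```python
from collections import defaultdict

def collect_mva(pairs):
    """Group (doc_id, value) pairs by doc_id and sort the values per document."""
    grouped = defaultdict(list)
    for doc_id, value in pairs:
        grouped[doc_id].append(value)
    return {doc_id: sorted(values) for doc_id, values in grouped.items()}

if __name__ == "__main__":
    # Rows arrive in arbitrary order.
    rows = [(2, 9), (1, 11), (1, 5), (2, 3), (1, 7)]
    print(collect_mva(rows))
```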



SetArrayResult

Prototype: function SetArrayResult ( $arrayresult )

PHP specific. Controls matches format in the search results set (whether matches should be returned as an array or a hash). $arrayresult argument must be boolean. If $arrayresult is false (the default mode), matches will be returned in PHP hash format with document IDs as keys, and other information (weight, attributes) as values. If $arrayresult is true, matches will be returned as a plain array with complete per-match information including document ID.

Introduced along with GROUP BY support on MVA attributes. Group-by-MVA result sets may contain duplicate document IDs. Thus they need to be returned as plain arrays, because hashes will only keep one entry per document ID.
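The reason hashes lose group-by-MVA rows is easy to see with a small Python analogy, where a dict plays the role of the PHP hash and a list the role of the plain array. The match layout below (id, weight, attrs) is an assumed shape for illustration only.

```python
# Three result rows for the same document, one per 'tag' group.
matches = [
    {"id": 1, "weight": 1, "attrs": {"@groupby": 5, "@count": 1}},
    {"id": 1, "weight": 1, "attrs": {"@groupby": 7, "@count": 1}},
    {"id": 1, "weight": 1, "attrs": {"@groupby": 11, "@count": 1}},
]

# Keyed by document ID: later rows overwrite earlier ones.
as_hash = {match["id"]: match for match in matches}

# Plain array: every row survives, duplicates included.
as_array = list(matches)

if __name__ == "__main__":
    print(len(as_hash))   # 1
    print(len(as_array))  # 3
```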

Install Sphinx Search in MediaWiki

As MediaWiki-based site administrators, one of the most common complaints we receive is that the default search engine is far from excellent. In our day and age where Google sets the standard for search engine capabilities, users aren’t happy with a basic search engine. They need, or should I say demand a faster, easier, better engine.
The Sphinx Search Engine seems to promise exactly that; a full text search engine that is both flexible and fast. This extension incorporates the Sphinx engine into MediaWiki to provide a better alternative for searching. The extension can be installed in one of two modes:

  1. Provide an additional Special Page for searching using Sphinx. This method is excellent for providing a method for evaluating the performance of the extension while still maintaining the default search engine.
  2. Completely replace the built in search engine with the sphinx search engine.

This extension is very similar in nature to Extension:LuceneSearch and Extension:Hyper Estraier. The main difference is obviously the search engine backend. SphinxSearch extension also adds some additional features like “Did You Mean” suggestions for misspelled searches. This functionality is fundamentally different from Extension:DidYouMean which only suggests alternate article names for existing articles. Also, SphinxSearch can be easily evaluated before rolling it out as a complete replacement search engine.

Compatibility

This extension has been shown to work / not work with the following MediaWiki versions. Please add more successes and failures to this list

  • 1.6.? — Fails – The guy who tested old version of Mediawiki – (Pessoft)
  • 1.8.? — Fails – The guy who uses old version of Mediawiki – (125.17.142.146)
  • 1.9.3 – Works – (Gri6507)
  • 1.10 – Works – (Svemir)
  • 1.11 – Works – (125.17.142.146, Svemir)
  • 1.12 – Works – (80.152.175.189 (Windows/IIS), Svemir)
  • 1.13.0 – Works – (Erik Gregg), Thanks guys! Nice Job! It works on Wikipedia!
  • 1.13.2 – Works – (Jipipayo)
  • 1.13.3 – Works – (RADION Openlab), Kamil Wencel, thanks works well on our new testsite LAMP + sphinx 0.9.8.1
  • 1.14 – Works – (130.234.189.190), works great on our WIMP
  • 1.15 – Works – works great on CentOS 5 LAMP server

The extension has been shown to work / not work with the following Sphinx versions. Please add more successes and failures to this list

  • 0.9.6rc1 – Does not work – (125.17.142.146)
  • 0.9.7 – Works – (Gri6507)
  • 0.9.8svn – Works – (Svemir)
  • 0.9.8svn-r1112 (Jan 28, 2008 snapshot) – Does not work for 130.234.189.190, but it works for Svemir
  • 0.9.8-rc2 r1234 – Works – (Gmoyle, 80.152.175.189 (Windows/IIS))
  • 0.9.8 – Works – (Svemir)
  • 0.9.8.1 – Works – (130.234.189.190), works great on our WIMP

The extension has been shown to work / not work with the following languages. Main problem may be that it cannot separate the words and the phrases.

  • English – Works – all versions – (Alpha3)
  • Chinese – Works on win2003 wamp 1.7.3 – all versions – (Alpha3)
  • Please see this post in Sphinx forums for details.
  • German – Works on W2k3 and IIS – (80.152.175.189)
  • Russian – Works (XAMPP, Debian) – StasFomin.
  • Hebrew – Works on W2K3 and IIS – CrushKing.

Installation Instructions

Step 1 – Install Sphinx

Download Sphinx Search Engine. Follow the installation instructions. You only need to do the actual installation, which means you do not need to do the “Quick Sphinx usage tour”. You can verify your installation by following the rest of the steps here. Note: if installing on a Windows server, you do not need to compile anything; just download the Win32 release binaries.

Step 2 – Configure Sphinx

Download and extract the extension to a temporary directory. Copy the sphinx.conf file from this download to some directory (we will refer to this file as “/path/to/sphinx.conf” below.) This directory should not be web-accessible, so you should not use the extensions folder. Make sure to adjust all values to suit your setup:

  • Set correct database, username, and password for your MediaWiki database
  • Update table names in SQL queries if your MediaWiki installation uses a prefix (backslash line breaks may need to be removed if the indexer step below fails)
  • Update the file paths (/var/data/sphinx/…, /var/log/sphinx/…) and create folders as necessary
  • If your wiki is very large, you may want to consider specifying a query range in the conf file.
  • If your wiki is not in English, you will need to change (or remove) the morphology attribute.

Note: To give credit where credit is due, we must thank the author of this excellent article for providing an excellent starting point on configuring this file.

Step 3 – Run Sphinx Indexer

Run the sphinx indexer to prepare for searching:

/path/to/sphinx/installation/indexer --config /path/to/sphinx.conf --all

Once again, make sure to replace the paths to match your installation. This process is actually pretty fast, but clearly depends on how large your wiki is. Just be patient and watch the screen for updates.

Step 4 – Test Out Sphinx

When the indexer is finished, test that sphinx searching is actually working:

/path/to/sphinx/installation/search --config /path/to/sphinx.conf "search string"

You will see the result stats immediately (Sphinx is FAST.) Note that the article data you see at this point comes from the sql_query_info in sphinx.conf file. In the extension we can get to the actual article content because we have text old_id available as an extra attribute. It would be slow to fetch article content on the command line (we would have to join page, revision, and text tables,) so we just fetch page_title and page_namespace at this point.

Step 5 – Start Sphinx Daemon

To speed up searching on the wiki, we must run Sphinx in daemon mode. Add the following to whatever server startup script you have access to (e.g. /etc/rc.local):

/path/to/sphinx/installation/searchd --config /path/to/sphinx.conf &

Note: without the daemon running, searching will not work. That is why it is critical to make sure the daemon process is started every time the server is restarted.

Step 6 – Configure Incremental Updates

To keep the index for the search engine up to date, the indexer must be scheduled to run at a regular interval. On most UNIX systems edit your crontab file by running the command:

crontab -e

Add this line to set up a cron job for the full index – for example once every night:

0 3 * * * /path/to/sphinx/installation/indexer --quiet --config /path/to/sphinx.conf wiki_main --rotate >/dev/null 2>&1

Add this line to set up a more frequent cron to update the smaller index regularly:

0 9,15,21 * * * /path/to/sphinx/installation/indexer --quiet --config /path/to/sphinx.conf wiki_incremental --rotate >/dev/null 2>&1

As before, make sure to adjust the paths to suit your configuration. Note that the --rotate option is needed if the searchd daemon is already running, so that the indexer does not modify the index file while it is being used. It creates a new file and copies it over the existing one when it is done.
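The "build a new file, then swap it in" idea behind rotation is a general pattern and can be sketched in Python. To be clear, this is not Sphinx's own mechanism or code: publish_atomically and the file names are hypothetical, and the sketch only shows why readers never see a half-written file.

```python
import os
import tempfile

def publish_atomically(path, data):
    """Write data to a temporary file next to path, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".new")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # os.replace swaps the file in a single step on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "wiki_main.idx")  # hypothetical file name
        publish_atomically(target, b"first build")
        publish_atomically(target, b"second build")
        with open(target, "rb") as handle:
            print(handle.read())  # b'second build'
```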

Step 7 – Extension Preparation – Sphinx PHP API

Create the extensions/SphinxSearch directory and copy the Sphinx API file, sphinxapi.php, there. This file is part of the Sphinx source code, under the api/ directory. Note: if you installed Sphinx from a Win32 binary release, it may not have come with a copy of sphinxapi.php. You must download either the source code package or an API update package. Just use your favorite uncompress utility (e.g. WinZip) and extract only sphinxapi.php to the extensions directory; the other files can be ignored.

Step 8 – Extension Preparation – Mediawiki Extension Functions

Download ExtensionFunctions.php from SVN and copy it to your extensions/SphinxSearch directory.

 svn export http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/ExtensionFunctions.php

Step 9 – Extension Installation – PHP Files

Copy all remaining files (SphinxSearch.php, SphinxSearch_body.php, SphinxSearch.js, SphinxSearch_PersonalDict.php, SphinxSearch_spell.php, spinner.gif) from the temporary directory you extracted the code to in Step 2 to your extensions/SphinxSearch directory.

Step 10 – Extension Installation – Local Settings

Add the following text to your LocalSettings.php

require_once( "$IP/extensions/SphinxSearch/SphinxSearch.php" );

Configuration

Options

For the most part, the extension's default options do not need any modification. However, if tweaking is needed/desired, there are a number of configuration options that could be configured from LocalSettings.php or from SphinxSearch.php directly. Those are:
  • $wgSphinxSearch_host - the hostname on which sphinx's searchd daemon is running (default to localhost)
  • $wgSphinxSearch_port - the port number on which sphinx's searchd daemon is running (default to 3312)
  • $wgSphinxSearch_mode - the Sphinx search mode. The default mode is the most intuitive. See Sphinx documentation for other valid options.
  • $wgSphinxSearch_matches - the number of search hits to display per result page.
  • $wgSphinxSearch_weights - the way Sphinx orders the results. The default is pretty good. See Sphinx documentation for other valid options.
  • $wgSphinxSearch_groupby, $wgSphinxSearch_groupsort - define how to group the results. See Sphinx documentation for other valid options.
  • $wgSphinxSearch_sortby - set matches sorting mode (default to SPH_SORT_RELEVANCE). See Sphinx documentation for other valid options.
When setting these options in LocalSettings.php, make sure to do so after the call to require_once() for this extension.

Mode Of Operation

By default, this extension will run so as not to overwrite the built-in search engine, but instead provide a new Special Page called Search Wiki Using Sphinx. This allows the users to evaluate this extension by directly comparing the search results of the built-in search vs. Sphinx search. If the performance is deemed acceptable to replace the built-in search engine, this extension can easily be configured to act as the default search engine. To do so, modify SphinxSearch.php (not LocalSettings.php) to uncomment the lines containing
$wgDisableInternalSearch = true;
$wgDisableSearchUpdate = true;
$wgSearchType = 'SphinxSearch';

Now, the standard search method will use Sphinx by default. Note: when used in this way, the extension preserves the functionality of the Go and Search buttons.

Did You Mean

When performing a search and the search query is misspelled, the search results could be greatly impaired. Without knowing about the misspelling, it may take the user a while to figure out why their search results are not very good. That is why this extension has an optional "Did You Mean" support. When enabled, this feature will suggest a properly spelled search query for the user in case of a spelling mistake. Also, since many wikis utilize their own jargon, in order to make the "Did You Mean" suggestions more reasonable, this extension can optionally utilize a personalized dictionary.

The spell checking capability is provided via one of two methods.

  1. Aspell - a command line program for performing spell checking
  2. Pspell - PHP native interface to aspell

If you are using Ubuntu or Debian, you can easily install Pspell in two steps. First install the package:

 sudo apt-get install php5-pspell

Then restart your apache:

sudo /etc/init.d/apache2 restart

The Did You Mean feature is turned off by default because it requires a spell checker and some configuration. To enable it, edit SphinxSearch.php and uncomment the line containing

$wgSphinxSuggestMode = true;

This will automatically pick whichever method of interacting with the spellchecker utility is more efficient. If your wiki server does not have Pspell support, then specify the path to the Aspell executable by editing the line containing

$wgSphinxSearchAspellPath = "/usr/bin/aspell";

If for whatever reason the Aspell dictionary files on the server are not in the default location, you can specify the proper path to the dictionary files by setting

$wgSphinxSearchPspellDictionaryDir = "/usr/lib/aspell";

If using a personalized dictionary, edit the line containing

$wgSphinxSearchPersonalDictionary = dirname( __FILE__ ) . "/personal.en.pws";

to point to where you’d like to keep the dictionary file.
When the Did You Mean feature is enabled and is configured to use a personal dictionary file, then the next step is to add contents to this dictionary. SphinxSearch will create a new restricted access special page called Wiki-specific Sphinx search spellcheck dictionary. This page is only accessible by users with DELETE permissions (typically PowerUser and SysOp groups). These users can utilize this page to view the words already in the dictionary, add words into the dictionary, and remove words from the dictionary.
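To illustrate how a personal dictionary changes the suggestions, here is a small Python sketch. It uses the standard library's difflib as a stand-in for Aspell/Pspell, so its matching behaviour will differ from the real extension; the suggest function, the word lists and the 0.8 cutoff are all assumptions made for the example.

```python
import difflib

def suggest(query, dictionary, personal=()):
    """Return a corrected query, or None if every word is already known."""
    known = {w.lower() for w in dictionary} | {w.lower() for w in personal}
    vocabulary = sorted(known)
    corrected = []
    changed = False
    for word in query.lower().split():
        if word in known:
            corrected.append(word)
            continue
        close = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
        if close:
            corrected.append(close[0])
            changed = True
        else:
            corrected.append(word)  # no good candidate, keep the word as typed
    return " ".join(corrected) if changed else None

if __name__ == "__main__":
    base = ["sphinx", "installation", "search"]
    print(suggest("sphinx instalation", base))
    # Wiki-specific jargon is only recognised via the personal word list.
    print(suggest("mediwiki search", base, personal=["MediaWiki"]))
```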

Stop Words

When modifying the sphinx.conf file (see Step 2 – Configure Sphinx), there is an option for specifying a file containing search stop words. Stop words are common words like ‘a’ and ‘the’ that appear frequently in text and should be ignored when searching. A somewhat complete list of English stop words can be found here. Simply copy those words into a text file, and modify your sphinx.conf to point to that file with

stopwords = /path/to/stopwords.txt
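The effect of a stop word list can be previewed with a few lines of Python. This sketch assumes the file is plain text with words separated by whitespace and uses simple lowercase, whitespace-based tokenizing, which is cruder than what Sphinx does; load_stopwords and strip_stopwords are hypothetical helper names.

```python
def load_stopwords(text):
    """Build a set of stop words from whitespace-separated text."""
    return {word.lower() for word in text.split()}

def strip_stopwords(query, stopwords):
    """Return the query words that would still take part in the search."""
    return [word for word in query.lower().split() if word not in stopwords]

if __name__ == "__main__":
    stopwords = load_stopwords("a the of")
    print(strip_stopwords("The history of the Sphinx", stopwords))
    # ['history', 'sphinx']
```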

ToDo

  • Use auto-load and make other performance improvements.
  • Add “ignore” checkbox to category search (so only articles that do not have that category are returned.)
  • Additional search options (exact match, etc.)
  • Add image thumbnails to image matches.
  • Smarter handling of multiple Sphinx index files.
  • Assign weights to namespaces.
  • Sort the results in SPH_SORT_EXTENDED mode by @relevance and by number of times the page has been viewed (available from wiki database). The idea behind this is that given two pages that have the same relevance to the search, if one has been viewed more times, there is probably a reason for that. Number of links to each page could also be included in the calculation.
  • Use existing titles in “did you mean” suggestions.
  • If originally “Go” was clicked, and “did you mean” link results in a direct match, redirect to that page.
  • Easier install of the extension. Perhaps a script?

Completed ToDos

We use SPH_MATCH_EXTENDED for better relevance weights, but we process the search term to make it assume an OR instead of an AND between multiple words. This will be replaced with an option on the search form.

  • Add the “did you mean” functionality to the search results.
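The OR rewriting mentioned above can be pictured as joining the query words with the extended-syntax OR operator (|). The sketch below is an assumption about the general idea, not the extension's actual code, and to_or_query is a hypothetical name; it does no escaping of special characters.

```python
def to_or_query(term):
    """Turn 'full text search' into 'full | text | search'."""
    return " | ".join(term.split())

if __name__ == "__main__":
    print(to_or_query("full text search"))  # full | text | search
```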

Sphinx (SQL Phrase Index) Introduction and Installation

Sphinx is a full-text search engine, distributed under GPL version 2. Commercial licensing (eg. for embedded use) is also available upon request.

Generally, it’s a standalone search engine, meant to provide fast, size-efficient and relevant full-text search functions to other applications. Sphinx was specially designed to integrate well with SQL databases and scripting languages.

Currently built-in data source drivers support fetching data either via direct connection to MySQL, or PostgreSQL, or from a pipe in a custom XML format. Adding new drivers (eg. to natively support some other DBMSes) is designed to be as easy as possible.

Search API is natively ported to PHP, Python, Perl, Ruby, Java, and also available as a pluggable MySQL storage engine. API is very lightweight so porting it to new language is known to take a few hours.

As for the name, Sphinx is an acronym which is officially decoded as SQL Phrase Index. Yes, I know about CMU’s Sphinx project.

Sphinx features:

  • high indexing speed (up to 10 MB/sec on modern CPUs);
  • high search speed (avg query is under 0.1 sec on 2-4 GB text collections);
  • high scalability (up to 100 GB of text, up to 100 M documents on a single CPU);
  • provides good relevance ranking through combination of phrase proximity ranking and statistical (BM25) ranking;
  • provides distributed searching capabilities;
  • provides document excerpts generation;
  • provides searching from within MySQL through pluggable storage engine;
  • supports boolean, phrase, and word proximity queries;
  • supports multiple full-text fields per document (up to 32 by default);
  • supports multiple additional attributes per document (e.g. groups, timestamps, etc);
  • supports stopwords;
  • supports both single-byte encodings and UTF-8;
  • supports English stemming, Russian stemming, and Soundex for morphology;
  • supports MySQL natively (MyISAM and InnoDB tables are both supported);
  • supports PostgreSQL natively.

Where to get Sphinx

Sphinx is available through its official Web site at http://www.sphinxsearch.com/.

Currently, Sphinx distribution tarball includes the following software:

  • indexer: a utility which creates fulltext indexes;
  • search: a simple command-line (CLI) test utility which searches through fulltext indexes;
  • searchd: a daemon which enables external software (eg. Web applications) to search through fulltext indexes;
  • sphinxapi: a set of searchd client API libraries for popular Web scripting languages (PHP, Python, Perl, Ruby).
  • spelldump: a simple command-line tool to extract the items from an ispell or MySpell (as bundled with OpenOffice) format dictionary to help customize your index, for use with wordforms.
  • indextool: a utility to dump miscellaneous debug information about the index, added in version 0.9.9-rc2.

Installation

Supported systems

Most modern UNIX systems with a C++ compiler should be able to compile and run Sphinx without any modifications.

Currently known systems Sphinx has been successfully running on are:

  • Linux 2.4.x, 2.6.x (various distributions)
  • Windows 2000, XP
  • FreeBSD 4.x, 5.x, 6.x
  • NetBSD 1.6, 3.0
  • Solaris 9, 11
  • Mac OS X

CPU architectures known to work include X86, X86-64, SPARC64.

I hope Sphinx will work on other Unix platforms as well. If the platform you run Sphinx on is not in this list, please do report it.

At the moment, the Windows version of Sphinx is not intended to be used in production, but rather for testing and debugging only. The two most prominent issues are missing concurrent queries support (client queries are stacked on TCP connection level instead), and missing index data rotation support. There are successful production installations which work around these issues. However, running a high-volume search service under Windows is still not recommended.

Required tools

On UNIX, you will need the following tools to build and install Sphinx:

  • a working C++ compiler. GNU gcc is known to work.
  • a good make program. GNU make is known to work.
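
Before running configure, it can help to confirm that both tools are actually on your PATH. The following is a minimal sketch, not part of the Sphinx distribution; the helper name check_tool is hypothetical, and it assumes the compiler is installed as g++ and the make program as make.

```shell
#!/bin/sh
# Report whether a build tool is on PATH.
# Prints "found: NAME" and returns 0, or prints "missing: NAME" and returns 1.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check the two tools named in the list above (binary names are assumptions).
missing=0
for tool in g++ make; do
    check_tool "$tool" || missing=1
done

[ "$missing" -eq 0 ] || echo "install the missing tools before running ./configure"
```

If either tool is reported missing, install it through your system's package manager first.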

On Windows, you will need Microsoft Visual C/C++ Studio .NET 2003 or 2005. Other compilers/environments will probably work as well, but for the time being, you will have to create the makefile (or other environment-specific project files) manually.

Installing Sphinx on Linux

  1. Extract everything from the distribution tarball (haven’t you already?) and go to the sphinx subdirectory:

    $ tar xzvf sphinx-0.9.8.tar.gz
    $ cd sphinx

  2. Run the configuration program:

    $ ./configure

    There are a number of options to configure. The complete listing may be obtained by using the --help switch. The most important ones are:

    • --prefix, which specifies where to install Sphinx; such as --prefix=/usr/local/sphinx (all of the examples use this prefix)
    • --with-mysql, which specifies where to look for MySQL include and library files, if auto-detection fails;
    • --with-pgsql, which specifies where to look for PostgreSQL include and library files.
  3. Build the binaries:

    $ make

  4. Install the binaries in the directory of your choice: (defaults to /usr/local/bin/ on *nix systems, but is overridden with configure --prefix)

    $ make install
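
The four steps above can be strung together in one small script. This is only a sketch under stated assumptions: the variable names PREFIX and MYSQL_DIR and the helper configure_args are hypothetical, and it assumes that --with-mysql accepts a directory value, as its description above suggests. When no configure script is present in the current directory, it only prints the command it would run.

```shell
#!/bin/sh
# Install location; /usr/local/sphinx is the prefix used throughout this guide.
PREFIX="${PREFIX:-/usr/local/sphinx}"
# Hypothetical: MySQL install directory; leave empty to rely on auto-detection.
MYSQL_DIR="${MYSQL_DIR:-}"

# Print the argument list for ./configure on a single line.
configure_args() {
    printf '%s' "--prefix=$PREFIX"
    if [ -n "$MYSQL_DIR" ]; then
        # Assumption: --with-mysql takes the directory to search.
        printf ' %s' "--with-mysql=$MYSQL_DIR"
    fi
    printf '\n'
}

if [ -x ./configure ]; then
    # Inside the extracted source tree: configure, build, install.
    # The unquoted expansion is deliberate so the arguments are split;
    # it will not cope with paths that contain spaces.
    ./configure $(configure_args) && make && make install
else
    echo "would run: ./configure $(configure_args)"
fi
```

Run it from the sphinx source directory; installing into a system-wide prefix will usually need root privileges for the make install step.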

Installing Sphinx on Windows

Installing Sphinx on a Windows server is often easier than installing on a Linux environment; unless you are preparing code patches, you can use the pre-compiled binary files from the Downloads area on the website.

  1. Extract everything from the .zip file you have downloaded: sphinx-0.9.8-win32.zip (or sphinx-0.9.8-win32-pgsql.zip if you need PostgreSQL support as well). You can use Windows Explorer in Windows XP and up to extract the files, or a freeware package like 7Zip to open the archive.

    For the remainder of this guide, we will assume that the folders are unzipped into C:\Sphinx, such that searchd.exe can be found in C:\Sphinx\bin\searchd.exe. If you decide to use a different location for the folders or the configuration file, please change the paths accordingly.
  2. Install the searchd system as a Windows service:

    C:\Sphinx> C:\Sphinx\bin\searchd --install --config C:\Sphinx\sphinx.conf --servicename SphinxSearch

  3. The searchd service will now be listed in the Services panel within the Management Console, available from Administrative Tools. It will not have been started, as you will need to configure it and build your indexes with indexer before starting the service. A guide to do this can be found under Quick tour.

Known installation issues

If configure fails to locate MySQL headers and/or libraries, try checking for and installing mysql-devel package. On some systems, it is not installed by default.

If make fails with a message which looks like

/bin/sh: g++: command not found
make[1]: *** [libsphinx_a-sphinx.o] Error 127

try checking for and installing gcc-c++ package.

If you are getting compile-time errors which look like

sphinx.cpp:67: error: invalid application of `sizeof' to
    incomplete type `Private::SizeError<false>'

this means that some compile-time type size check failed. The most probable reason is that the off_t type is less than 64 bits wide on your system. As a quick hack, you can edit sphinx.h and replace off_t with DWORD in the typedef for SphOffset_t, but note that this will prohibit you from using full-text indexes larger than 2 GB. Even if the hack helps, please report such issues, providing the exact error message and compiler/OS details, so I can properly fix them in the next releases.

If you keep getting any other error, or the suggestions above do not seem to help you, please don’t hesitate to contact me.

Quick Sphinx usage tour

All the example commands below assume that you installed Sphinx in /usr/local/sphinx, so searchd can be found in /usr/local/sphinx/bin/searchd.

To use Sphinx, you will need to:

  1. Create a configuration file. The default configuration file name is sphinx.conf. All Sphinx programs look for this file in the current working directory by default.

    Sample configuration file, sphinx.conf.dist, which has all the options documented, is created by configure. Copy and edit that sample file to make your own configuration: (assuming Sphinx is installed into /usr/local/sphinx/)

    $ cd /usr/local/sphinx/etc
    $ cp sphinx.conf.dist sphinx.conf
    $ vi sphinx.conf

    The sample configuration file is set up to index the documents table from the MySQL database test, so there is an example.sql sample data file to populate that table with a few documents for testing purposes:

    $ mysql -u test < /usr/local/sphinx/etc/example.sql

  2. Run the indexer to create full-text index from your data:

    $ cd /usr/local/sphinx/etc
    $ /usr/local/sphinx/bin/indexer

  3. Query your newly created index!

To query the index from command line, use search utility:

$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/search test
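
To give a feel for what the configuration file from step 1 contains, here is a heavily trimmed sketch written out by a small shell script. It is an illustration, not the shipped sphinx.conf.dist: the output path, the source and index names (src1, test1), the column names in sql_query, the port number, and the file locations under /usr/local/sphinx/var are all assumptions. The option names themselves are taken from the sphinx.conf options reference.

```shell
#!/bin/sh
# Write a minimal sphinx.conf sketch to a scratch location (hypothetical path).
CONF="${CONF:-${TMPDIR:-/tmp}/sphinx.conf.example}"

cat > "$CONF" <<'EOF'
# Where the documents come from: the "documents" table in MySQL database "test".
# Column names below are assumptions about the sample data.
source src1
{
    type      = mysql
    sql_host  = localhost
    sql_user  = test
    sql_pass  =
    sql_db    = test
    sql_query = SELECT id, title, content FROM documents
}

# The full-text index built from that source.
index test1
{
    source = src1
    path   = /usr/local/sphinx/var/data/test1
}

# Settings used by the search daemon (port number is an assumption).
searchd
{
    port     = 3312
    log      = /usr/local/sphinx/var/log/searchd.log
    pid_file = /usr/local/sphinx/var/log/searchd.pid
}
EOF

echo "wrote $CONF"

# With Sphinx installed, the index could then be built and queried with:
#   /usr/local/sphinx/bin/indexer --config "$CONF" --all
#   /usr/local/sphinx/bin/search  --config "$CONF" test
```

Passing --config explicitly avoids depending on the current working directory, which is why the commands in this tour change into /usr/local/sphinx/etc first.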

To query the index from your PHP scripts, you need to:

  1. Run the search daemon which your script will talk to:

    $ cd /usr/local/sphinx/etc
    $ /usr/local/sphinx/bin/searchd

  2. Run the attached PHP API test script (to ensure that the daemon was successfully started and is ready to serve queries):

    $ cd sphinx/api
    $ php test.php test

  3. Include the API (it’s located in api/sphinxapi.php) into your own scripts and use it.

Happy searching!

Help Links :

komunitasweb.com/2009/03/sphinxsearch-introduction/

http://www.sphinxsearch.com/docs/current.html#intro

http://www.mysqlperformanceblog.com/?s=sphinx+tutorial+mysql

wordpress.org/extend/plugins/search/other_notes/


Documentation

Sphinx 0.9.9 reference manual

Table of Contents

1. Introduction
1.1. About
1.2. Sphinx features
1.3. Where to get Sphinx
1.4. License
1.5. Author and contributors
1.6. History
2. Installation
2.1. Supported systems
2.2. Required tools
2.3. Installing Sphinx on Linux
2.4. Installing Sphinx on Windows
2.5. Known installation issues
2.6. Quick Sphinx usage tour
3. Indexing
3.1. Data sources
3.2. Attributes
3.3. MVA (multi-valued attributes)
3.4. Indexes
3.5. Restrictions on the source data
3.6. Charsets, case folding, and translation tables
3.7. SQL data sources (MySQL, PostgreSQL)
3.8. xmlpipe data source
3.9. xmlpipe2 data source
3.10. Live index updates
3.11. Index merging
4. Searching
4.1. Matching modes
4.2. Boolean query syntax
4.3. Extended query syntax
4.4. Weighting
4.5. Sorting modes
4.6. Grouping (clustering) search results
4.7. Distributed searching
4.8. searchd query log format
4.9. MySQL protocol support and SphinxQL
5. Command line tools reference
5.1. indexer command reference
5.2. searchd command reference
5.3. search command reference
5.4. spelldump command reference
5.5. indextool command reference
6. API reference
6.1. General API functions
6.1.1. GetLastError
6.1.2. GetLastWarning
6.1.3. SetServer
6.1.4. SetRetries
6.1.5. SetConnectTimeout
6.1.6. SetArrayResult
6.1.7. IsConnectError
6.2. General query settings
6.2.1. SetLimits
6.2.2. SetMaxQueryTime
6.2.3. SetOverride
6.2.4. SetSelect
6.3. Full-text search query settings
6.3.1. SetMatchMode
6.3.2. SetRankingMode
6.3.3. SetSortMode
6.3.4. SetWeights
6.3.5. SetFieldWeights
6.3.6. SetIndexWeights
6.4. Result set filtering settings
6.4.1. SetIDRange
6.4.2. SetFilter
6.4.3. SetFilterRange
6.4.4. SetFilterFloatRange
6.4.5. SetGeoAnchor
6.5. GROUP BY settings
6.5.1. SetGroupBy
6.5.2. SetGroupDistinct
6.6. Querying
6.6.1. Query
6.6.2. AddQuery
6.6.3. RunQueries
6.6.4. ResetFilters
6.6.5. ResetGroupBy
6.7. Additional functionality
6.7.1. BuildExcerpts
6.7.2. UpdateAttributes
6.7.3. BuildKeywords
6.7.4. EscapeString
6.7.5. Status
6.8. Persistent connections
6.8.1. Open
6.8.2. Close
7. MySQL storage engine (SphinxSE)
7.1. SphinxSE overview
7.2. Installing SphinxSE
7.2.1. Compiling MySQL 5.0.x with SphinxSE
7.2.2. Compiling MySQL 5.1.x with SphinxSE
7.2.3. Checking SphinxSE installation
7.3. Using SphinxSE
7.4. Building snippets (excerpts) via MySQL
8. Reporting bugs
9. sphinx.conf options reference
9.1. Data source configuration options
9.1.1. type
9.1.2. sql_host
9.1.3. sql_port
9.1.4. sql_user
9.1.5. sql_pass
9.1.6. sql_db
9.1.7. sql_sock
9.1.8. mysql_connect_flags
9.1.9. mysql_ssl_cert, mysql_ssl_key, mysql_ssl_ca
9.1.10. odbc_dsn
9.1.11. sql_query_pre
9.1.12. sql_query
9.1.13. sql_query_range
9.1.14. sql_range_step
9.1.15. sql_query_killlist
9.1.16. sql_attr_uint
9.1.17. sql_attr_bool
9.1.18. sql_attr_bigint
9.1.19. sql_attr_timestamp
9.1.20. sql_attr_str2ordinal
9.1.21. sql_attr_float
9.1.22. sql_attr_multi
9.1.23. sql_query_post
9.1.24. sql_query_post_index
9.1.25. sql_ranged_throttle
9.1.26. sql_query_info
9.1.27. xmlpipe_command
9.1.28. xmlpipe_field
9.1.29. xmlpipe_attr_uint
9.1.30. xmlpipe_attr_bool
9.1.31. xmlpipe_attr_timestamp
9.1.32. xmlpipe_attr_str2ordinal
9.1.33. xmlpipe_attr_float
9.1.34. xmlpipe_attr_multi
9.1.35. xmlpipe_fixup_utf8
9.1.36. mssql_winauth
9.1.37. mssql_unicode
9.1.38. unpack_zlib
9.1.39. unpack_mysqlcompress
9.1.40. unpack_mysqlcompress_maxsize
9.2. Index configuration options
9.2.1. type
9.2.2. source
9.2.3. path
9.2.4. docinfo
9.2.5. mlock
9.2.6. morphology
9.2.7. min_stemming_len
9.2.8. stopwords
9.2.9. wordforms
9.2.10. exceptions
9.2.11. min_word_len
9.2.12. charset_type
9.2.13. charset_table
9.2.14. ignore_chars
9.2.15. min_prefix_len
9.2.16. min_infix_len
9.2.17. prefix_fields
9.2.18. infix_fields
9.2.19. enable_star
9.2.20. ngram_len
9.2.21. ngram_chars
9.2.22. phrase_boundary
9.2.23. phrase_boundary_step
9.2.24. html_strip
9.2.25. html_index_attrs
9.2.26. html_remove_elements
9.2.27. local
9.2.28. agent
9.2.29. agent_blackhole
9.2.30. agent_connect_timeout
9.2.31. agent_query_timeout
9.2.32. preopen
9.2.33. ondisk_dict
9.2.34. inplace_enable
9.2.35. inplace_hit_gap
9.2.36. inplace_docinfo_gap
9.2.37. inplace_reloc_factor
9.2.38. inplace_write_factor
9.2.39. index_exact_words
9.2.40. overshort_step
9.2.41. stopword_step
9.3. indexer program configuration options
9.3.1. mem_limit
9.3.2. max_iops
9.3.3. max_iosize
9.3.4. max_xmlpipe2_field
9.3.5. write_buffer
9.4. searchd program configuration options
9.4.1. listen
9.4.2. address
9.4.3. port
9.4.4. log
9.4.5. query_log
9.4.6. read_timeout
9.4.7. client_timeout
9.4.8. max_children
9.4.9. pid_file
9.4.10. max_matches
9.4.11. seamless_rotate
9.4.12. preopen_indexes
9.4.13. unlink_old
9.4.14. attr_flush_period
9.4.15. ondisk_dict_default
9.4.16. max_packet_size
9.4.17. mva_updates_pool
9.4.18. crash_log_path
9.4.19. max_filters
9.4.20. max_filter_values
9.4.21. listen_backlog
9.4.22. read_buffer
9.4.23. read_unhinted
A. Sphinx revision history

1. Introduction

1.1. About

Sphinx is a full-text search engine, distributed under GPL version 2. Commercial licensing (eg. for embedded use) is also available upon request.

Generally, it’s a standalone search engine, meant to provide fast, size-efficient and relevant full-text search functions to other applications. Sphinx was specially designed to integrate well with SQL databases and scripting languages.

Currently built-in data source drivers support fetching data either via direct connection to MySQL, or PostgreSQL, or from a pipe in a custom XML format. Adding new drivers (eg. to natively support some other DBMSes) is designed to be as easy as possible.

The search API is natively ported to PHP, Python, Perl, Ruby, and Java, and is also available as a pluggable MySQL storage engine. The API is very lightweight, so porting it to a new language is known to take a few hours.

As for the name, Sphinx is an acronym which is officially decoded as SQL Phrase Index. Yes, I know about CMU’s Sphinx project.
