Search Indexes

Summary

Cloudant's search is built upon Lucene and allows you to do more ad hoc queries over your data than can be done with primary and secondary indexes.

API Demo

To demo the Cloudant API, you'll need to replicate a small sample database into your account. The database is named animaldb, and it contains information from Wikipedia about ten different animals.


Index functions

Search indexes are defined by a JavaScript function. It is run over all of your documents, in a similar manner to a view's map function, and defines the fields that your search can query.

A simple search function

function(doc){
  index("name", doc.name);
}

The function takes a single argument, the document, and calls the built-in index function to define an index on the name field.

Field names (the first argument to the index() function) cannot start with an underscore (_). If they do, the document will not be indexed.

Values can only be strings, booleans or numbers (specifically 64-bit floating point). Notably, they cannot be objects, arrays, null or undefined; if they are, the document will not be indexed.
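
The two rules above can be guarded against inside the index function itself. The following is a minimal sketch, not Cloudant's own code: isIndexable is a hypothetical helper, and the top-level index function is only a local stand-in that records calls so the example can be run in Node.js. The helper is declared inside the index function because only that function's source is stored in the design document.

```javascript
// Local stand-in for the built-in index(); it only records what was indexed.
const calls = [];
function index(name, value, options) {
  calls.push({ name: name, value: value, options: options });
}

const searchIndex = function (doc) {
  // Hypothetical guard: true only for strings, booleans and numbers.
  function isIndexable(value) {
    var t = typeof value;
    return t === "string" || t === "boolean" || t === "number";
  }
  if (isIndexable(doc.name)) {
    index("name", doc.name);
  }
  if (isIndexable(doc.min_length)) {
    index("min_length", doc.min_length);
  }
};

// The first document has usable values; the second has an array and a null.
searchIndex({ _id: "llama", name: "llama", min_length: 1.7 });
searchIndex({ _id: "bad", name: ["an", "array"], min_length: null });
console.log(calls);
```

Only the two fields of the first document are recorded; the unsupported values in the second document are skipped rather than passed to index().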

Similar to views, the functions that define search indexes are stored in design documents, but under the key indexes. Under indexes you define each search index in an object containing the index function and an optional analyzer. Details on analyzers are given below; the default is standard.
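
One way to produce such a design document from ordinary JavaScript source is to serialize the function with toString(). This is a sketch under assumptions: the design document name _design/example and the index name byName are made up for illustration.

```javascript
// The index function, written as plain JavaScript.
// On the server, index() is supplied by Cloudant; it is never called here.
const byName = function (doc) {
  if (typeof doc.name === "string") {
    index("name", doc.name);
  }
};

// Each entry under "indexes" holds the function source as a string plus an
// optional analyzer. Both names below are hypothetical.
const designDoc = {
  _id: "_design/example",
  indexes: {
    byName: {
      analyzer: "standard",
      index: byName.toString()
    }
  }
};

console.log(JSON.stringify(designDoc, null, 2));
```

The printed JSON has the same shape as the design documents shown in this tutorial, with the function stored as a string.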

Querying a search index

API DEMO

{
  "_id": "_design/views101",
  "_rev": "12-649b0e71ca89cdad5d66a4e07316726f",
  "indexes": {
    "animals": {
      "index": "function(doc){ index(\"default\", doc._id); }"
    }
  }
}

The API call below queries this search index, called animals, inside the views101 design document. The query does not specify a field (it just uses ?q=[query]), so Cloudant searches the default field, which the index function above populates with the document _id. Because animal names are stored in the _id field, the default field is well suited to name searches such as ?q=kookaburra. Also try a search for "llama" or "elephant". Note, however, that you can always query by ID using the special _id field name.

Query
https://[username].cloudant.com/animaldb/_design/views101/_search/animals?q=kookaburra

Options

The built-in index function takes three arguments: the Lucene field, the value for that field, and an optional options object.

function(doc){
  index("name", doc.name, {"store": true, "index": "analyzed_no_norms"});
}

The options object has two keys: store and index. Their possible values are tabulated below.

Option | Description | Values | Default
store | If true, the value will be returned in the search result; if false, the value will not be returned in the search result. | true, false | false
index | Whether (and how) the data is indexed. | analyzed, analyzed_no_norms, no, not_analyzed, not_analyzed_no_norms | analyzed

Analyzers

Analyzers define how to extract index terms from text, which you might need to control if your application needs to index Chinese, for example. Here's the list of generic analyzers supported by Cloudant search. See further down for language-specific analyzers.

standard
This is the default analyzer and implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
email
Like standard but tries harder to match an email address as a complete token.
keyword
Input is not tokenized at all.
simple
Divides text at non-letters.
whitespace
Divides text at whitespace boundaries.
classic
The standard Lucene analyzer circa release 3.1. You'll know if you need it.
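
To make the difference between some of these analyzers concrete, here is a rough approximation of keyword, whitespace and simple in plain JavaScript. It is an illustration only, not how Lucene implements them; in particular, the lower-casing in simple reflects my understanding of Lucene's SimpleAnalyzer and is not stated in the list above.

```javascript
// Approximate tokenizers, for illustration only.
const analyzers = {
  // keyword: the whole input is a single term.
  keyword: (text) => [text],
  // whitespace: split at whitespace boundaries, keep punctuation.
  whitespace: (text) => text.split(/\s+/).filter(Boolean),
  // simple: split at non-letters (and, by assumption, lower-case the terms).
  simple: (text) => text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean)
};

const sample = "Meles meles, the European badger";
for (const [name, tokenize] of Object.entries(analyzers)) {
  console.log(name, tokenize(sample));
}
```

Note how whitespace keeps the comma attached to "meles," while simple drops it, which changes what a query has to contain in order to match.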

You can choose which analyzer is used by your index function by changing the index definition in the design document.

Defining an analyzer

"indexes": {
  "mysearch": {
    "analyzer": "whitespace",
    "index": "function(doc){ ... }"
  }
}

Note: Changing the analyzer causes the index to be rebuilt. (Also note that queries against a given index are run with the same analyzer that is defined for that index.)

Language-specific Analyzers

We provide a large number of analyzers for specific languages. These analyzers will omit very common words in the specific language, as these tend to make poor search queries and cause considerable index bloat. Many of these also perform stemming, where common word prefixes or suffixes are removed.


Per-Field Analyzer

Sometimes a single analyzer isn't enough. You can use the perfield analyzer to configure different analyzers for different field names:

Per-field analysis

"indexes": {
  "mysearch" : {
    "analyzer": {
      "name": "perfield",
      "default": "english",
      "fields": {
        "spanish": "spanish",
        "german": "german"
      }
    },
    "index": "function(doc){ ... }"
   }
 }

Stop words

You may want to define a set of words that do not get indexed. These are called stop words. You define stop words in the design document by turning the analyzer string into an object:

A simple stop words example

"indexes": {
  "mysearch" : {
    "analyzer": {"name": "portuguese", "stopwords":["foo", "bar", "baz"]},
    "index": "function(doc){ ... }"
  }
}

Note that keyword, simple and whitespace analyzers do not support stop words.

API

As you probably noticed above, the search URL requires a q (or query) query string parameter. Its value is the query that is passed on to the search index. Search supports two data types, string and number, and the type is detected automatically. If you need to pass a number as a string, quote it, e.g. q="12".

The search URL can optionally take limit, include_docs and stale (which behave as they do for primary and secondary indexes), as well as sort and bookmark.
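
Assembling such a URL by hand is error-prone because the query and its options must be URL-encoded. The sketch below uses the standard URLSearchParams class; buildSearchUrl is a hypothetical helper and username.cloudant.com stands in for a real account.

```javascript
// Build a search URL from a base path, a query and optional parameters.
function buildSearchUrl(base, query, options = {}) {
  const params = new URLSearchParams({ q: query });
  for (const [key, value] of Object.entries(options)) {
    params.set(key, String(value));
  }
  return `${base}?${params.toString()}`;
}

const url = buildSearchUrl(
  "https://username.cloudant.com/animaldb/_design/views101/_search/animals",
  "class:bird",
  { limit: 5, include_docs: true }
);
console.log(url);
```

URLSearchParams takes care of percent-encoding characters such as the colon in class:bird, so the query can be written in its readable form.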

Pagination and Sorting

Bookmarks allow you to efficiently skip through results you have already seen. All search results include a bookmark in their JSON response. By passing this value to the search URL via the bookmark query parameter you will see the next page of results.
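
The bookmark hand-off can be sketched as a small pure function. This is an assumption-laden illustration: nextPageParams is a hypothetical helper, the response object is hand-written rather than real server output, and an empty rows array is taken here as the signal that there are no more results.

```javascript
// Given the parameters of the previous request and its response, return the
// parameters for the next page, or null when the previous page was empty.
function nextPageParams(params, response) {
  if (!response.rows || response.rows.length === 0) {
    return null;
  }
  return { ...params, bookmark: response.bookmark };
}

const firstPage = { q: "class:mammal", limit: 2 };
// Made-up response with the same top-level keys as a search result.
const fakeResponse = {
  total_rows: 5,
  bookmark: "BOOKMARK_FROM_PAGE_1",
  rows: [{ id: "a" }, { id: "b" }]
};

console.log(nextPageParams(firstPage, fakeResponse));
console.log(nextPageParams(firstPage, { total_rows: 5, bookmark: "x", rows: [] }));
```

The query and limit stay the same between pages; only the bookmark changes.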

Search results can be sorted ascending or descending by any numeric or string field in the index. Sort order is set by the sort query parameter, which takes a JSON string or list as its value. If the field is a string field, you have to add <string> to the end of the field name. To sort by age you'd query your search index with ?sort="age"; to sort descending you'd use ?sort="-age". To sort by name, you'd use ?sort="name<string>". Sorts can be applied to multiple fields; for instance, ?sort=["-age", "height"] would sort by age descending, then by height ascending.
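
The rules for the sort value are easy to get wrong by hand, so here is a minimal sketch of a builder. The helper names and the { field, descending, string } entry shape are my own invention for illustration.

```javascript
// Turn one entry into a sort key such as "-age" or "name<string>".
function sortKey({ field, descending = false, string = false }) {
  return `${descending ? "-" : ""}${field}${string ? "<string>" : ""}`;
}

// A single key is sent as a JSON string, several keys as a JSON list.
function sortParam(entries) {
  const keys = entries.map(sortKey);
  return JSON.stringify(keys.length === 1 ? keys[0] : keys);
}

console.log(sortParam([{ field: "age", descending: true }]));
console.log(sortParam([{ field: "name", string: true }]));
console.log(sortParam([{ field: "age", descending: true }, { field: "height" }]));
```

The returned text is the raw parameter value; it still needs URL-encoding before it is appended to a search URL.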

Sorting by Relevance

The default sort order (when you don't supply a sort parameter) is relevance: the highest scoring matches are returned first. If you specify a sort order, matches are returned in that order, ignoring relevance. If you want to include the relevance ordering in your sort order, you can use the special fields -<score> and <score>.

Sorting By Distance

In addition to sorting by indexed fields, you can sort by distance from a point chosen at query time. You will need to index two numeric fields (representing the longitude and latitude of whatever you're indexing):

function(doc) {
  index("mylon", doc.longitude);
  index("mylat", doc.latitude);
}

You can then query using the special <distance...> sort field, which takes five parameters:

longitude field name
The name of your longitude field ("mylon" in this example)
latitude field name
The name of your latitude field ("mylat" in this example)
longitude of origin
The longitude of the place you want to sort by distance from
latitude of origin
The latitude of the place you want to sort by distance from
units
The units to use ("km" or "mi" for kilometers and miles, respectively). The distance itself is returned in the order field.

An example query to make this clear:

?sort="<distance,mylon,mylat,-0.14479689999996026,51.4964609,mi>"

You can combine sorting by distance with a bounding box query to perform simple geo operations.
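
As a sketch of that combination, the helpers below build a distance sort value and a bounding box expressed as two numeric range clauses. Both helper names are hypothetical, and treating the bounding box as a pair of range queries on the longitude and latitude fields is my reading of the sentence above rather than a documented recipe.

```javascript
// Build the JSON-quoted <distance,...> sort value from its five parameters.
function distanceSort(lonField, latField, lon, lat, units) {
  return JSON.stringify(`<distance,${lonField},${latField},${lon},${lat},${units}>`);
}

// Restrict matches to a box using inclusive range clauses on both fields.
function boundingBox(lonField, latField, west, south, east, north) {
  return `${lonField}:[${west} TO ${east}] AND ${latField}:[${south} TO ${north}]`;
}

const q = boundingBox("mylon", "mylat", -1, 51, 1, 52);
const sort = distanceSort("mylon", "mylat", -0.1448, 51.4965, "mi");
console.log(`?q=${q}&sort=${sort}`);
```

The printed string is left unencoded for readability; a real request would need both values URL-encoded.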

Query Syntax

The Cloudant search query syntax is based on the Lucene syntax. Search queries take the form of name:value (unless the name is omitted, in which case they hit the default field as we demonstrated in the first example, above).

Queries over multiple fields can be logically combined, and groups and fields can be grouped further. The available logical operators are AND, +, OR, NOT and -, and they are case sensitive. Range queries can run over strings or numbers.

If you want a fuzzy search, you can run a query with ~ to find terms like the search term. For instance, look~ will find the terms book and took.

You can also increase the importance of a search term by using the boost character ^. This makes matches containing the term more relevant, e.g. cloudant "data layer"^4 will make results containing "data layer" 4 times more relevant. The default boost value is 1. Boost values must be positive, but can be less than 1 (e.g. 0.5 to reduce importance).

Wildcard searches are supported, for both single (?) and multiple (*) character searches. dat? would match date and data, while dat* would match date, data, database, dates, etc. Wildcards must come after a search term; you cannot run a query like *base.

Result sets from searches are limited to 200 rows, and return 25 rows by default. The number of rows returned can be changed via the limit parameter. The response contains a bookmark. If the bookmark is passed back as a URL parameter you'll skip through the rows you've already seen and get the next set of results.

The following characters require escaping if you want to search on them:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

Escape these with a preceding backslash character.
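
A minimal sketch of such escaping for user-supplied text follows. escapeQueryTerm is a hypothetical helper; it escapes every & and | individually, which is slightly broader than the list above requires.

```javascript
// Prefix each query-syntax special character with a backslash.
function escapeQueryTerm(text) {
  return text.replace(/[+\-!(){}[\]^"~*?:\\\/&|]/g, "\\$&");
}

console.log(escapeQueryTerm("1+1"));
console.log(escapeQueryTerm("AC/DC (live)"));
console.log(escapeQueryTerm("kookaburra"));
```

Whitespace is not escaped, so a multi-word value still needs to be wrapped in double quotes after escaping if it is meant as a phrase.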

API DEMO

The animaldb database contains a design document that, amongst other things, defines a search index over the animal name, diet, minimum length, Latin name and class.

function(doc){
  index("default", doc._id);
  if(doc.min_length){
    index("min_length", doc.min_length, {"store": "yes"});
  }
  if(doc.diet){
    index("diet", doc.diet, {"store": "yes"});
  }
  if (doc.latin_name){
    index("latin_name", doc.latin_name, {"store": "yes"});
  }
  if (doc.class){
    index("class", doc.class, {"store": "yes"});
  }
}

With this index you can run any of these queries.

Desired result | Query
Birds | class:bird
Animals that begin with the letter "l" | l*
Carnivorous birds | class:bird AND diet:carnivore
Herbivores that start with the letter "l" | l* AND diet:herbivore
Medium-sized herbivores | min_length:[1 TO 3] AND diet:herbivore
Herbivores that are 2m long or less | diet:herbivore AND min_length:[-Infinity TO 2]
Mammals that are at least 1.5m long | class:mammal AND min_length:[1.5 TO Infinity]
Find "Meles meles" | latin_name:"Meles meles"
Mammals that are herbivores or omnivores | diet:(herbivore OR omnivore) AND class:mammal
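
Queries like these can also be composed programmatically. The three helpers below are hypothetical and cover only the term, range and AND forms used in the examples above; they do no escaping.

```javascript
// field:value, quoting the value when it contains whitespace.
const term = (field, value) =>
  `${field}:${/\s/.test(value) ? `"${value}"` : value}`;

// Inclusive numeric range; Infinity and -Infinity print as in the examples.
const range = (field, min, max) => `${field}:[${min} TO ${max}]`;

// Join clauses with the (case-sensitive) AND operator.
const and = (...clauses) => clauses.join(" AND ");

console.log(and(term("class", "mammal"), range("min_length", 1.5, Infinity)));
console.log(and(term("diet", "herbivore"), range("min_length", -Infinity, 2)));
console.log(term("latin_name", "Meles meles"));
```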

Query
https://[username].cloudant.com/animaldb/_design/views101/_search/animals?q=class:bird

Grouping Results

In addition to basic searching, you can also group results by common values of a chosen field using the group_field parameter. For full details, see Docs.

Faceted Search

Cloudant Search also supports faceted searching, which allows you to discover aggregate information about all your matches quickly and easily. You can even match all documents (using the special ?q=*:* query syntax) and use the returned facets to refine your query. Indexing a facet is straightforward, and facet values can be strings or numbers:

function(doc) {
  index("type", doc.type, {"facet": true});
  index("price", doc.price, {"facet": true});
}

Once indexed, you can find out how many documents you have of any string facet with the counts= parameter, in addition to any query string you like. Example output for ?q=*:*&counts=["type"] follows:

{"total_rows":100000, "bookmark":"g...", "rows":[...],
 "counts":{"type":{"sofa":10.0, "chair":100.0}}
}

You can also perform range facet queries on numeric facets using the ranges= parameter. For example:

?q=*:*&ranges={"price":{"cheap":"[0 TO 100]","expensive":"{100 TO Infinity}"}}

The range facet syntax reuses the standard Lucene syntax for ranges (inclusive range queries are denoted by square brackets, exclusive range queries are denoted by curly brackets). This will return output like:

"ranges":{"price":{"cheap":101.0,"expensive":99899.0}}

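Because counts and ranges both take JSON values, it can help to build them from plain objects instead of typing the JSON into the URL. The sketch below uses the standard URLSearchParams class; facetQueryString is a hypothetical helper.

```javascript
// Build the query string for a faceted search. counts is a list of field
// names; ranges maps a field name to labelled Lucene range expressions.
function facetQueryString(query, { counts = [], ranges = {} } = {}) {
  const params = new URLSearchParams({ q: query });
  if (counts.length > 0) {
    params.set("counts", JSON.stringify(counts));
  }
  if (Object.keys(ranges).length > 0) {
    params.set("ranges", JSON.stringify(ranges));
  }
  return params.toString();
}

const facetQs = facetQueryString("*:*", {
  counts: ["type"],
  ranges: { price: { cheap: "[0 TO 100]", expensive: "{100 TO Infinity}" } }
});
console.log(facetQs);
```

The output is percent-encoded, so it is less readable than the examples above but safe to append to a search URL after a question mark.
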
Example applications

To demonstrate the functionality of search we've pulled together a couple of example applications. If you'd like to replicate them into your account you are welcome to do so, but they both use sizable datasets and will use up a significant number of Cloudant units.

Full text indexing is what Lucene is built for, and Cloudant search is no different. In this example we've taken the public lobbyist disclosure dataset from the US Senate. The dataset consists of 757,123 individual documents. The uncompressed XML documents are 2.5 GB on disk, and the corresponding Cloudant database is only 1.3 GB.

Geo indexing is possible with Cloudant search. By combining location awareness with other queries you can build applications that find what a user wants, where a user is. In this example we've taken the Simple Geo "places of interest" data set of over 20 million locations and combined it with searches over other values (e.g. find restaurants near the office). A simple geo-indexer couldn't do these "refined searches" because they require additional dimensions in the query.