Rockset

    Text Search Functions

    This page covers functions that can be used to make queries involving a text search.

    Text search functions operate on fields that have been tokenized into arrays of strings; you may do so by creating an ingest transformation using the TOKENIZE function.

    In addition, the SEARCH function only works in the WHERE clause of a query on a collection. It does not work if the target of a query is a subquery or a JOIN, and it does not work in other parts of the query outside of the WHERE clause.

    SEARCH may be combined with other operators and functions in the same WHERE clause.

    If you use the SEARCH function, SCORE() becomes available as a function with no arguments, returning the score of the match.

    The collection data used in the SQL queries on this page contains the following 4 documents.

    +------+-------------------------------------------------------------------------+
    | id   | words                                                                   |
    |------+-------------------------------------------------------------------------|
    | doc1 | ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"] |
    | doc2 | ["hello", "world", "goodbye", "world"]                                  |
    | doc3 | ["cat", "dog", "fox"]                                                   |
    | doc4 | ["hello", "goodbye"]                                                    |
    +------+-------------------------------------------------------------------------+

    Basic Functions

    TOKENIZE

    TOKENIZE(text[, locale])

    Tokenize text into array of strings, normalized and lower-cased. text is interpreted as text in the language specified by locale. locale is specified using the ICU format. The default is en_US_POSIX. This function is usually used in an ingest transformation to convert a text field into array of strings at the time of document ingestion. Once an input text field is converted to array of strings, you can use SEARCH function in a SQL query to perform text search using this array of strings.

    SQL commandResult
    SELECT TOKENIZE('Hello world')["hello","world"]
    SELECT TOKENIZE('Hi, how are you?')["hi","how","are","you"]
    SEARCH(term_matcher[, term_matcher]*) [OPTION(match_all=false)]

    Return documents that satisfy all of the term matchers (by default), or at least one of the term matchers (with OPTION(match_all=false)). Term matchers are arbitrary SQL boolean expressions, optionally boosted. The score of the match is the sum of the boost values (default 1) for all terms that matched. For searches that include proximity ranges (see below), each proximity range also adds to the score a proximity score that depends on the longest consecutive match.

    /*
    In this example, `doc2` got 2 points, 1 point each for the occurrence of words
    ['hello', 'world']. No other document contains both 'hello' and 'world' in the
    field words.
    */
    SELECT
        _id,
        words,
        score() score
    FROM
        data
    WHERE
        search(
            has_term(words, 'hello'),
            has_term(words, 'world')
        )
    ORDER BY
        score() DESC
    +------+----------------------------------------+-------+
    | id   | words                                  | score |
    |------+----------------------------------------|-------|
    | doc2 | ['hello', 'world', 'goodbye', 'world'] | 2     |
    +------+----------------------------------------+-------+

    In the rest of the page, we describe how to use SEARCH with the supported term matchers.

    Term Matcher Functions

    BOOST

    BOOST(boost_value, term)

    Set the boost value (a positive floating point value) for a specific search term so that the term contributes the specified value. Unboosted terms contribute a default score of 1.0. Note that following sample queries use collection data described at the top of this page.

    /*
    In this example, `doc2` got:
    - 1 point for the occurrence of the word 'hello'
    - 3 points for the occurrence of the word 'world'
    `doc4` got:
    - 1 point for the occurrence of the word 'hello'
    */
    SELECT
        _id,
        words,
        score() score
    FROM
        data
    WHERE
        search(
            has_term(words, 'hello'),
            boost(3, has_term(words, 'world'))
        ) OPTION(match_all = false)
    ORDER BY
        score() DESC
    +------+----------------------------------------+-------+
    | id   | words                                  | score |
    |------+----------------------------------------|-------|
    | doc2 | ['hello', 'world', 'goodbye', 'world'] | 4     |
    | doc4 | ['hello', 'goodbye']                   | 1     |
    +------+----------------------------------------+-------+
    /*
    In this example, `doc4` got:
    - 10 points for the occurrence of the word 'hello'
    - 1 point for the occurrence of the word 'goodbye'
    - 0.5 points due to proximity range search as the longest run in it
      ['hello', 'goodbye'] is of length 2
    `doc2` got:
    - 10 points for the occurrence of the word 'hello'
    - 1 point for the occurrence of the word 'goodbye'
    - 0 points due to proximity range search as the longest run in it ['hello']
      or ['goodbye'] is of length 1
    */
    SELECT
        _id id,
        words,
        score() score
    FROM
        data
    WHERE
        search(
            proximity(
                boost(10, has_term(words, 'hello')),
                has_term(words, 'goodbye')
            )
        ) OPTION(match_all = false)
    ORDER BY
        score() DESC
    +------+----------------------------------------+-------+
    | id   | words                                  | score |
    |------+----------------------------------------|-------|
    | doc4 | ['hello', 'goodbye']                   | 11.5  |
    | doc2 | ['hello', 'world', 'goodbye', 'world'] | 11    |
    +------+----------------------------------------+-------+
    /*
    In this example, `doc4` got 3 * (10 + 1 + 0.5) + 5 points, where
    - 10 points are for the occurrence of the word 'hello'
    - 1 point is for the occurrence of the word 'goodbye'
    - 0.5 points are due to proximity range search as the longest run in it
      ['hello', 'goodbye'] is of length 2
    - 5 points are for the occurrence of the word 'goodbye'
    `doc2` got 3 * (10 + 1 + 0) + 5 points, where
    - 10 points are for the occurrence of the word 'hello'
    - 1 point is for the occurrence of the word 'goodbye'
    - 0 points are due to proximity range search as the longest run in it ['hello']
      or ['goodbye'] is of length 1
    - 5 points are for the occurrence of the word 'goodbye'
    */
    SELECT
        _id id,
        words,
        score() score
    FROM
        data
    WHERE
        search(
            BOOST(
                3,
                proximity(
                    BOOST(10, has_term(words, 'hello')),
                    has_term(words, 'goodbye')
                )
            ),
            BOOST(5, has_term(words, 'goodbye'))
        ) OPTION(match_all = false)
    ORDER BY
        score() DESC
    +------+----------------------------------------+-------+
    | id   | words                                  | score |
    |------+----------------------------------------|-------|
    | doc4 | ['hello', 'goodbye']                   | 39.5  |
    | doc2 | ['hello', 'world', 'goodbye', 'world'] | 38    |
    +------+----------------------------------------+-------+

    CONTAINS

    CONTAINS(field, search_string[, locale])

    Tokenize the given search string in the given locale (default: en_US) and create a proximity range from the terms. Note that following sample queries use collection data described at the top of this page.

    /*
    In this example, `doc4` got:
    - 2 points, 1 point each for the occurrence of words ['hello', 'goodbye']
    - 0.5 points due to proximity range search as the longest run in it
      ['hello', 'goodbye'] is of length 2
    `doc2` got:
    - 2 points, 1 point each for the occurrence of words ['hello', 'goodbye']
    - 0 points due to proximity range search as the longest run in it ['hello']
      or ['goodbye'] is of length 1
    */
    SELECT
        _id id,
        words,
        score() score
    FROM
        data
    WHERE
        search(contains(words, 'hello goodbye')) OPTION(match_all = false)
    ORDER BY
        score() DESC
    +------+----------------------------------------+-------+
    | id   | words                                  | score |
    |------+----------------------------------------|-------|
    | doc4 | ['hello', 'goodbye']                   | 2.5   |
    | doc2 | ['hello', 'world', 'goodbye', 'world'] | 2     |
    +------+----------------------------------------+-------+
    /*
    In this example, `doc1` got:
    - 5 points, 1 point each for the occurrence of words
      ['quick', 'brown', 'fox', 'jumps' 'over']
    - 2 points due to proximity range search as the longest run in it
      ['quick', 'brown', 'fox', 'jumps', 'over'] is of length 5
    `doc3` got:
    - 1 point for the occurrence of the word 'fox'
    - 0 points due to proximity range search as the longest run in it ['fox']
      is of length 1
    */
    SELECT
        _id id,
        words,
        score() score
    FROM
        data
    WHERE
        search(
            contains(
                words,
                'quick brown fox jumps over that elephant'
            )
        ) OPTION(match_all = false)
    ORDER BY
        score() DESC;
    +------+-------------------------------------------------------------------------+-------+
    | id   | words                                                                   | score |
    |------+-------------------------------------------------------------------------|-------|
    | doc1 | ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'] | 7     |
    | doc3 | ['cat', 'dog', 'fox']                                                   | 1     |
    +------+-------------------------------------------------------------------------+-------+
    /*
    In this example, `doc1` got:
    - 3 points, 1 point each for the occurrence of words ['brown', 'fox', 'over']
    - 0.5 points due to proximity range search as the longest run in it
      ['brown', 'fox'] is of length 2
    `doc3` got:
    - 1 point for the occurrence of the word 'fox'
    - 0 points due to proximity range search as the longest run in it ['fox']
      is of length 1
    */
    SELECT
        _id id,
        words,
        score() score
    FROM
        data
    WHERE
        search(
            contains(
                words,
                'majestic brown fox jumped over that elephant'
            )
        ) OPTION(match_all = false)
    ORDER BY
        score() DESC;
    +------+-------------------------------------------------------------------------+-------+
    | id   | words                                                                   | score |
    |------+-------------------------------------------------------------------------|-------|
    | doc1 | ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'] | 3.5   |
    | doc3 | ['cat', 'dog', 'fox']                                                   | 1     |
    +------+-------------------------------------------------------------------------+-------+

    HAS_TERM

    HAS_TERM(field, term)

    A term matcher that lets you search documents where the specified field contains the specified term. Upon a successful match, this matcher contributes a score of 1. Note that following sample queries use collection data described at the top of this page.

    /*
    In this example, `doc2` got 2 points, 1 point each for the occurrence of words
    ['hello', 'world']. No other document contains both 'hello' and 'world' in the
    field words.
    */
    SELECT
        _id,
        words,
        score() score
    FROM
        data
    WHERE
        search(
            has_term(words, 'hello'),
            has_term(words, 'world')
        )
    ORDER BY
        score() DESC
    +------+----------------------------------------+-------+
    | id   | words                                  | score |
    |------+----------------------------------------|-------|
    | doc2 | ['hello', 'world', 'goodbye', 'world'] | 2     |
    +------+----------------------------------------+-------+
    /*
    In this example, `doc2` got 2 points, 1 point each for the occurrence of words
    ['hello', 'world']. `doc4` got 1 point for the occurrence of the word 'hello'.
    */
    SELECT
        _id,
        words,
        score() score
    FROM
        data
    WHERE
        search(
            has_term(words, 'hello'),
            has_term(words, 'world')
        ) OPTION(match_all = false)
    ORDER BY
        score() DESC
    +------+----------------------------------------+-------+
    | id   | words                                  | score |
    |------+----------------------------------------|-------|
    | doc2 | ['hello', 'world', 'goodbye', 'world'] | 2     |
    | doc4 | ['hello', 'goodbye']                   | 1     |
    +------+----------------------------------------+-------+

    PROXIMITY

    PROXIMITY(term_matcher[, term_matcher]*)

    Create a proximity range matcher using the given term matchers. The term matchers will contribute 1 point to the final score, just like other term matchers in the SEARCH query. But the proximity range mather will also contribute a proximity score that depends on the longest consecutive match. The longest consecutive match is the longest sequence of N consecutive terms such that: - term at index T satisfies term matcher at index P - term at index T+1 satisfies term matcher at index P+1 - ... - term at index T+N-1 satisfies term matcher at index P+N-1 Based on the length N of the longest consecutive match, the proximity score is computed as ((N - 1) / 2). Note that following sample queries use collection data described at the top of this page.

    /*
    In this example, `doc4` got:
    - 2 points, 1 point each for the occurrence of words ['hello', 'goodbye']
    - 0.5 points due to proximity range search as the longest run in it
      ['hello', 'goodbye'] is of length 2
    `doc2` got:
    - 2 points, 1 point each for the occurrence of words ['hello', 'goodbye']
    - 0 points due to proximity range search as the longest run in it ['hello']
      or ['goodbye'] is of length 1
    */
    SELECT
        _id id,
        words,
        score() score
    FROM
        data
    WHERE
        search(
            proximity(
                has_term(words, 'hello'),
                has_term(words, 'goodbye')
            )
        ) OPTION(match_all = false)
    ORDER BY
        score() DESC
    +------+----------------------------------------+-------+
    | id   | words                                  | score |
    |------+----------------------------------------|-------|
    | doc4 | ['hello', 'goodbye']                   | 2.5   |
    | doc2 | ['hello', 'world', 'goodbye', 'world'] | 2     |
    +------+----------------------------------------+-------+
    /*
    In this example, `doc1` got:
    - 5 points, 1 point each for the occurrence of words
      ['quick', 'brown', 'fox', 'jumps' 'over']
    - 2 points due to proximity range search as the longest run in it
      ['quick', 'brown', 'fox', 'jumps', 'over'] is of length 5
    `doc3` got:
    - 1 point for the occurrence of the word 'fox'
    - 0 points due to proximity range search as the longest run in it ['fox']
      is of length 1
    */
    SELECT
        _id id,
        words,
        score() score
    FROM
        data
    WHERE
        search(
            proximity(
                has_term(words, 'quick'),
                has_term(words, 'brown'),
                has_term(words, 'fox'),
                has_term(words, 'jumps'),
                has_term(words, 'over'),
                has_term(words, 'that'),
                has_term(words, 'elephant')
            )
        ) OPTION(match_all = false)
    ORDER BY
        score() DESC;
    +------+-------------------------------------------------------------------------+-------+
    | id   | words                                                                   | score |
    |------+-------------------------------------------------------------------------|-------|
    | doc1 | ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'] | 7     |
    | doc3 | ['cat', 'dog', 'fox']                                                   | 1     |
    +------+-------------------------------------------------------------------------+-------+
    /*
    In this example, `doc1` got:
    - 3 points, 1 point each for the occurrence of words ['brown', 'fox', 'over']
    - 0.5 points due to proximity range search as the longest run in it
      ['brown', 'fox'] is of length 2
    `doc3` got:
    - 1 point for the occurrence of the word 'fox'
    - 0 points due to proximity range search as the longest run in it ['fox']
      is of length 1
    */
    SELECT
        _id id,
        words,
        score() score
    FROM
        data
    WHERE
        search(
            proximity(
                has_term(words, 'majestic'),
                has_term(words, 'brown'),
                has_term(words, 'fox'),
                has_term(words, 'jumped'),
                has_term(words, 'over'),
                has_term(words, 'that'),
                has_term(words, 'elephant')
            )
        ) OPTION(match_all = false)
    ORDER BY
        score() DESC;
    +------+-------------------------------------------------------------------------+-------+
    | id   | words                                                                   | score |
    |------+-------------------------------------------------------------------------|-------|
    | doc1 | ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'] | 3.5   |
    | doc3 | ['cat', 'dog', 'fox']                                                   | 1     |
    +------+-------------------------------------------------------------------------+-------+

    PROXIMITY_BOOST

    PROXIMITY_BOOST(boost_value, proximity_range)

    Set the proximity range boost value (a positive floating point value) for the given specific search proximity range. This affects the proximity range's contribution to the score i.e. the proximity score, but not the contributions of the individual terms in the range. proximity_range must be a proximity range created using either PROXIMITY or CONTAINS. Note that following sample queries use collection data described at the top of this page.

    /*
    In this example, `doc4` got 2 + 10 * 0.5 points, where
    - 2 points are for the occurrence of the words ['hello', 'goodbye']
    - 0.5 points are due to proximity range search as the longest run in it
      ['hello', 'goodbye'] is of length 2
    `doc2` got 2 + 10 * 0 points, where
    - 2 points are for the occurrence of the words ['hello', 'goodbye']
    - 10 * 0 points due to proximity range search as the longest run in it ['hello']
      or ['goodbye'] is of length 1
    */
    SELECT
        _id id,
        words,
        score() score
    FROM
        data
    WHERE
        search(
            proximity_boost(
                10,
                proximity(
                    has_term(words, 'hello'),
                    has_term(words, 'goodbye')
                )
            )
        ) OPTION(match_all = false)
    ORDER BY
        score() DESC
    +------+----------------------------------------+-------+
    | id   | words                                  | score |
    |------+----------------------------------------|-------|
    | doc4 | ['hello', 'goodbye']                   | 7     |
    | doc2 | ['hello', 'world', 'goodbye', 'world'] | 2     |
    +------+----------------------------------------+-------+
    /*
    In this example, `doc4` got 5 + 1 + 10 * 0.5 points, where
    - 5 points are for the occurrence of the word 'hello'
    - 1 point is for the occurrence of the word 'goodbye'
    - 0.5 points are due to proximity range search as the longest run in it
      ['hello', 'goodbye'] is of length 2
    `doc2` got 5 + 1 + 10 * 0 points, where
    - 5 points are for the occurrence of the word 'hello'
    - 1 point is for the occurrence of the word 'goodbye'
    - 10 * 0 points are due to proximity range search as the longest run in it
      ['hello'] or ['goodbye'] is of length 1
    */
    SELECT
        _id id,
        words,
        score() score
    FROM
        data
    WHERE
        search(
            proximity_boost(
                10,
                proximity(
                    boost(5, has_term(words, 'hello')),
                    has_term(words, 'goodbye')
                )
            )
        ) OPTION(match_all = false)
    ORDER BY
        score() DESC
    +------+----------------------------------------+-------+
    | id   | words                                  | score |
    |------+----------------------------------------|-------|
    | doc4 | ['hello', 'goodbye']                   | 11    |
    | doc2 | ['hello', 'world', 'goodbye', 'world'] | 6     |
    +------+----------------------------------------+-------+