ElasticSearch Snippets

28 December 2019

I do a bunch of work with ElasticSearch, building tools so researchers can search through large amounts of data. I've had to figure out a bunch of useful queries for searching, aggregations, deletes, and index management.

In this example, I will use an index, twitter, that has account, tweet, retweet_count, language, and country fields.

Each record/document is a tweet, and one of the accounts is for @devnambi

Searches

Search everything

GET twitter/_search
{
    "query": {
        "match_all": {}
    }
}

Search for a single word

GET twitter/_search
{
  "query": { 
    "bool": { 
      "must": [
        { "match": { "account":   "devnambi" }}
      ]
    }
  }
}

Search for multiple words

GET twitter/_search
{
  "query": { 
    "bool": { 
      "must": [
        { "match": { "tweet":   "Hello World" }}
      ]
    }
  }
}

Search for multiple lower-case words

GET twitter/_search
{
  "query": { 
    "bool": { 
      "must": [
        { "match": { "tweet":   "hello world" }}
      ]
    }
  }
}

Records for a given dataset name (w/ tokens)

Single-word:

GET twitter/_search
{
  "query": {
    "bool": {
      "must": [{
          "term": {
            "tweet": "example"
          }
        }
      ]
    }
  }
}
}

Multiple-word:

GET twitter/_search
{
  "query": {
    "bool": {
      "must": [{
          "term": {
            "tweet": "include"
          }
        },
        {
          "term": {
            "tweet": "all"
          }
        },
        {
          "term": {
            "tweet": "these"
          }
        },
        {
          "term": {
            "tweet": "words"
          }
        }
      ]
    }
  }
}
}

Aggregations

Count of records by country

GET twitter/_search
{
  "size": 0,
  "aggs" : { 
        "source" : { 
            "terms" : { 
              "field" : "country.keyword"
            }
        }
    }
}

Count number of records for a filter

GET twitter/_search
{
    "size" : 0,
    "query": {
    "bool": {
      "must": [
        {
          "match": {
            "country.keyword": "US"
          }
        }
      ]
    }
  },
    "aggs" : { 
        "source" : { 
            "terms" : { 
              "field" : "country.keyword"
            }
        }
    }
}

Sum of retweets for an account

GET twitter/_search
{
    "size" : 0,
    "query": {
    "bool": {
      "must": [
        {
          "match": {
            "account.keyword": "devnambi"
          }
        }
      ]
    }
  },
    "aggs" : { 
        "source" : { 
            "sum" : { 
              "field" : "retweet_count"
            }
        }
    }
}

Get tweet count by account

GET twitter/_search
{
    "size" : 0,
    "aggs" : { 
        "source" : { 
            "terms" : { 
              "field" : "account.keyword"
            }
        }
    }
}

Get average size of all tweets in the UK

GET twitter/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "country.keyword": "UK"
          }
        }
      ]
    }
  },
  "aggs":{
    "avg_length" : { "avg" : { "script" : "_source.tweet.toString().getBytes(\"UTF-8\").length"}}
  }
}

Get unique count of tweets in Germany

GET twitter/_search
{
    "size" : 0,
    "query": {
    "bool": {
      "must": [
        {
          "match": {
            "country.keyword": "DE"
          }
        }
      ]
    }
  },
    "aggs" : 
    {
      "unique_filecount": {
          "cardinality" : 
          { 
            "field" : "tweet"
          }
      }
    }
}

Get records with missing retweet_count field by country (no retweets)

GET twitter/_search
{
  "size": 5,
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "retweet_count"
          }
        }
      ]
    }
  },
  "aggs": {
    "sources": {
      "terms": {
        "field": "country.keyword",
        "size": 10
      }
    }
  }
}

Get multiple aggregations

GET twitter/_search
{
    "size" : 0,
    "aggs" : { 
        "countries" : { 
            "terms" : { 
              "field" : "country.keyword"
            }
        },
        "accounts" : { 
            "terms" : { 
              "field" : "account.keyword"
            }
        }
        ,"languages" : { 
            "terms" : { 
              "field" : "language.keyword"
            }
        }
    }
}

Sources: % Increase, Term Percentage, Search Agg Pipeline

Deletes

Delete all tweets by account name

POST /twitter/_delete_by_query?conflicts=proceed
{
  "query": {
    "bool": {
      "must": [{
          "term": {
            "account": "devnambi"
          }
        }
      ]
    }
  }
}
}

Index Management

Index name in examples: twitter
Alias name in examples: social_media

Get Indexes

GET /_cat/indices?v

Delete an index

DELETE /twitter

Change the index refresh interval

PUT /twitter/_settings
{
    "index" : {
        "refresh_interval" : "1s"
    }
}

Aliases

Get aliases

GET /_cat/aliases?v

Set an alias

POST /_aliases
{
    "actions" : [
        { "add" : { "index" : "twitter", "alias" : "social_media" } }
    ]
}

Remove an alias

POST /_aliases
{
    "actions" : [
        { "remove" : { "index" : "twitter", "alias" : "social_media" } }
    ]
}

No piece of code is ever done. I can think of various improvements to improve query performance or add functionality. I'll add those over time.

Happy coding!