28 December 2019
I do a bunch of work with ElasticSearch, building tools so researchers can search through large amounts of data. I’ve had to figure out a bunch of useful queries for searching, aggregations, deletes, and index management.
In this example, I will use an index, twitter
, that has account
, tweet
, retweet_count
, language
, and country
fields.
Each record/document is a tweet, and one of the accounts is for @devnambi
Search everything
GET twitter/_search
{
"query": {
"match_all": {}
}
}
Search for a single word
GET twitter/_search
{
"query": {
"bool": {
"must": [
{ "match": { "account": "devnambi" }}
]
}
}
}
Search for multiple words
GET twitter/_search
{
"query": {
"bool": {
"must": [
{ "match": { "tweet": "Hello World" }}
]
}
}
}
Search for multiple lower-case words
GET twitter/_search
{
"query": {
"bool": {
"must": [
{ "match": { "tweet": "hello world" }}
]
}
}
}
Records for a given dataset name (w/ tokens)
Single-word:
GET twitter/_search
{
"query": {
"bool": {
"must": [{
"term": {
"tweet": "example"
}
}
]
}
}
}
}
Multiple-word:
GET twitter/_search
{
"query": {
"bool": {
"must": [{
"term": {
"tweet": "include"
}
},
{
"term": {
"tweet": "all"
}
},
{
"term": {
"tweet": "these"
}
},
{
"term": {
"tweet": "words"
}
}
]
}
}
}
}
Count of records by country
GET twitter/_search
{
"size": 0,
"aggs" : {
"source" : {
"terms" : {
"field" : "country.keyword"
}
}
}
}
Count number of records for a filter
GET twitter/_search
{
"size" : 0,
"query": {
"bool": {
"must": [
{
"match": {
"country.keyword": "US"
}
}
]
}
},
"aggs" : {
"source" : {
"terms" : {
"field" : "country.keyword"
}
}
}
}
Sum of retweets for an account
GET twitter/_search
{
"size" : 0,
"query": {
"bool": {
"must": [
{
"match": {
"account.keyword": "devnambi"
}
}
]
}
},
"aggs" : {
"source" : {
"sum" : {
"field" : "retweet_count"
}
}
}
}
Get tweet count by account
GET twitter/_search
{
"size" : 0,
"aggs" : {
"source" : {
"terms" : {
"field" : "account.keyword"
}
}
}
}
Get average size of all tweets in the UK
GET twitter/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"country.keyword": "UK"
}
}
]
}
},
"aggs":{
"avg_length" : { "avg" : { "script" : "_source.tweet.toString().getBytes(\"UTF-8\").length"}}
}
}
Get unique count of tweets in Germany
GET twitter/_search
{
"size" : 0,
"query": {
"bool": {
"must": [
{
"match": {
"country.keyword": "DE"
}
}
]
}
},
"aggs" :
{
"unique_filecount": {
"cardinality" :
{
"field" : "tweet"
}
}
}
}
Get records with missing retweet_count field by country (no retweets)
GET twitter/_search
{
"size": 5,
"query": {
"bool": {
"must_not": [
{
"exists": {
"field": "retweet_count"
}
}
]
}
},
"aggs": {
"sources": {
"terms": {
"field": "country.keyword",
"size": 10
}
}
}
}
Get multiple aggregations
GET twitter/_search
{
"size" : 0,
"aggs" : {
"countries" : {
"terms" : {
"field" : "country.keyword"
}
},
"accounts" : {
"terms" : {
"field" : "account.keyword"
}
}
,"languages" : {
"terms" : {
"field" : "language.keyword"
}
}
}
}
Sources: % Increase, Term Percentage, Search Agg Pipeline
Delete all tweets by account name
POST /twitter/_delete_by_query?conflicts=proceed
{
"query": {
"bool": {
"must": [{
"term": {
"account": "devnambi"
}
}
]
}
}
}
}
Index name in examples: twitter
Alias name in examples: social_media
Get Indexes
GET /_cat/indices?v
Delete an index
DELETE /twitter
Change the index refresh interval
PUT /twitter/_settings
{
"index" : {
"refresh_interval" : "1s"
}
}
Get aliases
GET /_cat/aliases?v
Set an alias
POST /_aliases
{
"actions" : [
{ "add" : { "index" : "twitter", "alias" : "social_media" } }
]
}
Remove an alias
POST /_aliases
{
"actions" : [
{ "remove" : { "index" : "twitter", "alias" : "social_media" } }
]
}
No piece of code is ever done. I can think of various improvements to improve query performance or add functionality. I’ll add those over time.
Happy coding!