Sunday, December 7, 2014

Phrase search on not_analyzed fields?



The short answer is no: you can’t do phrase search on not_analyzed fields in Kibana (I tested with Kibana 3).

The long answer is more interesting. Let us start the story from the beginning. 

I use ELK to analyze static log files retrieved from customers. Logstash organizes the mass of logged information into events such as “DB Connection Error” and “restart”, based on regex pattern matching. Kibana displays event histograms and helps me analyze event correlations. To build a histogram on the event field, the field must not be analyzed, so the mapping template specifies this:


"event": {"type": "string", "index": "not_analyzed" }
 
Kibana issues the following histogram aggregation request to ElasticSearch:

{
    "query": {
        "filtered": {
            "query": {
                "bool": {
                    "should": [
                        {
                            "query_string": {
                                "query": "event:restart"
                            }
                        }
                    ]
                }
            },
            "filter": {
                "bool": {
                    "must": [
                        {
                            "range": {
                                "@timestamp": {
                                    "from": 1416960000000,
                                    "to": 1417046400000
                                }
                            }
                        }
                    ]
                }
            }
        }
    },
    "aggregations": {
        "0": {
            "date_histogram": {
                "field": "@timestamp",
                "interval": "10m"
            },
            "aggs": {
                "1": {
                    "terms": {
                        "size": 200,
                        "field": "event"
                    }
                }
            }
        }
    },
    "size": 0
}
 


So far everything seems normal, but here comes the twist of the story. Suppose ElasticSearch holds a few “DB Connection Error”, “restart” and other events. The table below shows the results for different query strings.



Query string                        Result
"event:restart"                     Correct result
"event:rest*"                       Correct result
"event:\"DB Connection Error\""     Correct result
"event:DB Connection Error"         Incorrect result
"event:DB Connection"               Incorrect result
"event:DB*"                         Zero results




I couldn’t reconcile these different results, so I used PerfSpy to capture the code trace, and discovered how ElasticSearch deals with the query string.



  "query": "event:DB Connection"




For the query string "event:DB Connection", ElasticSearch constructs a BooleanQuery with two TermQuery clauses. Notice that the second TermQuery’s field is “_all”, not “event”. Surprise!


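This solves the mystery of why "event:\"DB Connection Error\"" gets the correct result while "event:DB Connection Error" and "event:DB Connection" do not: the field prefix applies only to the word or quoted phrase that immediately follows it, and every other unquoted word is searched in “_all”.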
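
The field assignment can be sketched in a few lines of Python. This is only a toy model of the behavior seen in the trace, not ElasticSearch's or Lucene's actual parser: parse_query_string and CLAUSE are hypothetical names, and operators, wildcards and escaping are ignored.

```python
import re

# Optional "field:" prefix, then either a quoted phrase or a run of non-space characters.
CLAUSE = re.compile(r'(?:(\w+):)?(?:"([^"]*)"|(\S+))')

def parse_query_string(query, default_field="_all"):
    """Split a query string into (field, text, kind) clauses."""
    clauses = []
    for field, phrase, word in CLAUSE.findall(query):
        kind, text = ("term", word) if word else ("phrase", phrase)
        # A clause without its own "field:" prefix falls back to the default field.
        clauses.append((field or default_field, text, kind))
    return clauses

unquoted = parse_query_string('event:DB Connection')
quoted = parse_query_string('event:"DB Connection Error"')
```

In this model the unquoted query yields one clause on “event” and one on “_all”, while the quoted query stays a single clause on “event”.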

What about "event:DB*"?

"query": "event:DB*"


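 The mystery is solved. ElasticSearch constructs a PrefixQuery, however the query text is db, not DB. PrefixQuery itself is case sensitive and its text is not analyzed; it is the query_string parser that lowercases prefix and wildcard terms by default (the lowercase_expanded_terms option). Since the event field is not analyzed, “DB Connection Error” is indexed verbatim as “DB Connection Error”, so “db*” doesn’t hit anything.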
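
A minimal sketch of that mismatch, assuming the lowercasing happens in the query_string parser before the case-sensitive prefix comparison (in ElasticSearch 1.x this is governed by the lowercase_expanded_terms option, which defaults to true). The function prefix_query is a hypothetical stand-in, not a real API.

```python
def prefix_query(indexed_terms, prefix, lowercase_expanded_terms=True):
    """Return the indexed terms that a prefix would match, comparing raw strings."""
    if lowercase_expanded_terms:
        # The parser lowercases the prefix text; the indexed terms are left as they are.
        prefix = prefix.lower()
    return [term for term in indexed_terms if term.startswith(prefix)]

# A not_analyzed field indexes each value verbatim, case included.
indexed = ["DB Connection Error", "restart"]

hits_db = prefix_query(indexed, "DB")
hits_rest = prefix_query(indexed, "rest")
hits_db_case_kept = prefix_query(indexed, "DB", lowercase_expanded_terms=False)
```

“rest*” works only because the indexed term happens to be lowercase already; “DB*” would match if the prefix were left untouched.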

Keyword tokenizer + lowercase filter


The solution for “db*” is easy: what is needed is an analyzer that does lowercasing and nothing else. Add such an analyzer to the mapping template:

{
  "template" : "logstash-*",
  "settings" : {
    "index.refresh_interval" : "5s",
    "index": {
      "analysis": {
        "analyzer": {
          "lowcasekeyword": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [
              "lowercase"
            ]
          }
        }
      }
    }
  }
….
"event": {"type": "string", "analyzer": "lowcasekeyword" },
"subevent": {"type": "string", "analyzer": "lowcasekeyword" }
 


When constructing a query, ElasticSearch will consult the field mapping and use the same analyzer to analyze the query string.
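
Here is a rough Python model of what the lowcasekeyword analyzer changes, assuming the keyword tokenizer emits the whole input as a single token and the lowercase filter then lowercases it. The helpers index_terms, term_matches and prefix_matches are invented for this sketch.

```python
def lowcasekeyword(text):
    """Mimic the custom analyzer: one token for the whole input, lowercased."""
    return [text.lower()]

def index_terms(values):
    # Index time: every field value goes through the analyzer.
    return {term for value in values for term in lowcasekeyword(value)}

def term_matches(indexed, query_text):
    # Query time: the query text goes through the same analyzer.
    return any(term in indexed for term in lowcasekeyword(query_text))

def prefix_matches(indexed, prefix):
    # The lowercased prefix now lines up with the lowercased indexed terms.
    return sorted(term for term in indexed if term.startswith(prefix.lower()))

indexed = index_terms(["DB Connection Error", "restart"])
full_match = term_matches(indexed, "DB Connection Error")
prefix_hits = prefix_matches(indexed, "DB")
```

One side effect to keep in mind: because the indexed term itself is now lowercase, the terms aggregation buckets show up as “db connection error” rather than “DB Connection Error”.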




But there is still no way to do “DB Connection*” type of search.




 


