18 May 2020

elastic: how many shards should I have

Aim for shard sizes between 10GB and 50GB

ref:    https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html


To check shard size:
GET _cat/shards?v=true&h=index,prirep,shard,store&s=prirep,store&bytes=gb&index=index_name*
# bytes=gb makes the store column report sizes in GB
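
A rough sketch for checking the output of the _cat/shards call above against the 10GB to 50GB guideline. The sample text is made-up data in the same column order (index, prirep, shard, store), and the function name is only for illustration.

```python
def shards_outside_range(cat_output, min_gb=10, max_gb=50):
    """Return (index, prirep, shard, size_gb) for shards outside the size range."""
    flagged = []
    lines = cat_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row printed because of v=true
        parts = line.split()
        if len(parts) < 4:
            continue  # e.g. unassigned shards have no store value
        index, prirep, shard, store = parts[:4]
        size_gb = float(store)
        if size_gb < min_gb or size_gb > max_gb:
            flagged.append((index, prirep, shard, size_gb))
    return flagged


# Made-up sample, shaped like the _cat/shards output with bytes=gb
sample = """index        prirep shard store
logs-2020.05 p      0         3
logs-2020.04 p      0        27
logs-2020.04 r      0        27
logs-2020.03 p      0        64
"""

flagged = shards_outside_range(sample)
for row in flagged:
    print(row)
```

Shards that are too small usually mean too many indices or shards; shards that are too big suggest increasing number_of_shards or rolling over more often.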




-----------------------old-------------------------

"settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "1"
    }
}


number_of_shards: how many primary shards per index
number_of_replicas: how many replica copies of each primary shard



# Note: one index per day might produce too many shards (the default limit is 1000 shards per ES node).
Consider one index per month instead (and increase number_of_shards if needed).

How many shards is too many? As a rule of thumb, the number of shards should be roughly the total number of CPU cores across your cluster (or double that if you have Hyper-Threading). Beyond that, more shards are unlikely to make searching any quicker.

# Some say a shard size between 2GB and 8GB is fine (older advice; the current recommendation above is 10GB to 50GB).
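
A quick back-of-the-envelope sketch of why daily indices add up. The one-year retention and the 3 primaries for the monthly case are assumptions picked only for illustration.

```python
def total_shards(num_indices, number_of_shards=1, number_of_replicas=1):
    """Total shard copies = indices * primaries * (1 + replicas)."""
    return num_indices * number_of_shards * (1 + number_of_replicas)


# One index per day, kept for a year, 1 primary + 1 replica each
daily = total_shards(365)

# One index per month, kept for a year, 3 primaries + 1 replica each
monthly = total_shards(12, number_of_shards=3)

print("daily indices:  ", daily)
print("monthly indices:", monthly)
```

Even with a single primary shard, the daily layout already uses most of a node's 1000-shard budget for just one log source.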


ref:  https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index

17 May 2020

filebeat -> logstash -> rabbitmq -> logstash -> elastic

1) filebeat - logstash
  normal case, you can google it

2) logstash - rabbitmq
  https://stackoverflow.com/questions/23207812/logstash-rabbitmq-output-never-posts-to-exchange
output { 
   rabbitmq {
      codec => plain
      host => localhost
      exchange => yomtvraps
      exchange_type => direct
      key => yomtvraps

      # these are defaults but you never know...
      durable => true
      port => 5672
      user => "guest"
      password => "guest"
   }
}



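3) rabbitmq - logstash
  https://discuss.elastic.co/t/rabbitmq-as-logstash-input/95756
input {
   rabbitmq {
      host => "localhost"
      # 5672 is the AMQP port; 15672 is the RabbitMQ management UI and will not work here
      port => 5672
      heartbeat => 30
      durable => true
      exchange => "logging_queue"
      # must be an exchange type (direct, fanout or topic), not the exchange name;
      # "direct" is assumed here to match the output in step 2
      exchange_type => "direct"
   }
}
output {
   elasticsearch { hosts => "localhost:9200" }
   stdout {}
}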




4) logstash - elastic
   normal case, please google

good tmux tutorials for beginners

https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/


https://linuxize.com/post/getting-started-with-tmux/

Below are some of the most common commands for managing tmux windows and panes:
  • Ctrl+b c Create a new window (with shell)
  • Ctrl+b w Choose window from a list
  • Ctrl+b 0 Switch to window 0 (by number)
  • Ctrl+b , Rename the current window
  • Ctrl+b % Split current pane horizontally into two panes
  • Ctrl+b " Split current pane vertically into two panes
  • Ctrl+b o Go to the next pane
  • Ctrl+b ; Toggle between the current and previous pane
  • Ctrl+b x Close the current pane

16 May 2020

elastic roles privileges



To write to (ingest into) indices, the user must have a role with:
- cluster privileges: "manage_index_templates", "monitor", "manage_ilm"
- indices privileges: "write","create","delete","create_index","manage","manage_ilm"



To read the indices, the minimum indices privileges are:
- "read","view_index_metadata"
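
A minimal sketch of the two privilege sets above as role bodies. The index pattern "logs-*", the role name and the helper name are hypothetical; the dict follows the "cluster" / "indices" layout of the security role API, and nothing is sent to a cluster here.

```python
import json

WRITE_CLUSTER = ["manage_index_templates", "monitor", "manage_ilm"]
WRITE_INDICES = ["write", "create", "delete", "create_index", "manage", "manage_ilm"]
READ_INDICES = ["read", "view_index_metadata"]


def build_role(index_patterns, writer=False):
    """Build a role body for either a writer (ingest) or a read-only user."""
    return {
        "cluster": WRITE_CLUSTER if writer else [],
        "indices": [{
            "names": list(index_patterns),  # hypothetical index patterns
            "privileges": WRITE_INDICES if writer else READ_INDICES,
        }],
    }


reader_role = build_role(["logs-*"])
writer_role = build_role(["logs-*"], writer=True)
print(json.dumps(reader_role, indent=2))

# With a live cluster the body could then be sent with the python client, e.g.
# es.security.put_role(name="logs_reader", body=reader_role)
```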






15 May 2020

setup Proxmox cluster, and ceph storage to achieve hyperconvergence

combining multiple conditions to refine your search in elasticsearch

a good example to learn

Bool Query fields:
- must      (and)
- must_not  (not)
- should    (or)
- filter    (and, but without scoring)


example 1 :(field_1 = "mana" AND field_2 = "mari")
{ "query" : { "bool" : { "must": [{ "match": { "field_1": "mana" } }, { "match": { "field_2": "mari" } }] } } }



example 2 :(field_1 != "mana"  AND field_2 != "mari")
{ "query" : { "bool" : { "must_not": [{ "match": { "field_1": "mana" } }, { "match": { "field_2": "mari" } }] } } }



example 3 : (field_1 = "mana" OR field_2 = "mari")
{ "query" : { "bool" : { "should": [{ "match": { "field_1": "mana" } }, { "match": { "field_2": "mari" } }] } } }


example 4 : (field_1 = "mana")
{
  "query": {
    "bool" : {
      "filter": { "term": { "field_1": "mana" } }
    }
  }
}
### filter is much cheaper, as it does NOT compute a score



example 5: combining filter with other clauses:
(( field_1 = "mana" AND field_2 = "mari") AND field_3 = "arah")
{ "query" : { "bool" : { "must": [{ "match": { "field_1": "mana" } }, { "match": { "field_2": "mari" } }], "filter" : { "term": { "field_3": "arah" } } } } }
### again, 'filter' is cheaper than 'must', as it does NOT compute a score

ref:
https://www.elastic.co/blog/lost-in-translation-boolean-operations-and-filters-in-the-bool-query
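
If the bool queries are built from a script, small helpers like these keep the nesting readable. This is only a sketch: the helper names are my own, and the output is the same JSON as example 5, ready to pass as the search body.

```python
import json


def match(field, value):
    """A scored full-text clause."""
    return {"match": {field: value}}


def term(field, value):
    """An exact-value clause, typically used inside filter."""
    return {"term": {field: value}}


def bool_query(must=None, must_not=None, should=None, filter=None):
    """Assemble a bool query, leaving out clauses that were not given."""
    clauses = {"must": must, "must_not": must_not, "should": should, "filter": filter}
    return {"query": {"bool": {name: value for name, value in clauses.items() if value}}}


# Same as example 5: (field_1 = "mana" AND field_2 = "mari") AND field_3 = "arah"
example_5 = bool_query(
    must=[match("field_1", "mana"), match("field_2", "mari")],
    filter=[term("field_3", "arah")],
)
print(json.dumps(example_5, indent=2))
```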

14 May 2020

painless (elasticsearch)

To check whether a field exists (has a value) in the document:
if (doc['alert.metadata.direction.keyword'].size()!=0){
  // do something
}

or
if (!doc['alert.metadata.direction.keyword'].empty){
  // do something
}


11 May 2020

Query vs Filter [elasticsearch]

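Filters are cheaper than queries!

Query: involves relevance scoring, so it is expensive.
Filter: only a yes/no match with no scoring, and the result can be cached, so it is cheap.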


Query:
GET /_search
{
  "query": {
    "term": {
      "": {
        "value": ""
      }
    }
  }
}
expensive.



Wrapped in a constant_score filter to speed things up:
GET /_search
{
  "query": {
    "constant_score" : {
      "filter" : {
        "term" : {"" : ""}
      }
    }
  }
}
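
The same wrapping can be done from python before sending the search. A small sketch; the field name "status" and value "active" are placeholders I made up.

```python
def as_constant_score(clause):
    """Wrap a query clause so it runs as a filter: no scoring, every hit gets the same score."""
    return {"query": {"constant_score": {"filter": clause}}}


scored = {"query": {"term": {"status": {"value": "active"}}}}   # hypothetical field/value
unscored = as_constant_score({"term": {"status": "active"}})
print(unscored)
```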



ref:  https://towardsdatascience.com/deep-dive-into-querying-elasticsearch-filter-vs-query-full-text-search-b861b06bd4c0

python elasticsearch



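To read a document:
  from elasticsearch import Elasticsearch

  es = Elasticsearch("http://localhost:9200")   # assumed local node; adjust to your cluster
  res = es.get(index='rules-test',
               doc_type='_doc',
               id='cJG_UnEB1e0J32pjx0pG')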



To update a document:
  doc = {
      "doc": {           # <-- for a partial update the fields must go under the "doc" key
          "new_or_existing_field": {
              "sub_field": {
                  "other_name": "cadang1",
                  "other_class": "cadang2"
              }
          }
      }
  }

  es.update(index='rules-test',
            doc_type='_doc',
            id='cJG_UnEB1e0J32pjx0pG',
            body=doc)




To filter results (and list the unique values with a terms aggregation):
  query = {
      "query": {
          "bool": {
              "must": [
                  {"term": {"feed.keyword": args.feed}},
                  {"term": {"version.keyword": args.version}}
              ]
          }
      },
      # terms aggregation: up to 200 unique file names among the matching documents
      "aggs": {
          "uniq_f_name": {
              "terms": {
                  "field": "file_name.keyword",
                  "size": 200
              }
          }
      },
      "size": 0    # return only the aggregation, no hits
  }
  result = es.search(index=args.es_index, body=query)
  # print(json.dumps(query), "\n", result)

  for bucket in result['aggregations']['uniq_f_name']['buckets']:
      print(bucket)




To count:
- pass 'size=0' when calling the search() method
res = client.search(index = "indexname", doc_type = "doc_type", body = q, size=0)

# By default res['hits']['total']['value'] is capped at 10,000 (with "relation": "gte"); for an exact count add track_total_hits=True or use the count API. A plain search also cannot page past 10,000 documents, so use scroll to retrieve them all.



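To use scroll:
result = es.search(index="indexname", body=query, scroll='1m')
scroll_id = result['_scroll_id']
while len(result['hits']['hits']) > 0:
    print(result['hits']['hits'])          # process the current page first
    result = es.scroll(scroll_id=scroll_id, scroll='1m')
    scroll_id = result['_scroll_id']       # the id can change between pages
es.clear_scroll(scroll_id=scroll_id)       # free the search context on the server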

Elasticsearch query API

List all indices:
  GET /_cat/indices


List all fields in an index (returns the mappings and settings):
  GET /rules-test


List documents (records) in an index (returns the first 10 by default):
   GET /rules-test/_search/


Get document by ID:
   GET /rules-test/_doc/cJG_UnEB1e0J32pjx0pG



Count documents:
   GET /rules-test/_count?q=user:kimchy

10 May 2020

Timezone for mft, plaso, elasticsearch

The MFT stores timestamps in UTC; the user interface (Windows Explorer) converts them to the chosen timezone for display.

log2timeline parses the disk partition and puts the data in a .plaso file, using the UTC timezone.

When psort.py runs against the .plaso file, it produces a .csv file, also in UTC.


When the data in the csv is pushed into elasticsearch, elasticsearch always assumes the timezone is UTC.

Then, when kibana displays the data in the browser, it converts the timestamps based on the browser timezone (which is the same as the user's desktop timezone).


But bear in mind, if you query elasticsearch directly with your own script/tools, the timestamps are in UTC.

Misleading psort.py (plaso) timezone parameter


psort.py -z Singapore  -o l2tcsv -w result.csv  input_data.plaso



psort.py has a '-z' option, which is the timezone parameter.

But it does not convert the time to the chosen timezone; it only changes the label.


For example, if we run the command with '-z UTC', the result is:
     04/15/2020,23:59:26,UTC, ........

And if we run the psort command with '-z Singapore', the result is:
   04/15/2020,23:59:26,Singapore, .....


The timestamp is exactly the same.
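
Since the timestamp stays in UTC, the conversion has to be done afterwards. A minimal sketch, assuming the first two csv columns are date (MM/DD/YYYY) and time (HH:MM:SS) as in the rows above, and using a fixed UTC+8 offset for Singapore; the function name is my own.

```python
from datetime import datetime, timedelta, timezone

# Singapore is UTC+8 all year (no daylight saving), so a fixed offset is enough here
SGT = timezone(timedelta(hours=8), "Singapore")


def convert_row(date_str, time_str):
    """Convert the UTC date/time columns of one csv row to Singapore time."""
    utc_time = datetime.strptime(f"{date_str} {time_str}", "%m/%d/%Y %H:%M:%S")
    utc_time = utc_time.replace(tzinfo=timezone.utc)
    local = utc_time.astimezone(SGT)
    return local.strftime("%m/%d/%Y"), local.strftime("%H:%M:%S"), "Singapore"


converted = convert_row("04/15/2020", "23:59:26")
print(",".join(converted))
```

Note that the date column changes too when the shift crosses midnight.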


07 May 2020

Kibana substring in scripted fields

Let's say we have a field datetime of type date in elasticsearch (eg: "2018-09-15T17:16:47").
We want to extract the hour part (17 in this example, which the script then shifts to local time).

We can achieve this by creating a scripted field, either (1) as a kibana scripted field, or (2) directly in the query.
Name it sc-hour_myt.


(1) kibana scripted field.
String masa = String.valueOf(doc['datetime'].value); 

// since my timezone is +0800
int hour = Integer.parseInt(masa.substring(11,13)) + 8 ; 
if (hour >= 24) {
hour = hour - 24;  
}

String minute = masa.substring(14,16);
String second = masa.substring(17,19);

String utc = String.valueOf(hour) + minute + second;
return hour;


(2) put directly in your query:
{
  "query": {
    "match_all": {}
  },
  "_source": ["datetime"],
  "script_fields": {
    "sc-hour_myt": {
      "script": {
        "lang": "painless",
        "source": """
              String masa = String.valueOf(doc['datetime'].value);
              int hour = Integer.parseInt(masa.substring(11,13)) + 8 ;
              if (hour >= 24) {
              hour = hour - 24; 
              }
              String minute = masa.substring(14,16);
              String second = masa.substring(17,19);
             
              String utc = String.valueOf(hour) + minute + second;
              return hour;
        """
      }
    }
  }
}


--------------------------------------------------------------

This will result in:
{ "_id": "CqSU6nEBOMr994UREJt1", "datetime": "2018-09-15T07:16:47", "sc-hour_myt": [ 15 ] },
{ "_id": "C6SU6nEBOMr994UREJt1", "datetime": "2018-09-15T07:16:47", "sc-hour_myt": [ 15 ] },
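
To double-check the arithmetic outside of kibana, here is the same substring-and-shift logic as a small python sketch (the function name is my own). It uses the same character positions 11 to 13 as the painless script and the same +8 offset.

```python
def local_hour(datetime_str, offset_hours=8):
    """Take the hour from an ISO-like UTC timestamp and shift it, wrapping past midnight."""
    hour = int(datetime_str[11:13]) + offset_hours
    if hour >= 24:
        hour = hour - 24
    return hour


print(local_hour("2018-09-15T07:16:47"))   # same input as the result above
print(local_hour("2018-09-15T17:16:47"))   # wraps past midnight
```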