Since Elasticsearch 2, the default translog durability (index.translog.durability) was changed from async to request. Change it back to async and you should see performance similar to ES 1.7. https://www.elastic.co/guide/en/elasticsearch/guide/current/translog.html
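If you want to try that, a minimal sketch using the update settings API (index name is a placeholder). Keep in mind async means you can lose the last few seconds of acknowledged writes on a crash; that's the tradeoff that made 1.7 faster:

PUT /my-index/_settings
{
  "index.translog.durability": "async"
}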
It's pretty much wrong to call these things out as "mistakes" IMO. All the Zen discovery settings were deprecated over two years ago; they don't do anything any more except emit warnings that you're using a deprecated setting. Similarly, the official recommendation is not to use bootstrap.memory_lock: you should prefer simply to disable swap altogether.
[I work at Elastic]
In 7.10 there are searchable snapshots - https://www.elastic.co/blog/whats-new-elastic-7-10-0-searchable-snapshots-lens-user-experience-monitoring
There are stark differences between how AWS handles Free and Open Source Software and how Google handles it. Say what you will about Google and their data vacuuming, but when you look at how Google treats open source projects, it really makes you want to choose them over AWS.
AWS clones an incomplete version of MongoDB and releases it for production; same thing with Elasticsearch and many, many other services.
On the other hand... Google takes the time to work with the vendors of open source projects and gets them to integrate directly into GCP, so you can actually get a legit version of your open source project.
https://cloud.google.com/blog/products/open-source/bringing-the-best-of-open-source-to-google-cloud-customers https://www.elastic.co/gcp
For context: I am one of those crazy people that has Amazon Alexas and then goes on /r/privacytoolsIO and tells my friends and family not to let too much of their lives get taken over by Google.
So, I have probably had this conversation with 100 different people.
My honest answer is that by far the most preferred solution is to denormalize your data. The second best solution would be to do application side joins with multiple queries.
To be clear, you can do queries to find parents of children, so it is possible to do what you mentioned. However, I see many database-minded people try and do things with parent-child relationships and nested objects and put a lot of effort into something that is ultimately abandoned (almost always due to scalability or performance issues). You may be the one that figures it out, but my experience would tell me it is unlikely.
I started off in the SQL world and know that the Elasticsearch paradigm is different, and I empathize. ES is designed to return results back in milliseconds, so things like an application-side join are quite practical. Denormalization makes queries lightning fast, so denormalize if at all possible.
Try to use Elasticsearch like a search engine, not an RDBMS, and you will be much happier.
Good luck 😀
[I work for Elastic]
As others have pointed out, the best practice for a production cluster starts at 3 nodes. Here's what the docs say:
> High availability (HA) clusters require at least three master-eligible nodes, at least two of which are not voting-only nodes. Such a cluster will be able to elect a master node even if one of the nodes fails.
If you're running in Elastic Cloud, you can provision smaller nodes and get a 3rd Master node (which only needs 1GB RAM) for free. There's a nice configurator here. You'll also be able to use Machine Learning there, too.
I also work at Elastic.
Hopefully, this should clear this up: https://www.elastic.co/pricing/faq/licensing (updated to include more information)
The TL;DR (Please read the whole FAQ, anyhow) is that if you are already using the default distribution under the Elastic license, it's the same as it has been, per this paragraph:
> If you download and use our default distribution of Elasticsearch and Kibana, nothing changes for you. Our default distribution continues to be free and open under the Elastic License, as it has been for nearly the last three years. If you build applications on top of Elasticsearch, nothing changes for you. Our client libraries continue to be licensed under Apache 2.0. If you use plugins on top of Elasticsearch or Kibana, nothing changes for you.
If you are building from source/modifying the source and compiling it yourself, to host a service, you can reach out to:
Have you reviewed the documentation?
For example https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html has some initial tips (including noting that more shards doesn't necessarily mean better query perf), and that page will open the correct section of the index to link to a few other basic tuning pages.
It depends on what you mean by "safe" (all software contains bugs; some you may never encounter, and others can be bad for you), but as I understand it, it means there will be no more releases of that particular version.
7.11.x will not have a higher value for x. That doesn't mean that it is not safe, but if you need a fix or feature, you should go to a newer release that has it, since this will not be backported to a 7.11 version, if that makes sense.
EOL means a bit more to subscription customers (https://www.elastic.co/support_policy has a bit more on that), but that's the general gist of it.
It's probably a bit of an anti-pattern to run two intensive applications on the same host without properly segregating the resource usage (i.e. cgroups). If one Java process is trying to allocate direct memory buffers but the OS is under memory pressure, then I'd say increased GC time and more frequent GC events are likely.
You could consider changing the index store type if nothing else, though changing it to something like niofs will likely incur a larger context-switching penalty and overall CPU overhead for disk I/O.
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html
We didn't end up using it, but we have looked into it quite a bit.
It mainly seems to focus on important services that Elasticsearch keeps locked behind the X-Pack paywall. The main things we were looking at:
That said, we ended up not using it. Main reasons:
I can't speak for versions as old as this one, but in modern Elasticsearch 21GB is a pretty small index that's probably best suited to a single-shard configuration. A common recommendation is to aim to have each shard in the 20-40GB range: https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
Refer to the documentation here:
"The fact that some data is numeric does not mean it should always be mapped as a numeric field. The way that Elasticsearch indexes numbers optimizes for range queries while keyword fields are better at term queries. Typically, fields storing identifiers such as an ISBN or any number identifying a record from another database are rarely used in range queries or aggregations. This is why they might benefit from being mapped as keyword rather than as integer or long."
It's not a database; it doesn't have ACID compliance.
make sure you read https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
If you install a new ES 6 cluster next to your ES 2 cluster, you can use the reindex-from-remote API; it will reindex the old ES 2 data into ES 6: https://www.elastic.co/guide/en/elasticsearch/reference/current/reindex-upgrade-remote.html
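A rough sketch of what that looks like (host and index names are placeholders; the old host must also be listed in reindex.remote.whitelist on the new cluster):

POST _reindex
{
  "source": {
    "remote": { "host": "http://old-es2-host:9200" },
    "index": "my-old-index"
  },
  "dest": { "index": "my-new-index" }
}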
Yes, that would be a great use case for Watcher.
Check out the documentation here. I think you would want to take the first example and change the query to look for your condition instead of the word 'error'. It should be a simple substitution.
One more thing: X-Pack takes all the individual plugins (Watcher, Marvel, Shield, Graph) and some new features and bundles them into a single integrated installer that is much more unified and GUI friendly. X-Pack has all the pieces to help you build an enterprise-ready production system. It is for 5.0 and above, but if you are using 2.X you can install the plugins (like Watcher) individually.
Full Disclosure: I work for Elastic. Feel free to PM me.
Hi.
I work for Elastic.
(By the way, it may help you to search for Elasticsearch vs Elastic Search)
There's also this: https://www.elastic.co/training/free which has some courses that may help you get started. There's also a trick, I've found when looking for how to use a specific product set, which is to search on Google for "Getting started with XYZ" which is the product you are trying to use.
We also have paid courses if you'd like.
As an assumption, I figured, if you are adding it as a search bar to your website, this may be a good start: https://www.elastic.co/guide/en/app-search/current/getting-started.html
Working with arrays of data inside ES isn't particularly intuitive.
If you've got specific groupings you're looking for, the filters aggregation is what you need. It would look something like this:
"aggs" : {
"data" : {
"filters" : {
"filters" : {
"a|b" : {
"bool" : {
"filter" : [
{ "term" : { "data" : "a" } },
{ "term": { "data" : "b" } }
]
}
},
}
}
}
}
If you're looking to query against terms based on the entire array, you might be better off merging it as part of indexing or otherwise finding a way to restructure the data to suit this need.
And consider using a filter rather than a query - https://www.elastic.co/guide/en/elasticsearch/reference/7.15/query-filter-context.html - these are usually more efficient to run.
It depends on your mapping. Ideally, accountId has a "keyword" mapping. Then you can use a "term" query for an exact match.
More info here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
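Something like this, assuming accountId is mapped as keyword (index name and value are placeholders):

GET /my-index/_search
{
  "query": {
    "term": { "accountId": "abc-123" }
  }
}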
I would encourage you to look into Transforms[1].
You can use transforms to create summary indices that you can use for reporting purposes and delete them once you are done with them.
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/transforms.html
try source filtering when you are doing your request - https://www.elastic.co/guide/en/elasticsearch/reference/7.14/search-fields.html#source-filtering
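A minimal sketch, with placeholder index and field names:

GET /my-index/_search
{
  "_source": ["title", "created_at"],
  "query": { "match_all": {} }
}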
that said, this sort of extraction is typically done as you index the data to Elasticsearch, not after
Use the mapping API and define your mappings prior to writing data to the index. You can use dynamic mappings and index templates to apply mappings when an index is created. https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
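For example, a hedged sketch of a composable index template (7.8+; the template name, pattern, and fields are placeholders) that applies mappings to any new matching index:

PUT _index_template/my-logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "mappings": {
      "properties": {
        "message": { "type": "text" },
        "client_ip": { "type": "ip" }
      }
    }
  }
}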
(I work at Elastic)
Have you seen the new frozen data tier and searchable snapshots? You can make rotating data into S3/GCS/Azure Object Store a part of your index lifecycle. And when needed you can search that data too.
Frozen Tier: https://www.elastic.co/blog/introducing-elasticsearch-frozen-tier-searchbox-on-s3
Searchable Snapshots: https://www.elastic.co/blog/introducing-elasticsearch-searchable-snapshots
It shows shards as:
index-name shard-index shard-type status doc-count size node
So you're taking the first field - the name of the index - and piping it to the delete index API.
I see there's an article rather dubiously suggesting this :/
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
Read the whole thing. Every page.
Repeat every example, with variants on your own cluster.
Work very hard to not compare Elasticsearch to a relational database. Elasticsearch is not a database.
https://www.elastic.co/blog/why-license-change-AWS
Obviously this is going to be a one-sided blog post, but it's hard not to empathise with Elastic (yes, I'm empathising with a multi-billion dollar company, lol). I think suddenly removing that Apache license was always going to scare many people, though, and probably wasn't the best decision.
[I work at Elastic]
Since we published our initial blog, we have added two posts with additional details: License change clarification and Why we had to change the license.
Definitely have a look at the Visual Builder in Kibana. I've built funnels using it. It can do multiple percentages on the same visualization (filter ratio), and if you have multiple Visual Builder visualizations on the same dashboard, they share a cursor. Also, you can save the time period as part of the dashboard. I use the latest version of Kibana, so I hope what I described is available in what you're using.
https://www.elastic.co/guide/en/kibana/current/time-series-visual-builder.html
An Elasticsearch index is analogous to a database table. You should flatten your data before storing it in Elasticsearch. The best way to get data from SQL Server into Elasticsearch would be to use Logstash: JDBC Input https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html
Elasticsearch Output:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
Reindex by query with a sliding window every hour, or whatever. Curator could kick this off, or just script it in whatever language.
The easiest way to test behavior is in Dev Tools with a sample document & index. For example:
# Create a new index & document
PUT /test/_doc/1
{ "foo": "bar" }

# Get the document
GET /test/_doc/1

# Replace the document with a new field (a plain PUT replaces the whole doc)
PUT /test/_doc/1
{ "fizz": "buzz" }

# Replace it again with an updated field value
PUT /test/_doc/1
{ "foo": "test" }

# Delete the document & index
DELETE /test
Elasticsearch is a search index. It helps not to think of it as a SQL database, but instead as a flattened datastore built for fast information retrieval. There is no recursion or foreign keys. You send data (aka documents) into "indexes" and Elasticsearch indexes that data based on field types. There's some high-level guidance on how to think about modeling your data in Elastic that I've written here.
[I work for Elastic] (but not in sales, and can empathize, as I was a former customer, too)
While this may be a single feature you wish to use, there are a lot more features that Gold provides, including access to support, which is backed up with our developers, as well.
https://www.elastic.co/subscriptions shows the other features that are available at each level.
https://cloud.elastic.co/pricing also has pricing if you did want it hosted, and you could go month-to-month on a gold or higher level with that, and it shows you the approximate pricing it would cost (approximate because there are data transfer costs, and storage costs for snapshots, which are separate and depend on your activities)
I'm not sure if you've had on-premises software vendor pricing for other tools (and this is not in the slightest meant as a disparaging remark), but comparatively, for the functionality, the pricing is far lower than some and relatively in line, though I don't know if that provides you any comfort.
You could use a unique attribute of each document as the _id field for the document, or use the fingerprint processor in an ingest pipeline. This would overwrite any existing documents in your index with the same _id when you write a new batch.
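As a quick illustration of the first approach (index name and ID are made up):

# First batch creates the document
PUT /my-index/_doc/order-1001
{ "status": "pending" }

# A later batch using the same unique attribute as _id overwrites it
PUT /my-index/_doc/order-1001
{ "status": "shipped" }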
You could also look into using a latest transform to create an index that only contains the most recent version of the data for each unique record. It also supports TTLs if you need to age off old data.
The definitive guide is dated but comes from more of a background you’re asking for, and those concepts have changed much less than the stack setup, etc:
https://www.elastic.co/guide/en/elasticsearch/guide/current/search-in-depth.html
So, first thing to know is that in an EQL sequence, order matters. You should list the sequence items in the order that you want to find them (based upon @timestamp). Next, for each query in the brackets, the first word is used to filter the query based upon event.category. In your case, you have file events and network events. The last thing to determine is whether you have data for all the events you're looking for.
If you had Packetbeat on the host (now available in Fleet on 7.15!), you could get the network fields for http. You could get the file fields using Endpoint Security. There are definitely some other data sources that could provide this information (such as Zeek or maybe a proxy), but you might have to adjust which fields you use.
I've taken your query and modified it to use ECS field names, which would align with the packetbeat + endpoint scenario above. I don't have sample data to test the network logs on, but this is at least a valid query that should do what you want.
sequence by host.hostname with maxspan=30s
[network where http.request.method : "GET"]
[network where http.response.status_code == 200]
[file where file.extension : "html"]
[network where http.request.method : "GET"]
[network where http.response.status_code == 200]
[file where file.extension : "cab"]
*Note: I used host.hostname because it's a well-defined field and should match the output of the hostname command. host.name is more subjective and could be a "friendly name", and may not be consistent across different data shippers.
Full disclosure, I work at Elastic. Not trying to peddle wares above with the beats, but trying to use a well-defined example.
Appreciate the responses!
I've taken a look at some of the free offerings for ECE - Specifically https://www.elastic.co/training/ece-fundamentals
But that training tends to focus on how to use ECE and less on how ECE works; this breeds a dependence on ECE support whenever problems occur within the environment, essentially making me, as an administrator, not very confident managing the software.
Are there good writeups on how ECE works? I have no experience with ELK prior to this position, but Elasticsearch/Kibana are relatively intuitive - it's the ECE distributed systems that stump me. I've tried to read up on ZooKeeper, but I'm having issues relating what I'm reading about ZooKeeper to the ECE environments.
Any advice is appreciated thank you
I would try / consider the following:
100GiB / day isn't huge, but it's also not trivial. You shouldn't tolerate 30-90 second loads here.
I wouldn't buy it. The book is outdated and also free online here:
https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
If you're new to Elasticsearch, you won't know what is still relevant. A lot of the information in this guide has been incorporated into the Elasticsearch documentation. Your best resource is the Elasticsearch documentation & Elastic website. There's also various blogs and videos on the Elastic web site. I'd focus on specific topics you're interested and review the applicable documentation, videos and blogs.
Other resources that can be helpful for specific questions is Stack Overflow and the Elastic forums.
Howdy! A couple of things to consider:
1) I'm assuming you're using one of the more popular logging libraries; are you using the ECS layout for your logs to make them output in the Elastic Common Schema from the start? https://www.elastic.co/guide/en/ecs-logging/java/current/setup.html
2) Have you considered elastic APM for Java? It can work with your logs (if using one of the preferred log libraries) and bring in your app traces right alongside - pretty sweet!
3) Check out the Elastic Agent, it's like a lot of the beats combined!
It's a bit different: https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html
Basically, the gist of it is this. Data in Elasticsearch is stored in something called indices. An index is made up of one or more shards (which are actually a Lucene index, as Lucene is the main library that Elasticsearch is built on, but this is more for context, than having to understand the inner workings of Lucene).
Those shards are made up of segments, and those segments are what map to files on disk.
Snapshots are an intelligent way to take segments and the changes that happen in the index and incrementally back those up.
So, for a quick rundown/scenario, let's say you take a snapshot on Aug 1st, 2021. Then you take another Aug 2nd, 2021 (You can take them every hour, or every 4hrs or whatever, but this is for simplicity)
Only the changed segments are then shipped to the repository that holds your snapshots.
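For a rough idea of the mechanics (repository name and bucket are placeholders; the S3 repository type needs the repository-s3 plugin):

# Register a snapshot repository once
PUT _snapshot/my_backups
{
  "type": "s3",
  "settings": { "bucket": "my-snapshot-bucket" }
}

# Then take snapshots against it (only changed segments get uploaded)
PUT _snapshot/my_backups/snapshot-2021-08-01?wait_for_completion=true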
You can delete a snapshot, but if some of its segments are needed to satisfy another snapshot, it doesn't delete those segments.
SO... long story short (too late, sorry): you can have snapshots of your EDR/XDR going back 3 months, 9 months, a year, etc., and if you need to check an event and say "Wait, something happened with $THING. Let's check how far back this went," you can actually search into your snapshots (Elasticsearch will, using special nodes called frozen nodes) and pull back the data you are searching for. It may take 3, 5, 9 minutes or so, but you can go back and say "Oh, this happened 5 months ago, on a Tuesday" without having to keep all that data live on nodes that cost more to run.
[I work for Elastic]
Your iOS app will likely be talking to a back-end of some kind, where you'll likely have a database to store other information needed by your app. Elastic should sit right next to the database in your architecture, so that all communication with it is proxied through your back-end. When you do this, your back-end will apply filters (no paid license required) so that a user query can only see resources they created (these are stored in Elasticsearch as "documents"). Elastic offers hosted clusters if you want to try something out, at https://cloud.elastic.co/pricing.
Here's how Filters work: https://www.elastic.co/guide/en/elasticsearch/reference/current/filter-search-results.html
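A hedged sketch of what your back-end might send (field names like owner_id are assumptions for illustration):

GET /documents/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "body": "the user's search terms" } }
      ],
      "filter": [
        { "term": { "owner_id": "user-123" } }
      ]
    }
  }
}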
If there isn't an existing filebeat module for those types of logs you can do some simple parsing in the filebeat agent with the dissect processor (https://www.elastic.co/guide/en/beats/filebeat/master/dissect.html), but likely your best bet would be to use an ingest pipeline (https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html). Ingest pipelines can do a lot of what logstash does from a parsing perspective (grok and etc) so you should be covered in most cases.
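For instance, a minimal ingest pipeline sketch with a dissect processor (the pipeline name and pattern are made up; adjust the pattern to your log format):

PUT _ingest/pipeline/parse-app-logs
{
  "processors": [
    {
      "dissect": {
        "field": "message",
        "pattern": "%{ts} %{level} %{msg}"
      }
    }
  ]
}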
Yes! Definitely, use your support contract for this. You can even ask support to connect you with your sales team. Your Solutions Architect should be able to connect you with some best practices, etc.
First step, create a support ticket: https://www.elastic.co/support/welcome
[I work at Elastic]
https://www.elastic.co/workplace-search with a custom source would do what you want; there's no out-of-the-box connector for that one, sorry.
As referenced here, under the section "Don’t Cross 32 GB!":
> Once you cross that magical ~32 GB boundary, the pointers switch back to ordinary object pointers. The size of each pointer grows, more CPU-memory bandwidth is used, and you effectively lose memory. In fact, it takes until around 40–50 GB of allocated heap before you have the same effective memory of a heap just under 32 GB using compressed oops.
https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html#compressed_oops
If you are using the default binary, you keep using the same license as you did in the past 2+ years. The graphic at the top of the FAQ hopefully makes that clear.
[Disclaimer: I work for Elastic]
First, you should always be pre-defining your mappings. It's the key to the speed of ES. The default guessed types may "work", but they aren't always optimal, and may in fact be wrong - e.g. a text type for a phone number would result in 123 456 7890 matching a query for 456 123 7890. IP addresses are another place where the default text+keyword field seems to work but is wildly permissive for the type, whereas ip lets you do range searches, use CIDR notation, etc.
To your direct question: just set index: false for the field to not index it, yet still store it for later retrieval and display.
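Roughly like this (index and field names are placeholders):

PUT /my-index
{
  "mappings": {
    "properties": {
      "raw_payload": {
        "type": "keyword",
        "index": false
      }
    }
  }
}

The value still lives in _source, so it comes back in search hits; you just can't query on it.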
It's the fact it's being read by the file input. The file input usually expects an event per line, so each line in your pretty-printed file is being treated as a separate event, which breaks up the pretty-printed JSON. Logstash is then trying to parse each event with the json filter, but blows up as it's only given a single line at a time.
Read the docs on the multiline codec; you might be able to use it to read the input file and construct a valid event:
https://www.elastic.co/guide/en/logstash/current/plugins-codecs-multiline.html
Sure! I've been using it since 0.12 and work there, helping our users get the most out of the stack.
As with anything, it's just a guideline, and if your speed suits your needs, so be it. You can and should always benchmark this for yourself, on your infrastructure, with your data, and see what the impact is, and if that outweighs the benefits or not, since you already have a sizable investment.
I’ve found using the char_filter works best. No reason for creating synonyms every time you want to eliminate hyphens.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
Just set the mapping of hyphens to an empty string.
{
  "your_index_name": {
    "aliases": {},
    "mappings": {
      "properties": {
        "your_search_field": {
          "type": "text",
          "analyzer": "custom_analyzer"
        }
      }
    },
    "settings": {
      "index": {
        "analysis": {
          "char_filter": {
            "my_hyph_filter": {
              "pattern": "-",
              "type": "pattern_replace",
              "replacement": ""
            }
          },
          "filter": {
            "my_stemmer": {
              "type": "stemmer",
              "language": "english"
            }
          },
          "analyzer": {
            "custom_analyzer": {
              "filter": [
                "lowercase",
                "my_stemmer"
              ],
              "char_filter": [
                "my_hyph_filter"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        }
      }
    }
  }
}
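You can sanity-check the analyzer with the _analyze API:

GET /your_index_name/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "e-mail"
}

It should come back as a single "email" token, with the hyphen stripped before tokenization.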
Hi, yeah, every time I need to check the docs for this: you need to add the S3 access key ID + secret key to your Elasticsearch keystore.
https://www.elastic.co/guide/en/elasticsearch/reference/current/elasticsearch-keystore.html
Check the docs for your version; then you only need to reference the keystore entry names.
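For the default S3 client, the entries look something like this (run on each node, then restart or reload secure settings):

bin/elasticsearch-keystore add s3.client.default.access_key
bin/elasticsearch-keystore add s3.client.default.secret_key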
I don't want to be offending or anything, but this has literally been one Google search and 2 clicks away from you:
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html#plugins-inputs-file-path
If you don't understand how to use this, you should probably just start reading about logstash and elastic from the very top.
You just need to set network.host to localhost. Then set up Nginx as follows:
http {
  server {
    listen 8080;
    location / {
      proxy_pass http://localhost:9200;
      allow x.x.x.x;
      deny all;
    }
  }
}
More info here: https://www.elastic.co/blog/playing-http-tricks-nginx
Persistent queues and back-pressure handling would be one:
https://www.elastic.co/guide/en/logstash/current/persistent-queues.html
I would also repeat and stress your first bullet. If some of my data feeds go through logstash, personally I would want them all so I have one point to track logging/status, reporting, analytics, etc.
Any reason you aren't using Beats, given you already have an ELK stack? Metricbeat already feeds system information into logstash/elastic, no need to reinvent the wheel here. See: https://www.elastic.co/products/beats/metricbeat
Holy crap.
Save yourself all that hassle and get Cerebro, and use the Cluster Allocation Explain API if shards won't allocate.
Have a read of this https://www.elastic.co/elasticon/2015/sf/scaling-elasticsearch-for-production-at-verizon
It's a bit older than some videos, but it shows the process that they went through and some of the issues faced when scaling for huge amounts of data.
> ElasticSearch
Elasticsearch
> Any ideas?
You can do this kind of thing with pipeline aggregations but it is super inefficient. You could do a more efficient thing with a scripted metric aggregation, but only if you indexed the data so that each sequence of documents for which you wanted to calculate the difference ended up on the same shard.
Neither way is nice because this is outside of Elasticsearch's wheelhouse. Elasticsearch is designed to find the best document or to tell you things about the documents in aggregate, but it has trouble telling you how documents relate to one another. A fairly efficient algorithm for getting this information would be to iterate the events in order, keep the time from the previous event, subtract, and store the max. Elasticsearch can't iterate the documents in order because it only has a single iteration order internally: whatever order Lucene feels like giving the documents back to it, which is somewhat correlated with arrival time. I could hack other iteration orders on top of this, but they'd be inefficient because it'd have to be done in two passes. That is pretty much what you'd do with the scripted metric aggregation.
But those algorithms only work if all of the documents that you want to compare are in the same shard. If they aren't in the same shard, then Elasticsearch would have to ship all of the hits back to a single node, something that it flat out refuses to do outside of the _scroll API.
So the answer is: this is not the sort of question Elasticsearch was built to solve.
Should Curator run on each one? --- Nope. Only run Curator on one node. One instance of Curator can manage data for many clusters at once.
Regarding your question about shrinking indices: you pretty much have the correct process. Yes, that function does reduce the total shard count of an index. A good start is to define the initial shard count for your daily index using an index template. Then use Curator to shrink the index over time, reducing the total amount of storage consumed. Then finally perform the force merge.
You said you were going to run the job every week? Sure you can do that. It's up to you. I personally run curator between 3am and 5am every single day because I'm managing a fairly large cluster (thousands of indices).
Pro-tip --- Get all of your Elastic Stack configuration files in source control (git). This is a must.
https://www.elastic.co/guide/en/elasticsearch/client/curator/current/shrink.html https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
[I work at Elastic]
We have a blog post from earlier this year on this topic: https://www.elastic.co/blog/hosted-elasticsearch-services-roundup-elastic-cloud-and-amazon-elasticsearch-service. It dives into some of the pricing comments.
> If it's deprecated in 2.2, I don't want to see it in the reference docs for that particular version.
We'd be slain by everyone that's used the software for a while if we just hid it. It still works until the next major version, and people will use it while they migrate. OTOH, I'd say it's a documentation bug if there is a deprecation notice that doesn't mention what your migration path should be. So the notices should contain a link to the right place. There are plenty of these bugs, though.
> Elasticsearch in Action
There is also the Definitive Guide. It is written for 1.x and is being worked on furiously to get it compatible with 2.x. It is genuinely hard to keep a book like that up to date.
> huge amount of inconsistency in the REST API, particularly concerning compound/bool queries
The whole bool query thing is funky. It exists because that's how the Lucene queries work. In some respects it is "closer to the metal" than any AND/OR/NOT queries would be.
Do you have any specific examples of inconsistent stuff? I'm happy to fix stuff, or you can submit a PR yourself if you are willing to sign the CLA.
Like you say - that analyzer is the issue. The default analyzer for strings is a decent default for full-text search over English-like languages. It's not so good for the more exact search you need for metrics. It's all about how the string is broken into terms. I think the guide does a better job explaining it than I would.
It's explained in the middle of the Index-Time Search page.
>Completion Suggester
>Using edge n-grams for search-as-you-type is easy to set up, flexible, and fast. However, sometimes it is not fast enough. Latency matters, especially when you are trying to provide instant feedback. Sometimes the fastest way of searching is not to search at all.
>The completion suggester in Elasticsearch takes a completely different approach. You feed it a list of all possible completions, and it builds them into a finite state transducer, an optimized data structure that resembles a big graph. To search for suggestions, Elasticsearch starts at the beginning of the graph and moves character by character along the matching path. Once it has run out of user input, it looks at all possible endings of the current path to produce a list of suggestions.
>This data structure lives in memory and makes prefix lookups extremely fast, much faster than any term-based query could be. It is an excellent match for autocompletion of names and brands, whose words are usually organized in a common order: “Johnny Rotten” rather than “Rotten Johnny.”
>When word order is less predictable, edge n-grams can be a better solution than the completion suggester. This particular cat may be skinned in myriad ways.
that's not correct. TLS and basic authentication and access control are free and have been for a few years now - https://www.elastic.co/blog/security-for-elasticsearch-is-now-free
things like AD integration are paid features, yes
I believe you're on the right track with the monitoring UI not showing logs for 6.x clusters. Based on a quick look at the Kibana stack monitoring code it seems like it depends on the logs having the `elasticsearch.cluster.uuid` field in them. This does not seem to exist in the logs for 6.x clusters, and as you said, 6.x does not support outputting logs in json format.
If it is important that the UI displays the logs from the appropriate cluster, you may be able to set the `elasticsearch.cluster.uuid` field to the correct value with a Filebeat processor.
If you're ingesting a log file, something that will continually grow, it might be easier to ingest it with Filebeat.
If you still want to use Python, it's perfectly fine, it just requires you to know that language compared to a more "configuration driven" approach when using Filebeat.
Can you share an example line or two of the data? Remove anything that is sensitive or change it enough to where you're comfortable sharing it. But it will help us understand how to approach ingestion.
If you simply have an error you want to get past, you can share that, too.
Try minimum_should_match = 3, see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-minimum-should-match.html#query-dsl-minimum-should-match
That's probably not proper json syntax :-)
Unfortunately every change is painful. Here's some background that might help and why we think this is still important to do:
With the new Java Client going stable, we will also update the documentation around the points above — sorry that we missed that so far!
There's no requirement that security needs to be enabled for monitoring to work. When you have a basic license the security features are disabled by default:
You only enable security by manually adding the following to elasticsearch.yml:
>xpack.security.enabled: true
So as far as I know, enabling monitoring from Kibana shouldn't also enable security. If you're at all concerned about it, you can manually enable monitoring by doing:
PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.enabled": true
  }
}
There's also some docs that talk about how to configure a two-node cluster here:
Beats have processors which allow them to do some parsing, including a script processor. But in general, Logstash or Ingest Pipelines are preferred, since deploying a parser with them means all ingest goes through it. If you parse using Beats, you would need to update every Beat's config in order for all ingest to be run through the parser.
You can customize the stop-token filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html
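A minimal sketch of a custom stop filter wired into an analyzer (index, analyzer, and filter names plus the stopword list are placeholders):

PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["and", "or", "the"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop"]
        }
      }
    }
  }
}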
Fuzziness AUTO only allows up to 2 edits/typos for terms longer than 5 chars; for terms of 3-5 chars it only allows 1 edit, which is why "piet" doesn't match "pietje" (that requires two edits, adding j and e). It is not suitable for autocomplete.
Instead, consider using a suggester with max_edits.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html
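For example, a hedged sketch with the term suggester (index and field names are assumptions):

GET /my-index/_search
{
  "suggest": {
    "name_suggestion": {
      "text": "piet",
      "term": {
        "field": "name",
        "max_edits": 2
      }
    }
  }
}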
You have to do it on the Elasticsearch cluster because it’s an ingest pipeline.
Just follow that and you’re good to go. After adding the pipeline make sure to update your filebeat YAML as it shows.
The biggest issue here is that an update basically deletes the old document and creates a new one. These are cleaned out with merges, which can be I/O intensive. I'd suggest you use dedicated SSD-type storage for this use case to handle that.
The reason I suggested more nodes, irrespective of cache hits: that recommendation was not meant as a blanket guarantee it would perform better, but more nodes might allow the cluster to absorb a system fault better.
There's more to optimizing this than just the shards, and having all the load (including indexing, ingestion, coordinating function, and perhaps master roles) go through a single node (even with a tiebreaker, that just allows a vote, so that node could be performing master role function) can cause a cascade failure.
Off is, unfortunately, also a speed (Applied Blender Logic) and the hit to having things fail can be worse than some small penalties to latency to gain that functionality.
That may not be the best case for your specific issue, but it was a concern I had and wanted to make sure it was brought up.
What you can also do to help some of this, particularly if you are aggregation heavy, is to set eager global ordinals (see the sketch after the reference link below), and perhaps also do some routing/sorting at the shard level to help.
Use rounded date/time
Pre load some stuff, maybe?
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html - for reference
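For the eager global ordinals suggestion above, a minimal sketch (index and field names are placeholders; eager_global_ordinals is an updatable mapping parameter on keyword fields):

PUT /my-index/_mapping
{
  "properties": {
    "category": {
      "type": "keyword",
      "eager_global_ordinals": true
    }
  }
}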
from: https://www.elastic.co/guide/en/elasticsearch/reference/current/transform-usage.html
You might want to consider using transforms instead of aggregations when:
Might not be exactly what you're asking for, but if you're dealing with file paths then see if the Path Hierarchy Tokenizer might help you match more easily, especially the detailed examples: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html#analysis-pathhierarchy-tokenizer-detailed-examples
I work for Elastic:
https://www.elastic.co/what-is/opensearch
Elasticsearch was not renamed to Opensearch, as an FYI. This is the result of a forked codebase and is not the same as Elasticsearch, and the divergence will only grow over time.
They should be considered different products with a similar point of origin.
Here's a curl example to import a Saved Object:
TL;DR
curl -X POST "localhost:5601/api/saved_objects/_import?createNewCopies=true" -H "kbn-xsrf: true" --form file=@export.ndjson
Update the endpoint and add auth.
You should check out the new Elastic Agent and Fleet. With Fleet, you can create a policy for each of your different use cases (domain controller, file server, etc). Fleet offers central management, so if you need to change something later, you can edit the data collected through the UI and it will push the new config out to beats through the Elastic Agent.
https://www.elastic.co/guide/en/fleet/current/fleet-overview.html
Overall offers a much easier way to manage multiple beats rather than running them individually.
OK, that makes sense!
What about https://www.elastic.co/guide/en/elasticsearch/reference/7.14/query-dsl-range-query.html#ranges-on-dates in the query?
If by "not long ago" you mean 2 years :p (https://www.elastic.co/blog/security-for-elasticsearch-is-now-free)
but yes, this is a tension that we are well aware of
I would start with a simple Dashboard[1], then if you want something custom, advance to Canvas[2].
[2] https://www.elastic.co/blog/visualizing-security-data-canvas
[I don't work for Elastic]
But this, and search templates https://www.elastic.co/guide/en/elasticsearch/reference/current/search-template.html are a perfect fit here...
Post the payload and the _settings and _mapping.
Also set dynamic appropriately during dev. You can change it for UAT and PROD.
https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic.html#dynamic-parameters
If they're searching on the same document set and it makes sense in your application for a single result set, you could combine them with Query DSL and a compound bool query using an array of should clauses, which is equivalent to an "OR" condition. For example:
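Something like this (index, fields, and query terms are placeholders):

GET /my-index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "elasticsearch" } },
        { "match": { "tags": "search" } }
      ],
      "minimum_should_match": 1
    }
  }
}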
Well, the link you posted is relevant for v2 of Elasticsearch. I think today (in v7) the feature you are looking for is document routing. This blog explains some details.
In general, document routing means putting everything that belongs to one user into one shard, making it much faster to search for data that belongs only to that user.
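Roughly, it's just a routing parameter on both the write and the search (index, ID, field, and user value are made up):

# Index with a routing value so this user's docs land on one shard
PUT /my-index/_doc/1?routing=user-42
{ "user_id": "user-42", "message": "hello" }

# Search with the same routing value to query only that shard
GET /my-index/_search?routing=user-42
{
  "query": { "term": { "user_id": "user-42" } }
}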
I haven't used Dejavu but this sounds like the issue you're hitting:
https://github.com/appbaseio/dejavu/issues/409
AWS ES instances require AWSv4 signing, and it doesn't sound like Dejavu supports it natively yet.
Elastic.co (the creators of the ELK stack) also provide hosted Elasticsearch clusters in AWS. They have basic auth over HTTPS which Dejavu supports, if you want to give that a try.
[I work for Elastic]
You are upgrading from 7.12.x to 7.13.y? What are x and y?
Are you doing a rolling upgrade, following these steps:
https://www.elastic.co/guide/en/elasticsearch/reference/current/rolling-upgrades.html
Is there anything interesting in the logs after each step?
Is each document in Elasticsearch stored with a Store ID (e.g., store_id)? If so, it's very easy for Elasticsearch to limit results to just a single store. You would use a bool query with a filter[1]. Your SaaS back-end would take a user's credentials, grab the Store ID they belong to, and then add it as a filter to any of their Elasticsearch queries. That would scope all their queries to only return results from their store.
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
The biggest thing I had to learn coming from a Splunk background is that what you used to be able to do from a single query is now two separate functions. You’re going to want to look at creating a visualization of a table: https://www.elastic.co/guide/en/kibana/current/get-started.html
Using the data table, you can create a table of the count by whatever criteria you want.
Firstly, if this is for an enterprise, you don't want "an Elasticsearch node"; you want a cluster. That cluster should be, at a minimum, 3 nodes.
For storage, you would want the fastest available, assuming you have no hot/warm/cold/frozen tiering (fast/slower/slow/offline-but-searchable)
You are welcome to use that, or use our Docker containers, or even ECK (Elastic Cloud on Kubernetes) if you would rather, to set some of this up.
https://www.elastic.co/training/free may be a good start.
As to the storage side, however: if you are not sure, you can either, as I said, swap out VMs (by using an exclude to remove all the shard data from a node, shutting it down when it is empty, and replacing it with a new node with more storage), or you can use something like LVM to increase the size of the volume.
Just realize, LVM has some overhead involved.
Elasticsearch can grow from a single node to dozens or even hundreds of nodes, and then have the roles they do broken out so that specific nodes handle only incoming requests, or managing the cluster, or only a specific type of data, but that can come over time.
The best part is, this is not something you HAVE to decide right now. You can make changes to a cluster, though, some come with downtime and some are easy enough to fix while it is up and running with nobody being the wiser.
[I work at Elastic]
Elasticsearch has its own built in certificate utilities (and it's good practice to use SSL/TLS either way).
https://www.elastic.co/guide/en/elasticsearch/reference/current/certutil.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.13/configuring-stack-security.html
I'm not sure I follow about "Where the storage is."
I would not recommend network storage for elasticsearch. Elasticsearch is an I/O intensive search application and trying to force it to use network storage is akin to getting a fancy new sports car and buying the most inexpensive tires you can acquire. It won't behave how you would expect.
Gauging storage is going to depend on your use case. Realize, Elasticsearch is a clustered application that scales horizontally. You can add nodes, replace nodes, etc if you need to.
e.g.:
filter {
  translate {
    field => "[my_ip]"
    destination => "[my_ip_to_dns]"
    dictionary => {
      "192.168.0.1" => "hosta"
      "192.168.0.2" => "hostb"
      "192.168.0.3" => "hostc"
      "192.168.0.4" => "hostd"
    }
  }
}
or store them in a file - https://www.elastic.co/guide/en/logstash/current/plugins-filters-translate.html#plugins-filters-translate-dictionary_path
RE: How Elasticsearch is better than a LIKE query:
Elasticsearch does text analysis and is highly customisable and flexible for that purpose.
Elasticsearch text analysers can break apart pieces of text so you have much greater control over the recall and precision of your queries.
You can do things like stemming, so management, managing, and managed all resolve to the same root word and will all match a query for manage.
Proximity-based searches are another great feature: find "my" and "pants" within 1 word of each other, and so on. For example:
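A hedged sketch using match_phrase with slop (index and field names are placeholders):

GET /my-index/_search
{
  "query": {
    "match_phrase": {
      "body": {
        "query": "my pants",
        "slop": 1
      }
    }
  }
}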
There are many, many ways to configure the text analysis for your needs, and lots of different query types to use.
Have a look here and here for more information about what you can do.
I hope this helps.
You're trying to use NDJSON, but that format is only allowed for bulk requests. Also, the format is incomplete. Please check https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
It's a keyword as an inner field, so the field to search on would actually be "emailId.keyword" in this scenario. OP could change the query to search that field and it should work.
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html#multi-fields