Relational databases have dominated the industry for the past few decades. They store data in tables of rows and columns, and logical connections between data entities are established by linking these tables. Relational databases use SQL, the conventional language for storing and retrieving data. MS SQL, Oracle, MySQL and Postgres are some popular relational database systems. These traditional databases have been a popular and natural choice for storing ‘related’ data, because they fit well with modeling real-world business entities (e.g., customers, orders, products) and the business relationships that exist between them. But the principle that no one size fits all applies to the world of databases as well. Relational databases may not be the ideal choice for huge quantities of data: searches become slow, scaling becomes a big problem and, most importantly, not all types of data can be stored in the form of rows and columns.
Polyglot Persistence and No-SQL Databases
Modern applications increasingly follow ‘Polyglot Persistence’, which means using different database technologies for different data storage needs, depending on the type, structure and quantity of the data, ease of retrieval, and business use cases. No-SQL databases are very popular in this architecture because they give the flexibility to choose an appropriate database based on the application's needs.
- No-SQL means ‘non-SQL’: databases that follow a non-traditional mechanism for storing and retrieving data.
- No-SQL databases do not rely on tables and SQL, which can give high operational speed and flexibility.
- Each No-SQL database has its own way of storing and retrieving data. Redis (key-value storage), MongoDB (document-based storage), Apache Cassandra (wide-column database) and Elasticsearch (also document-based) are a few popular No-SQL databases.
What is Elasticsearch?
Elasticsearch (ES) is one of the most popular No-SQL, document-based, distributed databases and search engines. Let's split this sentence and see what each term signifies.
1. No-SQL: We have already covered this in detail above.
2. Document database: a No-SQL database that stores data in the form of documents. Documents here means JSON documents.
- Since data is stored as JSON, you can easily maintain the structure and retain the hierarchy of complex data.
- Another advantage of storing data as JSON is that a single document can have properties of various datatypes.
- Below is a sample JSON document from ES showing a person’s details:
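The original figure is not reproduced here. As a stand-in, the sketch below builds a document of the kind described as a Python dictionary and serializes it to JSON. Every field name and value is hypothetical and chosen only to show the different datatypes side by side.

```python
import json

# Hypothetical person document; all field names and values are illustrative.
person = {
    "name": "Jane Doe",               # string
    "age": 34,                        # number
    "is_active": True,                # Boolean
    "skills": ["python", "search"],   # array
    "address": {                      # nested object
        "city": "Pune",
        "country": "India",
    },
}

# Serialize to the JSON text that would be sent to ES as the document body.
person_json = json.dumps(person, indent=2)
print(person_json)
```

The nested `address` object shows how a hierarchy can be kept inside a single document instead of being split across linked tables.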
This ES JSON document has properties of multiple datatypes (string, Boolean, number, array, object), all in one document. One such document maps to a record/row of a relational database. These JSON documents are part of an index in Elasticsearch. An index is simply a logical namespace for storing related documents together. The table below shows the mapping between relational and ES concepts.
Relational Database | Elasticsearch |
Table | Index |
Row | Document |
Column | Property/Field |
3. Distributed: ES has a distributed architecture, which means that data is physically stored on different machines that are part of the same cluster. We have a dedicated section to explore this part.
4. Search Engine: Quick search is the main USP (unique selling point) of Elasticsearch, which differentiates it from other No-SQL databases.
- Text searches on a large SQL database can easily take 10-20 seconds. Similar queries on a large Elasticsearch index can return results in milliseconds.
- Elasticsearch is a strong choice for search-based applications.
- Searching is so fast in Elasticsearch mainly because of how it stores data internally.
Why is it so fast?
We will now dive deep into how data is stored internally. The ‘inverted index’ is the underlying data structure in ES that makes data retrieval very fast. It is very similar to the index at the back of a book: an alphabetically sorted list of all the important words, along with the page numbers and chapters where each word appears. ES does the same; it stores data in the inverted index format.
Data (read: a document) fed into ES is first split into a set of tokens, also called words or terms. This process is called tokenization. The most common way of tokenizing is splitting on whitespace. Next, ES filters these tokens. Below are some standard filter operations:
- Removing stop words: stop words are very common words that carry little meaning for search, like ‘is’, ‘in’, ‘the’, etc.
- Lower-casing all the words.
- Handling synonyms: mapping words with the same meaning to a common term.
- Stemming: reducing a word to its root. For example, ‘Jumping’, ‘Jumped’ and ‘Jumps’ all become ‘jump’.
This whole process of Tokenization and Filtering is called Text Analysis. It is the process of converting unstructured text, like the body of an email or a product description, into a structured format that’s optimized for search.
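The tokenization and filtering steps can be sketched in a few lines of Python. This is only an illustration of the idea, not how ES analyzers are implemented: the stop-word list is a tiny made-up sample, `naive_stem` is a crude hypothetical stand-in for a real stemmer, and synonym handling is left out.

```python
# Tiny illustrative stop-word list; real analyzers ship much longer ones.
STOP_WORDS = {"is", "in", "the", "a", "an", "and", "of", "to"}


def naive_stem(token: str) -> str:
    """Strip a few common English suffixes (a crude stand-in for a stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> list[str]:
    """Run tokenization followed by the filter steps described above."""
    tokens = text.split()                                  # split on whitespace
    tokens = [t.lower() for t in tokens]                   # lower-casing
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    return [naive_stem(t) for t in tokens]                 # stemming


print(analyze("The fox is Jumping in the garden"))  # ['fox', 'jump', 'garden']
```

Splitting on whitespace leaves punctuation attached to tokens, which is one reason real analyzers use more careful tokenizers.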
ES has many built-in analyzers, and each analyzer has its own way of doing text analysis. The figure below gives an idea of how text analysis works in ES.
Text analysis in ES: document data stored in an inverted index.
The diagram above depicts how documents fed to ES are stored in the inverted index format. When a user searches for a word/term, ES looks it up in the sorted list of terms and returns the list of all documents in which that term occurs, without scanning the documents themselves. The lookup cost therefore depends on the terms rather than on the number of documents; O(1), also called constant time complexity, is the best-case complexity for finding a word in ES.
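To make the book-index analogy concrete, here is a minimal sketch of an inverted index in Python. It is a toy model under simplifying assumptions: the three documents are made up, analysis is reduced to lower-casing and whitespace splitting, and a plain Python dictionary stands in for the term dictionary that ES (through Lucene) actually maintains.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each lower-cased term to the sorted IDs of documents containing it."""
    postings: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Keep terms in alphabetical order, like the index at the back of a book.
    return {term: sorted(postings[term]) for term in sorted(postings)}


# Made-up sample documents keyed by document ID.
docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick dog",
}
index = build_inverted_index(docs)


def search(term: str) -> list[int]:
    """Look the term up directly; no document text is scanned at query time."""
    return index.get(term.lower(), [])


print(list(index))      # ['brown', 'dog', 'fox', 'lazy', 'quick']
print(search("brown"))  # [1, 2]
print(search("quick"))  # [1, 3]
print(search("cat"))    # []
```

The work of reading every document happens once, at indexing time; a search then only touches the entry for the requested term.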
Elasticsearch: Behind the scenes
Data in Elasticsearch is organized into indices. An index is like a ‘table’ in a relational database: a logical namespace for storing related documents together. An index is just a concept; behind the scenes it points to one or more ‘physical’ shards. A shard is a low-level worker unit that holds a slice of the data in an index. We do not query shards directly; instead, we interact with indices. Shards are containers of data, and all our documents are internally stored in shards. A shard can be either a primary shard or a replica shard (a secondary copy kept for backup). These shards are allocated to nodes in a cluster. A node is simply a machine or server that is part of a cluster, and a cluster is a group of connected nodes.
I know, all this talk of ‘index’, ‘shard’, ‘node’ and ‘cluster’ may feel overwhelming by now. The diagram below should make it easier:
- Each node has a unique identifier or name used for management purposes.
- ES stores data in a distributed manner, spread across different nodes.
- A user can send a request to any node, because every node in the cluster knows the location of all other nodes and which shards are stored on them.
- The node that receives a request runs a hashing algorithm on the document ID to determine which shard holds that document, and then re-routes the request to the correct node.
- Get/read requests are served in round-robin fashion across the copies of a shard to avoid overloading a single node.
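The routing and round-robin behaviour in the list above can be modelled roughly as follows. This is a sketch under stated assumptions, not ES internals: the shard count, node names and shard layout are hypothetical, and `zlib.crc32` is only a deterministic stand-in for whatever hash function ES applies to the document ID.

```python
import itertools
import zlib

NUM_PRIMARY_SHARDS = 3  # hypothetical value for this sketch

# Hypothetical layout: nodes holding a copy (primary or replica) of each shard.
SHARD_COPIES = {
    0: ["node-1", "node-2"],
    1: ["node-2", "node-3"],
    2: ["node-3", "node-1"],
}


def shard_for(doc_id: str) -> int:
    """Pick a shard from the document ID: hash(id) modulo the shard count."""
    # crc32 is a stand-in chosen for determinism; it is not what ES uses.
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_PRIMARY_SHARDS


# One round-robin iterator per shard, cycling over the nodes that hold a copy.
_read_cycles = {
    shard: itertools.cycle(nodes) for shard, nodes in SHARD_COPIES.items()
}


def node_for_read(doc_id: str) -> str:
    """Return the node that should serve the next read of this document."""
    return next(_read_cycles[shard_for(doc_id)])


shard = shard_for("user-42")
first = node_for_read("user-42")
second = node_for_read("user-42")
third = node_for_read("user-42")
print(shard, first, second, third)
```

Because the shard is derived from the document ID alone, any node can compute where a document lives without asking the others, and successive reads of the same document alternate between the nodes that hold a copy.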
Key features of ES
Below are some key features of ES:
- Fault tolerant: ES follows a distributed architecture, which means data is internally stored across multiple nodes in a cluster. A node can hold both primary and replica/secondary shards, and a replica is never placed on the same node as its primary. This distribution of data avoids a single point of failure and makes ES fault tolerant.
- Free & open source: Basic features are free. If you need advanced security or alerting features, you need to buy a commercial X-Pack subscription.
- RESTful API: Data is queried and inserted over a RESTful API, and query results are returned in JSON format, which makes them easy to work with. Any programming language that can send HTTP requests and handle JSON can work with ES.
- Easy to query: Elasticsearch is based on Apache Lucene. Lucene's query syntax is easy to learn, so even non-technical people can write common queries.
- Very fast search: Already covered above.
- Easy to set up: You can download the installer from the official website, or pull the image and launch it as a Docker container.
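As an illustration of the RESTful, JSON-based style mentioned in the feature list, the sketch below builds a full-text search request using only the Python standard library. The request is constructed but not sent, so it runs without a cluster. The address `http://localhost:9200`, the index name `people` and the field `name` are assumptions made for this example.

```python
import json
import urllib.request

# Assumptions: a local node on the default port and a hypothetical index.
ES_URL = "http://localhost:9200"
INDEX = "people"


def build_search_request(field: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a full-text search request for one field."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("name", "jane")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send it against a running cluster, something like this would be needed:
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
```

Since both the query and the response are plain JSON over HTTP, the same request could be issued from any language or from a command-line HTTP client.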
Let's wind up our first part here. In this post we have covered the theoretical concepts of ES: what Elasticsearch is, the inverted index (the core data structure of ES), why it is so fast, how data is stored behind the scenes on shards spread across different nodes in a cluster, and the key features of ES.
We will dive deep into the practical side of ES in the next part. We will see how to install it on your local machine, how to perform CRUD operations in ES, and the various ways of querying data in ES. We will also talk about Kibana and how to use Kibana's Dev Tools console for interacting with ES data.
Watch this space for more. Happy learning!!