I’ve always wondered how big companies like Facebook process their information, or how companies like Amazon can run searches in extremely short times. This is why I want to tell you a bit about my experience with two powerful tools they use: Apache Hive and Elasticsearch.
What are Apache Hive and Elasticsearch?
Apache Hive is a data warehouse technology that is part of the Hadoop ecosystem and closely related to Big Data. It allows you to query and analyze large amounts of data stored in HDFS (Hadoop Distributed File System). It has its own query language called HiveQL or HQL (Hive Query Language), a variant of SQL, with the difference that Hive converts these queries into MapReduce jobs.
On the other hand, Elasticsearch is a RESTful search engine based on Lucene, a high-performance text search library, which in turn is based on inverted indices. Elasticsearch can search quickly because, instead of scanning the text directly, it looks terms up in an index. An analogy would be finding a chapter of a book by reading the index instead of reading page by page until you find what you are looking for.
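To make the book-index analogy concrete, here is a minimal sketch of an inverted index in Python. It is only an illustration of the idea, not how Lucene is implemented: the sample documents and the function names build_inverted_index and search are made up for this post, and real engines also handle tokenization, stemming and scoring.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return the ids of the documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents, invented for this illustration.
docs = {
    1: "Hive queries large datasets",
    2: "Elasticsearch searches text quickly",
    3: "Hive and Elasticsearch handle large datasets",
}
index = build_inverted_index(docs)
print(search(index, "large datasets"))
print(search(index, "elasticsearch"))
```

The search never reads the document texts again: it only looks up each term in the index and intersects the resulting id lists, which is why the lookup stays fast as the texts grow.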
Getting Started
Please be patient, there are a few things you should know so that the examples later on make sense.
The base of Apache Hive is the MapReduce framework, a programming model built around two functions, map() and reduce(), which in Hadoop are typically written in Java.
The map() function receives a key and a value as parameters and, for each call, returns a list of intermediate key-value pairs. These pairs are then grouped by key, and the reduce() function is applied to each group, in parallel across groups. It receives a key together with the list of values for that key and merges them into a smaller list.
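A classic way to picture this is a word count. The sketch below imitates the map, group and reduce phases in plain Python on a single machine. It is a toy model of the idea, not Hadoop's API: the names map_fn, shuffle, reduce_fn and run_job and the sample lines are assumptions made for this illustration.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: line number (unused here), value: line text. Emits (word, 1) pairs."""
    return [(word, 1) for word in value.lower().split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Merge the list of values for one key into a single, smaller result."""
    return key, sum(values)


def run_job(lines):
    mapped = []
    for line_no, line in enumerate(lines):
        mapped.extend(map_fn(line_no, line))
    grouped = shuffle(mapped)
    # In a real cluster, each group could be reduced on a different machine.
    return dict(reduce_fn(key, values) for key, values in grouped.items())


# Hypothetical input lines, invented for this illustration.
lines = ["hive runs on hadoop", "hadoop stores data", "hive queries data"]
counts = run_job(lines)
print(counts)
```

Each call to reduce_fn only needs the values for its own key, which is what makes it possible to run the groups in parallel.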
Basic Concepts of Elasticsearch
Here are some basic concepts of Elasticsearch:
- A document is the minimum unit of information, made up of keys and values (hmm... where have I seen this before?)
- An index orders this information and groups documents with similar characteristics (for example, a list of people with name, last name and age)
- Nodes are the computers where Elasticsearch runs the searches and saves the information
- A cluster is a collection of nodes
- An index can be divided into parts, and these chunks of an index are called shards
- And last but not least, replicas are copies of shards, much like backups of the information
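To see how documents could end up spread across shards, here is a small sketch that assigns each document id to a shard by hashing it. This is a simplification under stated assumptions: Elasticsearch uses its own hash function and routing formula, zlib.crc32 is only a stand-in, and the names pick_shard and distribute and the sample ids are made up for this example.

```python
import zlib


def pick_shard(doc_id, number_of_shards):
    """Choose a shard for a document id. crc32 is a stand-in for the real hash."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


def distribute(doc_ids, number_of_shards):
    """Return a dict mapping each shard number to the document ids stored in it."""
    shards = {n: [] for n in range(number_of_shards)}
    for doc_id in doc_ids:
        shards[pick_shard(doc_id, number_of_shards)].append(doc_id)
    return shards


# Hypothetical document ids, invented for this illustration.
doc_ids = [f"user-{n}" for n in range(10)]
shards = distribute(doc_ids, number_of_shards=2)
for shard, ids in shards.items():
    print(f"shard {shard}: {ids}")
```

Because the same id always hashes to the same shard, a lookup by id knows which shard to ask without searching all of them.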
Examples Using Apache Hive and Elasticsearch
To clarify the concepts even further, here are some examples that can give you some context.
Managing an External Database with Apache Hive
Let's start with Apache Hive.
In the example below, you can see how to use an external table, for which Hive doesn’t manage storage, to make data that lives outside Hive (here, an Accumulo table) available to Hive queries. Pretty familiar, isn't it? Right from the start you can see how similar it is to SQL.
CREATE EXTERNAL TABLE accumulo_ck_2(key string, value string)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES (
"accumulo.table.name" = "accumulo_custom",
"accumulo.columns.mapping" = ":rowid,cf:string");
STORED BY names the external storage handler, and SERDEPROPERTIES is a Hive feature that passes information as <key, value> pairs to the specific SerDe interface.
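One way to read the "accumulo.columns.mapping" property is as a list with one entry per Hive column, in order. The Python sketch below pairs the two under that assumption, treating ":rowid" as the Accumulo row id and any other entry as family:qualifier. It is my own illustration of the pairing, not code from Hive, and the function name pair_columns is made up.

```python
def pair_columns(hive_columns, mapping):
    """Pair each Hive column with its entry in a comma-separated column mapping."""
    entries = mapping.split(",")
    if len(entries) != len(hive_columns):
        raise ValueError("one mapping entry is expected per Hive column")
    paired = {}
    for column, entry in zip(hive_columns, entries):
        if entry == ":rowid":
            # Assumption: this special entry refers to the Accumulo row id.
            paired[column] = {"kind": "rowid"}
        else:
            # Assumption: other entries have the form family:qualifier.
            family, _, qualifier = entry.partition(":")
            paired[column] = {"kind": "column", "family": family, "qualifier": qualifier}
    return paired


# Column names and mapping string taken from the CREATE EXTERNAL TABLE example.
result = pair_columns(["key", "value"], ":rowid,cf:string")
for column, target in result.items():
    print(column, "->", target)
```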
Dividing an Index with Elasticsearch
Now for the Elasticsearch example. The PUT request below creates an index called another_user, in which JSON documents can then be indexed and made searchable. By default (since Elasticsearch 7.0), a new index has a single shard and one replica, but in the example below we change the number of shards from 1 to 2, with one replica, in the settings.
We do this to be prepared for a heavy load of information to index: the work is delegated in smaller pieces to make the searches faster and more efficient. Divide and conquer, as they say.
curl -XPUT 'http://localhost:9200/another_user?pretty' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.number_of_shards": 2,
    "index.number_of_replicas": 1
  }
}'
https://github.com/elastic/elasticsearch/blob/master/README.asciidoc
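If you prefer a script to curl, the same request can be prepared with Python's standard library. This is a sketch, assuming a node listening on localhost:9200 as in the example above. The helper name build_create_index_request is made up, and the line that actually sends the request is left commented out so the snippet runs without a server.

```python
import json
import urllib.request


def build_create_index_request(base_url, index_name, shards, replicas):
    """Prepare (but do not send) a PUT request that creates an index with settings."""
    body = {
        "settings": {
            "index.number_of_shards": shards,
            "index.number_of_replicas": replicas,
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index_name}?pretty",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_index_request("http://localhost:9200", "another_user", 2, 1)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send it, a running Elasticsearch node is required:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the body with json.dumps also avoids quoting mistakes that are easy to make when typing JSON by hand inside a shell command.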
My Experience with the Environments
Installing Apache Hive
Regarding the installation of the Hive environment, you need to install Hadoop before Hive, and both are mainly aimed at Linux operating systems. Speaking from personal experience as a Windows user, I would dare to say that it was especially difficult.
First I tried to do it on Windows, with the help of Apache Derby to create the metastore and Cygwin, a tool to execute Linux commands, but it did not work as I expected. The little information I found on the internet about Hadoop and Hive on Windows was definitely a great impediment to continuing that way. In the end I had to find an alternative. I'm not saying it's impossible, but it was getting increasingly difficult for me and it was taking a long time.
My second option, which turned out to be a success, was installing an Ubuntu virtual machine. I only had to carefully run some commands, and it was ready to use. Of course I had to install Hadoop first and then Hive, just as I had tried on Windows. There are four important XML files in the Hadoop folder to which you need to add some tags to make it work, and that is all. I would say that Hive is one of those platforms where you spend more time making the installation work than actually making the code work.
Installing Elasticsearch
Installing Elastic's environment was much easier than Hive's, and not only for Elasticsearch but also for Kibana and cerebro, two tools that work with Elasticsearch. Kibana lets you search and visualize the data indexed in Elasticsearch. I literally typed some commands into Kibana’s console and they modified the Elasticsearch information automatically. Cerebro is a tool that makes the actions performed more visual.
I just had to download some zips, unzip them, run some files, and open the respective ports (9200 for Elasticsearch, 5601 for Kibana and 9000 for cerebro) in my browser, et voilà! I could already run my first tests. This is what cerebro looks like: you can see that there is one node (the computer that is being used), 6 indices, 9 shards and 51 documents.
In the previous image, the yellow line is not decoration: it indicates the state of the Elasticsearch cluster. It works like a traffic light that tells us whether things are set up correctly or not.
For instance, green means that everything is fine: all the shards and their replicas have been assigned to nodes of the cluster. Yellow is a warning that one or more replicas couldn’t be assigned to any node, and red means that one or more primary shards haven’t been assigned to any node.
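The traffic-light rule can be written down in a few lines. The sketch below derives the color from a list of shard copies, following the description above. The data layout (dicts with "primary" and "assigned" flags) and the function name cluster_health are assumptions for illustration, not the shape of any real Elasticsearch response.

```python
def cluster_health(shards):
    """Return 'green', 'yellow' or 'red' from a list of shard copies.

    Each shard copy is a dict with two booleans: 'primary' and 'assigned'.
    """
    if any(s["primary"] and not s["assigned"] for s in shards):
        return "red"  # at least one primary shard has no node
    if any(not s["primary"] and not s["assigned"] for s in shards):
        return "yellow"  # all primaries are fine, but a replica has no node
    return "green"  # every primary and every replica is assigned


# Hypothetical single-node setup: the primary is assigned, but its replica is not.
single_node = [
    {"primary": True, "assigned": True},
    {"primary": False, "assigned": False},
]
print(cluster_health(single_node))
```

A setup like this hypothetical one, with an assigned primary and an unassigned replica, comes out yellow under the rule described above.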
Difference Between Hadoop and Elasticsearch
As I mentioned before, Hadoop is the ecosystem on which Hive runs. It’s easy to confuse what Hadoop and Elasticsearch do, since to a certain extent they are very similar when working with large amounts of data.
Don’t worry, here are some of the main differences you will find:
- Hadoop processes data in parallel with the help of its distributed file system, while Elasticsearch is only a search engine.
- Hadoop is based on the MapReduce framework, which is comparatively more complex to understand and implement than Elasticsearch, whose queries use a JSON-based domain-specific language.
- Hadoop is for batch processing, while Elasticsearch serves queries and results in real time.
- Setting up Hadoop clusters is smoother than setting up Elasticsearch clusters, which is more error-prone.
- Hadoop doesn’t have such advanced searching and analytical search capabilities as Elasticsearch does.
So it can be deduced that some of the deficiencies of one are the strengths of the other. This is why there is now something called Elasticsearch-Hadoop, which combines the best of both: basically, Hadoop's large storage capacity and powerful data processing with Elasticsearch's analytics and real-time searches.
Only with tools like Hive was it possible for big companies like Facebook or Netflix to start handling large amounts of information. This can no longer be done with traditional methods. Facebook alone has 2,449 million active users around the world (can you believe it?).
Even Amazon now maintains an Apache Hive fork that is included in Amazon Elastic MapReduce on Amazon Web Services. The Elasticsearch user list includes companies such as GitHub, Mozilla and Etsy, just to name a few.
Finally...
Now you know how to identify which of these technologies fits your expectations and serves your purpose. While one is for storing and analyzing large amounts of data, the other is for efficient and fast searching.
We cannot forget that the use of Big Data by companies is increasingly significant, and it is important to be able to manage all this information. Hive and Elasticsearch are just some of the tools these companies use, so let's take advantage of the fact that they are open source.
Consider what these tools imply when you want to use them, and don’t be afraid to experiment. In the end, whether the results are negative or positive, you will end up with more knowledge than when you started.
I really hope you enjoyed reading this post.