Preprocessing of Big Data

The need to derive meaningful insights from customer and business data has led to massive growth in the volume of data collected in recent years. Deriving insights from data is a key driver in the evolution of Big Data technology. Big Data can be defined as the accumulation of data of such volume, velocity, and variety that it cannot be directly inspected or analyzed by humans and is too large and complex to be processed by traditional database management tools.
Collecting and managing data is a challenging and time-consuming task. Not only does raw data lack consistency, but the data collection process also tends to be sub-optimal for the purposes of analytics. Preprocessing raw data is therefore a necessary first step before any further data processing and analysis.
So, what is data preprocessing? The methods and techniques used to prepare data and discover knowledge from it before the data mining process are collectively termed data preprocessing. Because raw data is most likely imperfect, inconsistent, and sometimes redundant, it cannot be used directly in the data mining process. The data preprocessing stage lets us adapt the data to the requirements of each data mining algorithm.


What are the general steps in pre-processing?

  • Noise Identification: Identifying and removing the meaningless or spurious information present in the data. Noise can also take the form of random fluctuations in data values that degrade analytic predictions.
  • Data Cleaning: Detecting and correcting or removing corrupt or inaccurate records in the raw data, for example by fixing typographical errors and validating values against a known list of entities. Noisy data is also smoothed out or removed at this stage.
  • Data Normalization: Removing skew in the data by transforming all variables to a specific range for better analytic processing (see the min-max sketch after this list).
  • Data Transformation: Converting data from one format or structure to another that is suitable for integration, warehousing, and wrangling.
  • Data Integration: Combining data from multiple sources into a unified view. As the volume of data grows and more machine learning algorithms are applied, data integration becomes increasingly critical.
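
To make these steps concrete, here is a minimal, illustrative Python (pandas) sketch of data cleaning and min-max normalization. The column names and example values are hypothetical, not drawn from any real dataset.

```python
# Illustrative cleaning and min-max normalization with pandas.
# The columns ("age", "income") and values are hypothetical.
import pandas as pd

raw = pd.DataFrame({
    "age":    [25, 32, None, 47, 25],             # one missing value
    "income": [48000, 54000, 61000, -1, 48000],   # -1 is an invalid sentinel
})

# Data cleaning: drop duplicate records, invalid values, and missing values
clean = raw.drop_duplicates()
clean = clean[clean["income"] > 0].dropna()

# Data normalization: rescale each variable to the [0, 1] range (min-max)
normalized = (clean - clean.min()) / (clean.max() - clean.min())
print(normalized)
```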

What are the types of frameworks and platforms available?
Many large-scale processing platforms are built on distributed technologies that hide the technical complexity of managing distributed data, merging aggregated data from different sources, and similar tasks behind a unified system that software developers and data scientists can use. Despite this advantage, such platforms still demand complex algorithms and careful deployment decisions. Big Data platforms also require additional algorithms to support related tasks, such as big data preprocessing and analytics, and the standard algorithms for these tasks often have to be redesigned (sometimes entirely) in order to learn from large-scale datasets.
The first framework that enabled the processing of large-scale datasets was MapReduce, introduced around 2003. This revolutionary tool was designed to process and generate huge datasets automatically and in a distributed way. Technical nuances such as failure recovery, data partitioning, and job communication are handled by the framework; the user only implements two primitives, Map and Reduce, and gets a scalable, distributed tool without worrying about the underlying complexity.
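
The following is a minimal sketch of the Map/Reduce programming model, simulated locally in plain Python. In a real MapReduce framework the grouping (shuffle) step and fault tolerance shown explicitly here are handled by the runtime.

```python
# Word count expressed as Map and Reduce primitives, simulated locally.
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document
    for word in document.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Reduce: aggregate all counts emitted for the same key
    return (word, sum(counts))

documents = ["big data needs preprocessing", "big data needs scale"]

# Shuffle: group intermediate pairs by key (done by the framework in practice)
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

word_counts = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(word_counts)   # e.g. [('big', 2), ('data', 2), ('needs', 2), ...]
```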
Apache Hadoop emerged as the most popular open-source platform implementing MapReduce, with all the features mentioned previously: fault tolerance, failure recovery, data partitioning, inter-job communication, and so on. However, MapReduce (and Hadoop) do not scale well for streaming data generated by iterative and online processes, such as those used in machine learning and analytics.
Apache Spark was created as an alternative to Hadoop, capable of performing distributed computing tasks faster by using in-memory techniques. Spark overcame MapReduce's problems with iterative, online processing and its frequent disk I/O: it loads data into memory and reuses it repeatedly.
Spark is a general-purpose framework, which allows several distributed programming models (such as Pregel or HaLoop) to be implemented on top of it. Spark is built on an abstraction called Resilient Distributed Datasets (RDDs), a versatile model that, among other features, lets the user control the persistence and partitioning of data.
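
The snippet below is a small PySpark sketch of working with RDDs, assuming pyspark is installed and run against a local master; the transformations are illustrative only, but they show the lazy evaluation and in-memory persistence that make Spark well suited to iterative workloads.

```python
# Minimal RDD sketch, assuming a local PySpark installation.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

# Create an RDD from an in-memory collection; Spark partitions it automatically
numbers = sc.parallelize(range(1, 1_000_001), numSlices=8)

# Transformations are lazy; caching keeps the data in memory for reuse,
# which is what makes iterative algorithms much faster than on MapReduce
squares = numbers.map(lambda x: x * x).cache()

total = squares.reduce(lambda a, b: a + b)             # first action materializes the RDD
evens = squares.filter(lambda x: x % 2 == 0).count()   # reuses the cached data

print(total, evens)
sc.stop()
```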
Some competitors to Apache Spark have emerged lately, especially for data stream processing. Apache Storm is a prime candidate: an open-source distributed real-time processing platform capable of processing millions of tuples per second per node in a fault-tolerant way. Apache Flink is a more recent top-level Apache project designed for distributed stream and batch data processing. Spark employs a mini-batch approach to stream processing rather than true online, record-at-a-time processing; Storm and Flink fill this gap.
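
As a rough illustration of Spark's mini-batch model, the following sketch uses PySpark's (legacy) DStream API. It assumes pyspark is installed and that a text source is available on localhost:9999 (for example, one started with `nc -lk 9999`). Records are grouped into five-second micro-batches rather than processed one record at a time, which is precisely where Storm and Flink take a different approach.

```python
# Mini-batch streaming sketch with PySpark's DStream API.
# Assumes a text source is listening on localhost:9999.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "mini-batch-sketch")
ssc = StreamingContext(sc, batchDuration=5)   # one micro-batch every 5 seconds

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()   # prints the word counts of each 5-second batch

ssc.start()
ssc.awaitTermination()
```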
Closing thoughts
For any of the frameworks discussed above, the performance and quality of the insights extracted through data mining depend not only on the design and performance of the method used, but also on the quality and suitability of the dataset. Factors such as noise, missing values, inconsistent or redundant data, and the dataset's size in both examples and features strongly influence how much can be learned from it. Low-quality data leads to low-quality knowledge. Data preprocessing is therefore an essential stage: it ensures that the final datasets are accurate and fit for use in data mining algorithms.
