As a machine learning engineer, I have to stay current on the languages used to build ML models. I came across Julia through peer discussions and reading. In this post, I share a brief comparison of Julia and Python for implementing machine learning models.
Julia and Python
Julia is a high-level, high-performance, dynamic programming language. While it is a general-purpose language and can be used to write any application, Julia shines in numerical analysis and computational science.
Python is a powerful general-purpose programming language. Developers use Python for web development, data science, software prototyping, and similar purposes. With its easy-to-learn syntax, Python is a popular first language for beginners.
Several popular products are built with Python, including Dropbox, Spotify, Instagram, Reddit, and Uber.
Comparison use case: APS Failure at Scania Trucks Data Set
Business Problem
The air pressure system (APS) plays a critical role in heavy Scania trucks. APS generates pressurized air that is used in various critical functions such as braking and gear changing.
Accurate prediction of the failure status of APS based on the measurements of truck mechanical system attributes can significantly reduce the operational cost of the truck fleet.
Data Set Description
The dataset consists of sensor measurement data from Scania trucks in everyday use. The system in focus is the air pressure system (APS), which generates pressurized air used in functions such as braking and gear changes.
The positive class consists of failures of a specific component of the APS. The negative class consists of trucks with failures of components not related to the APS. The dataset is a subset of all available data, selected by experts.
The training set contains 60,000 examples in total, of which 59,000 are negative and 1,000 are positive. The test set contains 16,000 examples. Each record has 171 attributes.
For proprietary reasons, the attribute names in the data have been anonymized. The attributes consist of single numerical counters and histograms, where each histogram is made up of bins with different conditions. Usually, both ends of a histogram have open-ended conditions. For example, if we measure the ambient temperature “T”, the histogram could be defined with four bins, each covering a range of temperatures. The attributes are the class label followed by the anonymized operational data. The operational data have an identifier and a bin id, such as “Identifier_Bin”. Of the 171 attributes, 7 are histogram variables. Missing values are denoted by “na”.
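To make the “na” convention and the “Identifier_Bin” naming concrete, here is a minimal Python sketch using pandas. The three-row CSV and its column names are made up for illustration and only mimic the pattern described above; they are not rows from the actual files.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the APS data. The rows and column names
# are illustrative assumptions that follow the "Identifier_Bin" pattern.
sample_csv = io.StringIO(
    "class,aa_000,ag_000,ag_001\n"
    "neg,76698,na,0\n"
    "pos,na,12,na\n"
    "neg,41040,0,4\n"
)

# Lowercase "na" is not one of pandas' default missing-value markers,
# so it has to be listed explicitly.
df = pd.read_csv(sample_csv, na_values=["na"])

# Fraction of missing values per attribute (label column excluded).
missing_fraction = df.drop(columns="class").isna().mean()

# Split each "Identifier_Bin" name into (identifier, bin id).
parts = {col: tuple(col.split("_", 1)) for col in df.columns if col != "class"}

print(missing_fraction)
print(parts)
```

Checking the per-attribute missing fraction first helps decide how much imputation a given attribute will need.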
The goal of the model is to accurately predict APS failure.
Python vs Julia: Implementation details and results
This is a binary classification problem. In machine learning, classification is the task of assigning class labels to examples from the problem domain.
Here we classify whether the Air Pressure System has failed or not.
Some algorithms are designed specifically for binary classification and do not natively support more than two classes. Logistic regression and support vector machines are examples of such algorithms. The dataset contains up to 82% missing values per attribute, and many of the attributes contain outliers. To handle the missing values, we replace them with the mean of the attribute.
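Mean imputation as described above can be sketched with scikit-learn's SimpleImputer. The small arrays are illustrative assumptions rather than APS data, and this is one possible way to do it, not necessarily the code behind the results in this post.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrices with missing entries (np.nan); not APS data.
X_train = np.array([[1.0, np.nan],
                    [3.0, 10.0],
                    [np.nan, 30.0]])
X_test = np.array([[np.nan, np.nan]])

# Learn each column's mean from the observed training values only.
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)

# Reuse the training means for the test set so that no information
# from the test data leaks into preprocessing.
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)
print(X_train_filled)
print(X_test_filled)
```

Because the mean is sensitive to outliers, passing strategy="median" to the same class is a common alternative to compare against.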
Python Implementation
To fit a logistic regression model on the dataset, we tune its hyperparameters with grid search. Grid search systematically builds and evaluates a model for each combination of the parameter values specified in the grid.
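As a rough illustration of the technique, here is a minimal grid search sketch with scikit-learn. It runs on synthetic, imbalanced data from make_classification instead of the APS files, and the parameter grid, fold count, and scoring metric are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced binary data standing in for the APS dataset.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid: four values of the inverse regularization strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# One model is fit and cross-validated for every grid combination.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```

With four candidate values and five folds, this fits 20 models plus one final refit on the full training split, which is why grid search time grows quickly as the grid gets larger.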
The following image shows how we fit this model on the APS dataset.
Here we fit the logistic regression model on the available data with 99% accuracy. The results and a comparison across the two languages are shown below. Fitting this model takes 146.85 seconds and uses 1.5 MB of memory. Now let’s look at the implementation of the same model in Julia.
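One way to collect time and memory figures like these in Python, offered as an assumption and not as the method used for the numbers above, is the standard library's time.perf_counter and tracemalloc. The helper name measure is hypothetical, and the summing workload is only a stand-in for a model fit.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Stand-in workload; in practice this would be a call such as model.fit.
result, seconds, peak_bytes = measure(sum, range(1_000_000))
print(f"elapsed: {seconds:.3f} s, peak traced memory: {peak_bytes / 1e6:.3f} MB")
```

tracemalloc only reports allocations that go through Python's memory allocators, so its peak can be lower than the process memory shown by the operating system.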
The Julia implementation uses the same dataset and the same methods for fitting the model. Here too we use mean imputation to preprocess the data: each missing value is replaced with the mean of the observed (non-missing) values of that variable. Note that the dataset is not well balanced and contains very few positive records.
We used the same method in both Julia and Python. The following image shows the implementation details of the logistic regression model in Julia.
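The imbalance matters when reading an accuracy figure. As a small worked example using only the training set counts given earlier, the sketch below computes the accuracy of a trivial classifier that always predicts the negative class. The variable names are illustrative.

```python
# Training set class counts as stated in the dataset description.
n_negative = 59_000
n_positive = 1_000
n_total = n_negative + n_positive

# A classifier that always predicts "negative" gets every negative
# example right and every positive example wrong.
baseline_accuracy = n_negative / n_total
baseline_recall = 0 / n_positive  # no APS failure is ever detected

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall on positives: {baseline_recall:.1f}")
```

Since this baseline already reaches roughly 98% accuracy on the training distribution, metrics such as recall or precision on the positive class are a useful complement to accuracy on this dataset.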
The model built with Julia produces the same results as the Python implementation, but it differs in time and memory use, as the following image shows. Julia takes less time than Python for this example: 126.85 seconds compared with Python’s 146.85, about 14% less. This is one of the notable differences between Julia and Python. Julia is faster in part because it is not interpreted: it is compiled just in time (JIT), at run time, using the LLVM framework.
Conclusion
- Speed. In the example above, Julia fit the model faster than Python (126.85 seconds versus 146.85 seconds). Julia is designed to approach the speed of C.
- Community. Python is older and more popular than Julia and has greater community support.
- Code Conversion. Julia is easy to write, and converting code from C codebases to Julia is easier than converting it to Python.
References
- Air pressure system failures in Scania trucks. (n.d.). Kaggle.Com. Retrieved August 25, 2020, from https://www.kaggle.com/uciml/aps-failure-at-scania-trucks-data-set
- Bezanson, J. (2019). The Julia Language. Julialang.Org. https://julialang.org
- Gondek, C., Hafner, D., & Sampson, O. (2016, October). Prediction of Failures in the Air Pressure System of Scania Trucks Using a Random Forest and Feature Engineering. https://www.researchgate.net/publication/309195602_Prediction_of_Failures_in_the_Air_Pressure_System_of_Scania_Trucks_Using_a_Random_Forest_and_Feature_Engineering
- Industrial Challenge. (n.d.). Ida2016.Blogs.Dsv.Su.Se. Retrieved August 25, 2020, from https://ida2016.blogs.dsv.su.se/?page_id=1387
- Lindgren, T., & Biteus, J. (2016, September). UCI Machine Learning Repository: APS Failure at Scania Trucks Data Set. Archive.Ics.Uci.Edu. https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks
- Paul, S. (2018, August 15). Hyperparameter Optimization in Machine Learning Models. DataCamp.
- Python.org. (2019, May 29). Python.Org; Python.org. https://www.python.org
- Quick start guide – ScikitLearn.jl. (n.d.). Scikitlearnjl.Readthedocs.Io. Retrieved August 25, 2020, from https://scikitlearnjl.readthedocs.io/en/latest/quickstart
- sklearn.linear_model.LogisticRegression — scikit-learn 0.21.2 documentation. (2014). Scikit-Learn.Org. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html