Yifan Chen and Wenhao Ruan, Databricks

Thursday, April 18, 2019 (11:00 am)

Databricks is a unified analytics engine that allows rapid development of data science applications using machine learning techniques such as classification, linear and nonlinear regression, and clustering. The myriad sophisticated computation options, however, can overwhelm designers, as it is not always clear which choices will produce the best predictive model for a specific data set. Further, the high dimensionality of big data sets makes it challenging for data scientists to gain a deep understanding of the results obtained by a given model.

Our research provides general guidelines for utilizing a variety of machine learning algorithms on the cloud computing platform Databricks. Visualization is an important means for users to understand the significance of the underlying data. We therefore also demonstrate how graphical tools, such as Tableau, can be used to efficiently examine the results of classification or clustering. Dimensionality reduction techniques, such as Principal Component Analysis (PCA), which help reduce the number of features in a learning experiment, are also discussed.

To demonstrate the utility of Databricks tools, two big data sets are used for performing clustering and classification. A variety of machine learning algorithms are applied to both data sets, and it is shown how to obtain the most accurate learning models by employing appropriate evaluation methods. During the presentation, we will introduce the workflow for training an ML model and describe how to choose the proper classification and regression algorithms.

One of the data sets will be used to demonstrate how we implemented unsupervised learning (K-means) on an unlabeled data set for classification (kernel SVM). We will also briefly discuss model evaluation and time efficiency. Finally, we will present the visualization of the classification results after applying PCA.