Apache Spark MLlib

Explore Apache Spark MLlib for distributed machine learning. Use scalable algorithms for classification, regression, and more on big data. Build ML pipelines for feature engineering using Spark's built-in library.

intermediate30 min5 steps

The play

Set up Spark Environment
Ensure you have Apache Spark installed and configured. You can download Spark from the official website or use a cloud-based Spark environment like Databricks.
Load Data into Spark
Load your data into a Spark DataFrame. This example reads a CSV file, but you can adapt it for other formats.
Feature Engineering with MLlib
Use MLlib's feature transformers to prepare your data. This example uses VectorAssembler to combine multiple columns into a single feature vector.
Train a Machine Learning Model
Train a machine learning model using MLlib. This example trains a Logistic Regression model.
Evaluate the Model
Evaluate the trained model using MLlib's evaluation metrics.

Starter code

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val spark = SparkSession.builder().appName("MLlibExample").master("local[*]").getOrCreate()

val data = spark.read.option("header", "true").option("inferSchema", "true").csv("path/to/your/data.csv")

val assembler = new VectorAssembler()
  .setInputCols(Array("feature1", "feature2", "feature3"))
  .setOutputCol("features")

val output = assembler.transform(data)

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = lr.fit(output)

val predictions = model.transform(output)

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")

val accuracy = evaluator.evaluate(predictions)
println(s"Area under ROC = ${accuracy}")

Source

Articlespark.apache.org