Skip to main content
Article·spark.apache.org
distributed-mlbig-datasparkfeature-engineeringscala

Apache Spark MLlib

Explore Apache Spark MLlib for distributed machine learning. Use scalable algorithms for classification, regression, and more on big data. Build ML pipelines for feature engineering using Spark's built-in library.

intermediate30 min5 steps
The play
  1. Set up Spark Environment
    Ensure you have Apache Spark installed and configured. You can download Spark from the official website or use a cloud-based Spark environment like Databricks.
  2. Load Data into Spark
    Load your data into a Spark DataFrame. This example reads a CSV file, but you can adapt it for other formats.
  3. Feature Engineering with MLlib
    Use MLlib's feature transformers to prepare your data. This example uses VectorAssembler to combine multiple columns into a single feature vector.
  4. Train a Machine Learning Model
    Train a machine learning model using MLlib. This example trains a Logistic Regression model.
  5. Evaluate the Model
    Evaluate the trained model using MLlib's evaluation metrics.
Starter code
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val spark = SparkSession.builder().appName("MLlibExample").master("local[*]").getOrCreate()

val data = spark.read.option("header", "true").option("inferSchema", "true").csv("path/to/your/data.csv")

val assembler = new VectorAssembler()
  .setInputCols(Array("feature1", "feature2", "feature3"))
  .setOutputCol("features")

val output = assembler.transform(data)

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = lr.fit(output)

val predictions = model.transform(output)

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")

val accuracy = evaluator.evaluate(predictions)
println(s"Area under ROC = ${accuracy}")
Source
Apache Spark MLlib — Action Pack