Article·spark.apache.org
distributed-ml · big-data · spark · feature-engineering · scala
Apache Spark MLlib
Explore Apache Spark MLlib for distributed machine learning. Use scalable algorithms for classification, regression, and more on big data. Build ML pipelines for feature engineering using Spark's built-in library.
intermediate · 30 min · 5 steps
The play
- Set up Spark Environment: Install and configure Apache Spark. You can download Spark from the official website or use a cloud-based Spark environment like Databricks.
- Load Data into Spark: Load your data into a Spark DataFrame. This example reads a CSV file with a header row and inferred column types, but you can adapt it for other formats.
- Feature Engineering with MLlib: Use MLlib's feature transformers to prepare your data. This example uses VectorAssembler to combine multiple numeric columns into the single feature vector column that MLlib estimators expect.
- Train a Machine Learning Model: Train a machine learning model using MLlib. This example fits a Logistic Regression classifier on the assembled features.
- Evaluate the Model: Evaluate the trained model using MLlib's evaluation metrics. This example reports area under the ROC curve with BinaryClassificationEvaluator.
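Area under the ROC curve, the default metric of MLlib's BinaryClassificationEvaluator, can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below computes that pairwise quantity in plain Scala on made-up scores. It illustrates what the metric means and is not how Spark computes it; `AucSketch` and `pairwiseAuc` are hypothetical names introduced here.

```scala
object AucSketch {
  /** Fraction of (positive, negative) pairs where the positive example
    * has the higher score; tied scores count as half a correct pair.
    * Each element is (score, label) with label 1 = positive, 0 = negative.
    */
  def pairwiseAuc(scored: Seq[(Double, Int)]): Double = {
    val pos = scored.collect { case (s, 1) => s }
    val neg = scored.collect { case (s, 0) => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one example of each class")
    val credit = for (p <- pos; n <- neg) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    credit.sum / (pos.size.toLong * neg.size)
  }

  def main(args: Array[String]): Unit = {
    // Made-up (score, label) pairs, for illustration only.
    val scored = Seq((0.9, 1), (0.8, 0), (0.7, 1), (0.1, 0))
    println(pairwiseAuc(scored)) // prints 0.75
  }
}
```

In the sample data, three of the four positive/negative pairs are ordered correctly, which gives 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.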
Starter code
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
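// Start a local Spark session; local[*] uses all available cores.
val spark = SparkSession.builder().appName("MLlibExample").master("local[*]").getOrCreate()
// Read a CSV file with a header row and let Spark infer the column types.
val data = spark.read.option("header", "true").option("inferSchema", "true").csv("path/to/your/data.csv")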
// Combine the input columns into a single vector column named "features".
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1", "feature2", "feature3"))
  .setOutputCol("features")
val output = assembler.transform(data)
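// Hold out 20% of the rows so the model is not scored on its own training data.
val Array(train, test) = output.randomSplit(Array(0.8, 0.2), seed = 42L)
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
val model = lr.fit(train)
// Score the held-out rows; this adds rawPrediction, probability, and prediction columns.
val predictions = model.transform(test)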
// Area under the ROC curve is a ranking metric, not accuracy.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")
val auc = evaluator.evaluate(predictions)
println(s"Area under ROC = $auc")
spark.stop()
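The description mentions building ML pipelines, which the starter code does not do: it calls each stage by hand. A minimal sketch of the same steps chained into one `Pipeline` might look like the following. The `StandardScaler` stage, the `rawFeatures` column name, the regularization value, and the in-memory rows are assumptions added for illustration; `PipelineSketch` and `fitPipeline` are hypothetical names.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {
  // Assemble, scale, and classify in one fitted pipeline.
  // Column names mirror the starter code.
  def fitPipeline(training: DataFrame): PipelineModel = {
    val assembler = new VectorAssembler()
      .setInputCols(Array("feature1", "feature2", "feature3"))
      .setOutputCol("rawFeatures")
    // Illustrative extra stage (an assumption, not in the starter code).
    val scaler = new StandardScaler()
      .setInputCol("rawFeatures")
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setRegParam(0.01) // assumed value, chosen only for the sketch
    new Pipeline()
      .setStages(Array[PipelineStage](assembler, scaler, lr))
      .fit(training)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PipelineSketch")
      .master("local[*]")
      .getOrCreate()
    // Made-up rows so the sketch runs without an input file.
    val training = spark.createDataFrame(Seq(
      (0.1, 1.0, 0.3, 0.0),
      (0.9, 0.2, 0.8, 1.0),
      (0.2, 0.9, 0.1, 0.0),
      (0.8, 0.1, 0.9, 1.0)
    )).toDF("feature1", "feature2", "feature3", "label")

    val model = fitPipeline(training)
    // The fitted PipelineModel applies every stage in order.
    model.transform(training).select("label", "probability", "prediction").show(false)
    spark.stop()
  }
}
```

Fitting the stages together means the fitted `PipelineModel` carries the scaler's statistics with it, so the same transformations are applied to any new data passed to `transform`.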