public final class ALSUpdate extends MLUpdate<String>
A specialization of MLUpdate that creates a matrix factorization model of its
input, using the Alternating Least Squares algorithm.
The implementation is built on Spark MLlib's implementation of ALS, which is in turn based on the paper Collaborative Filtering for Implicit Feedback Datasets. The parameters used below and in the configuration follow this paper as given.
Note that this also adds support for log transformation of strength values, as suggested in Equation 6 of the paper.
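As a rough illustration of the log transformation mentioned above: Equation 6 of the paper defines the confidence as c = 1 + α log(1 + r / ε), where r is the raw strength value. The minimal Java sketch below computes that quantity for a single value. The names `LogStrengthSketch`, `logTransform` and `confidence` are hypothetical and not part of ALSUpdate, and the sketch does not claim to show how ALSUpdate applies the transformation internally.

```java
/** Hypothetical helper illustrating Equation 6 of the paper; not part of ALSUpdate. */
final class LogStrengthSketch {

    private LogStrengthSketch() {}

    /** Log-transforms a raw strength r as log(1 + r / epsilon). */
    static double logTransform(double strength, double epsilon) {
        if (epsilon <= 0.0) {
            throw new IllegalArgumentException("epsilon must be positive");
        }
        if (strength < 0.0) {
            // Simplifying assumption of this sketch: only non-negative strengths.
            throw new IllegalArgumentException("strength must be non-negative");
        }
        // log1p(x) computes log(1 + x) accurately for small x.
        return Math.log1p(strength / epsilon);
    }

    /** Confidence c = 1 + alpha * log(1 + r / epsilon). */
    static double confidence(double strength, double alpha, double epsilon) {
        return 1.0 + alpha * logTransform(strength, epsilon);
    }

    public static void main(String[] args) {
        // Parameter values chosen only for illustration.
        System.out.println(confidence(0.0, 40.0, 1.0));
        System.out.println(confidence(3.0, 40.0, 1.0));
    }
}
```

A strength of zero yields a confidence of exactly 1, and larger strengths raise the confidence only logarithmically, which dampens the influence of very large values.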
Fields inherited from class MLUpdate: MODEL_FILE_NAME

| Constructor and Description |
|---|
| ALSUpdate(com.typesafe.config.Config config) |

| Modifier and Type | Method and Description |
|---|---|
| org.dmg.pmml.PMML | buildModel(org.apache.spark.api.java.JavaSparkContext sparkContext, org.apache.spark.api.java.JavaRDD<String> trainData, List<?> hyperParameters, org.apache.hadoop.fs.Path candidatePath) |
| boolean | canPublishAdditionalModelData() |
| double | evaluate(org.apache.spark.api.java.JavaSparkContext sparkContext, org.dmg.pmml.PMML model, org.apache.hadoop.fs.Path modelParentPath, org.apache.spark.api.java.JavaRDD<String> testData, org.apache.spark.api.java.JavaRDD<String> trainData) |
| List<HyperParamValues<?>> | getHyperParameterValues() |
| void | publishAdditionalModelData(org.apache.spark.api.java.JavaSparkContext sparkContext, org.dmg.pmml.PMML pmml, org.apache.spark.api.java.JavaRDD<String> newData, org.apache.spark.api.java.JavaRDD<String> pastData, org.apache.hadoop.fs.Path modelParentPath, TopicProducer<String,String> modelUpdateTopic): Optionally, publish additional model-related information to the update topic, after the model has been written. |
| protected Pair<org.apache.spark.api.java.JavaRDD<String>,org.apache.spark.api.java.JavaRDD<String>> | splitNewDataToTrainTest(org.apache.spark.api.java.JavaRDD<String> newData): Implementation which splits based solely on time. |
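
The idea of a split based solely on time, as described for splitNewDataToTrainTest, can be sketched without Spark on a plain list. This is a minimal sketch under stated assumptions, not the ALSUpdate implementation: it assumes each input line ends with a numeric timestamp as its last comma-separated field, and `TimeSplitSketch`, `Split`, `splitByTime` and the `trainFraction` parameter are hypothetical names. How `trainFraction` relates to MLUpdate.getTestFraction() is deliberately left open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical, Spark-free illustration of a purely time-based train/test split. */
final class TimeSplitSketch {

    private TimeSplitSketch() {}

    /** Simple holder for the two parts of the split. */
    static final class Split {
        final List<String> train;
        final List<String> test;

        Split(List<String> train, List<String> test) {
            this.train = train;
            this.test = test;
        }
    }

    /** Assumed format: the timestamp is the last comma-separated field. */
    static long timestampOf(String line) {
        return Long.parseLong(line.substring(line.lastIndexOf(',') + 1).trim());
    }

    /** Earliest trainFraction share of lines becomes train, the rest test. */
    static Split splitByTime(List<String> lines, double trainFraction) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be in [0,1]");
        }
        // Copy before sorting so the caller's list is left untouched.
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingLong(TimeSplitSketch::timestampOf));
        int trainCount = (int) Math.round(sorted.size() * trainFraction);
        List<String> train = new ArrayList<>(sorted.subList(0, trainCount));
        List<String> test = new ArrayList<>(sorted.subList(trainCount, sorted.size()));
        return new Split(train, test);
    }

    public static void main(String[] args) {
        // Made-up sample records, for illustration only.
        List<String> lines = Arrays.asList(
            "u1,i1,1.0,300", "u2,i2,1.0,100", "u3,i3,1.0,200", "u4,i4,1.0,400");
        Split split = splitByTime(lines, 0.75);
        System.out.println("train: " + split.train);
        System.out.println("test: " + split.test);
    }
}
```

Because the boundary depends only on timestamps, every test record is at least as recent as every training record, so the model is never evaluated on data older than what it was trained on.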

Methods inherited from class MLUpdate: getTestFraction, runUpdate

public List<HyperParamValues<?>> getHyperParameterValues()

getHyperParameterValues in class MLUpdate<String>
Returns: one HyperParamValues per hyperparameter. Different combinations of the values derived from the list will be passed back into MLUpdate.buildModel(JavaSparkContext,JavaRDD,List,Path)

public org.dmg.pmml.PMML buildModel(org.apache.spark.api.java.JavaSparkContext sparkContext,
org.apache.spark.api.java.JavaRDD<String> trainData,
List<?> hyperParameters,
org.apache.hadoop.fs.Path candidatePath)

buildModel in class MLUpdate<String>
Parameters:
sparkContext - active Spark Context
trainData - training data on which to build a model
hyperParameters - ordered list of hyper parameter values to use in building model
candidatePath - directory where additional model files can be written
Returns: PMML representation of a model trained on the given data

public double evaluate(org.apache.spark.api.java.JavaSparkContext sparkContext,
org.dmg.pmml.PMML model,
org.apache.hadoop.fs.Path modelParentPath,
org.apache.spark.api.java.JavaRDD<String> testData,
org.apache.spark.api.java.JavaRDD<String> trainData)

evaluate in class MLUpdate<String>
Parameters:
sparkContext - active Spark Context
model - model to evaluate
modelParentPath - directory containing model files, if applicable
testData - data on which to test the model performance
trainData - data on which model was trained, which can also be useful in evaluating unsupervised learning problems

public boolean canPublishAdditionalModelData()

canPublishAdditionalModelData in class MLUpdate<String>
Returns: true if and only if additional updates must be published along with the model; that is, if MLUpdate.publishAdditionalModelData(JavaSparkContext, PMML, JavaRDD, JavaRDD, Path, TopicProducer) must be called. This is only applicable for special model types.

public void publishAdditionalModelData(org.apache.spark.api.java.JavaSparkContext sparkContext,
org.dmg.pmml.PMML pmml,
org.apache.spark.api.java.JavaRDD<String> newData,
org.apache.spark.api.java.JavaRDD<String> pastData,
org.apache.hadoop.fs.Path modelParentPath,
TopicProducer<String,String> modelUpdateTopic)

Description copied from class: MLUpdate
Optionally, publish additional model-related information to the update topic, after the model has been written.
publishAdditionalModelData in class MLUpdate<String>
Parameters:
sparkContext - active Spark Context
pmml - model for which extra data should be written
newData - data that has arrived in current interval
pastData - all previously-known data (may be null)
modelParentPath - directory containing model files, if applicable
modelUpdateTopic - message topic to write to

protected Pair<org.apache.spark.api.java.JavaRDD<String>,org.apache.spark.api.java.JavaRDD<String>> splitNewDataToTrainTest(org.apache.spark.api.java.JavaRDD<String> newData)

Implementation which splits based solely on time. It takes MLUpdate.getTestFraction() of input, ordered by timestamp, as new training data and the rest as test data.
splitNewDataToTrainTest in class MLUpdate<String>
Parameters:
newData - data that has arrived in the current input batch
Returns: Pair of train, test RDDs.

Copyright © 2014–2018. All rights reserved.