public final class ALSUpdate extends MLUpdate<String>
A specialization of MLUpdate that creates a matrix factorization model of its input, using the Alternating Least Squares algorithm.
The implementation is built on Spark MLlib's implementation of ALS, which is in turn based on the paper Collaborative Filtering for Implicit Feedback Datasets. The parameters used below and in the configuration follow this paper as given.
Note that this also adds support for log transformation of strength values, as suggested in Equation 6 of the paper.
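As an illustration of the log transformation mentioned above, Equation 6 of the paper defines the confidence for a raw strength r as c = 1 + α log(1 + r/ε). The following is a minimal sketch of that formula only. The class and method names are hypothetical and are not part of this API, and it does not show how ALSUpdate applies the transformation internally.

```java
/** Hypothetical helper illustrating Equation 6 of the paper; not part of this API. */
final class LogStrengthSketch {

    private LogStrengthSketch() {}

    /**
     * Computes c = 1 + alpha * ln(1 + strength / epsilon).
     * Assumes strength is non-negative and epsilon is positive.
     */
    static double confidence(double strength, double alpha, double epsilon) {
        if (strength < 0.0 || epsilon <= 0.0) {
            throw new IllegalArgumentException("strength must be >= 0 and epsilon > 0");
        }
        // log1p(x) computes ln(1 + x) accurately for small x
        return 1.0 + alpha * Math.log1p(strength / epsilon);
    }

    public static void main(String[] args) {
        // Large raw strengths are compressed by the logarithm
        for (double r : new double[] {0.0, 1.0, 10.0, 100.0}) {
            System.out.println(r + " -> " + confidence(r, 1.0, 1.0));
        }
    }
}
```

The logarithm dampens very large strength values, so a handful of heavy interactions does not dominate the factorization.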
Fields inherited from class MLUpdate: MODEL_FILE_NAME
Constructor and Description |
---|
ALSUpdate(com.typesafe.config.Config config) |
Modifier and Type | Method and Description |
---|---|
org.dmg.pmml.PMML | buildModel(org.apache.spark.api.java.JavaSparkContext sparkContext, org.apache.spark.api.java.JavaRDD<String> trainData, List<?> hyperParameters, org.apache.hadoop.fs.Path candidatePath) |
boolean | canPublishAdditionalModelData() |
double | evaluate(org.apache.spark.api.java.JavaSparkContext sparkContext, org.dmg.pmml.PMML model, org.apache.hadoop.fs.Path modelParentPath, org.apache.spark.api.java.JavaRDD<String> testData, org.apache.spark.api.java.JavaRDD<String> trainData) |
List<HyperParamValues<?>> | getHyperParameterValues() |
void | publishAdditionalModelData(org.apache.spark.api.java.JavaSparkContext sparkContext, org.dmg.pmml.PMML pmml, org.apache.spark.api.java.JavaRDD<String> newData, org.apache.spark.api.java.JavaRDD<String> pastData, org.apache.hadoop.fs.Path modelParentPath, TopicProducer<String,String> modelUpdateTopic): Optionally, publish additional model-related information to the update topic, after the model has been written. |
protected Pair<org.apache.spark.api.java.JavaRDD<String>,org.apache.spark.api.java.JavaRDD<String>> | splitNewDataToTrainTest(org.apache.spark.api.java.JavaRDD<String> newData): Implementation which splits based solely on time. |
Methods inherited from class MLUpdate: getTestFraction, runUpdate
public List<HyperParamValues<?>> getHyperParameterValues()
getHyperParameterValues in class MLUpdate<String>
Returns: a list with one HyperParamValues per hyperparameter. Different combinations of the values derived from the list will be passed back into MLUpdate.buildModel(JavaSparkContext,JavaRDD,List,Path).
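To make "different combinations of the values" concrete, the sketch below enumerates every combination of candidate values, taking one value per hyperparameter. It uses plain lists instead of HyperParamValues, the class name is hypothetical, and enumerating the full Cartesian product is an assumption; the framework may try only a subset of these combinations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of combining per-hyperparameter candidate values. */
final class HyperParamComboSketch {

    private HyperParamComboSketch() {}

    /** Returns every combination that takes exactly one value from each inner list, in order. */
    static List<List<Object>> combinations(List<List<?>> valuesPerParam) {
        List<List<Object>> combos = new ArrayList<>();
        combos.add(new ArrayList<>()); // start with a single empty combination
        for (List<?> values : valuesPerParam) {
            List<List<Object>> next = new ArrayList<>();
            for (List<Object> prefix : combos) {
                for (Object value : values) {
                    List<Object> extended = new ArrayList<>(prefix);
                    extended.add(value);
                    next.add(extended);
                }
            }
            combos = next;
        }
        return combos;
    }

    public static void main(String[] args) {
        // Made-up candidate values for three hyperparameters
        List<List<?>> ranges = new ArrayList<>();
        ranges.add(Arrays.asList(10, 20));
        ranges.add(Arrays.asList(0.001, 0.01));
        ranges.add(Arrays.asList(1.0));
        for (List<Object> combo : combinations(ranges)) {
            // Each combo has the same ordering as the hyperParameters list given to buildModel
            System.out.println(combo);
        }
    }
}
```

Each resulting list keeps the hyperparameters in the same order as the input, matching the "ordered list" that buildModel receives.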
public org.dmg.pmml.PMML buildModel(org.apache.spark.api.java.JavaSparkContext sparkContext, org.apache.spark.api.java.JavaRDD<String> trainData, List<?> hyperParameters, org.apache.hadoop.fs.Path candidatePath)
buildModel in class MLUpdate<String>
Parameters:
sparkContext - active Spark context
trainData - training data on which to build a model
hyperParameters - ordered list of hyperparameter values to use in building the model
candidatePath - directory where additional model files can be written
Returns: PMML representation of a model trained on the given data

public double evaluate(org.apache.spark.api.java.JavaSparkContext sparkContext, org.dmg.pmml.PMML model, org.apache.hadoop.fs.Path modelParentPath, org.apache.spark.api.java.JavaRDD<String> testData, org.apache.spark.api.java.JavaRDD<String> trainData)
evaluate in class MLUpdate<String>
Parameters:
sparkContext - active Spark context
model - model to evaluate
modelParentPath - directory containing model files, if applicable
testData - data on which to test the model performance
trainData - data on which the model was trained, which can also be useful in evaluating unsupervised learning problems

public boolean canPublishAdditionalModelData()
canPublishAdditionalModelData in class MLUpdate<String>
Returns: true iff additional updates must be published along with the model; that is, if MLUpdate.publishAdditionalModelData(JavaSparkContext, PMML, JavaRDD, JavaRDD, Path, TopicProducer) must be called. This is only applicable for special model types.

public void publishAdditionalModelData(org.apache.spark.api.java.JavaSparkContext sparkContext, org.dmg.pmml.PMML pmml, org.apache.spark.api.java.JavaRDD<String> newData, org.apache.spark.api.java.JavaRDD<String> pastData, org.apache.hadoop.fs.Path modelParentPath, TopicProducer<String,String> modelUpdateTopic)
Description copied from class: MLUpdate
Optionally, publish additional model-related information to the update topic, after the model has been written.
publishAdditionalModelData in class MLUpdate<String>
Parameters:
sparkContext - active Spark context
pmml - model for which extra data should be written
newData - data that has arrived in the current interval
pastData - all previously-known data (may be null)
modelParentPath - directory containing model files, if applicable
modelUpdateTopic - message topic to write to

protected Pair<org.apache.spark.api.java.JavaRDD<String>,org.apache.spark.api.java.JavaRDD<String>> splitNewDataToTrainTest(org.apache.spark.api.java.JavaRDD<String> newData)
Implementation which splits based solely on time. The input is ordered by timestamp and divided at a boundary set by MLUpdate.getTestFraction(): the earlier portion becomes new training data and the rest becomes test data.
splitNewDataToTrainTest in class MLUpdate<String>
Parameters:
newData - data that has arrived in the current input batch
Returns: Pair of train, test RDDs.
Copyright © 2014–2018. All rights reserved.
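The time-based split described for splitNewDataToTrainTest can be sketched on plain in-memory lists. This is an illustration under stated assumptions, not the implementation in this class: the class name is hypothetical, each line is assumed to end with a comma-separated numeric timestamp, and the boundary is approximated from the minimum and maximum timestamps, which presumes timestamps are spread fairly evenly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of a purely time-based train/test split. */
final class TimeSplitSketch {

    private TimeSplitSketch() {}

    /** Assumes the timestamp is the last comma-separated field of the line. */
    static long timestampOf(String line) {
        return Long.parseLong(line.substring(line.lastIndexOf(',') + 1).trim());
    }

    /** Returns [train, test]: lines before the time boundary, then all remaining lines. */
    static List<List<String>> split(List<String> lines, double testFraction) {
        if (lines.isEmpty()) {
            return Arrays.asList(Collections.<String>emptyList(), Collections.<String>emptyList());
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (String line : lines) {
            long ts = timestampOf(line);
            min = Math.min(min, ts);
            max = Math.max(max, ts);
        }
        // The latest testFraction of the time range is held out as test data
        long boundary = (long) (max - testFraction * (max - min));
        List<String> train = new ArrayList<>();
        List<String> test = new ArrayList<>();
        for (String line : lines) {
            (timestampOf(line) < boundary ? train : test).add(line);
        }
        return Arrays.asList(train, test);
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            lines.add("user" + i + ",item" + i + ",1," + (i * 10)); // made-up records
        }
        List<List<String>> parts = split(lines, 0.2);
        System.out.println("train=" + parts.get(0).size() + " test=" + parts.get(1).size());
    }
}
```

Splitting on time rather than at random keeps the test set strictly later than the training set, which mirrors how a deployed model is evaluated against data that arrives after training.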