ALSUpdate (Oryx 2.8.0 API)

java.lang.Object
- com.cloudera.oryx.ml.MLUpdate<String>
- - com.cloudera.oryx.app.batch.mllib.als.ALSUpdate

All Implemented Interfaces:

BatchLayerUpdate<Object,String,String>, Serializable
```
public final class ALSUpdate
extends MLUpdate<String>
```
A specialization of MLUpdate that creates a matrix factorization model of its input, using the Alternating Least Squares algorithm.

The implementation is built on Spark MLlib's implementation of ALS, which is in turn based on the paper "Collaborative Filtering for Implicit Feedback Datasets". The parameters used below and in the configuration follow those given in the paper.

Note that this also adds support for log transformation of strength values, as suggested in Equation 6 of the paper.
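
As a rough illustration of that transformation: Equation 6 of the paper replaces the raw strength r with log(1 + r/ε) and scales it by α to obtain a confidence value, c = 1 + α log(1 + r/ε). The sketch below is not Oryx code; the class name, method names, and sample values are hypothetical and only show the arithmetic.

```java
// Hypothetical helper, not part of the Oryx API: illustrates Equation 6 of
// "Collaborative Filtering for Implicit Feedback Datasets".
final class LogStrengthSketch {

  // Log-transformed strength: log(1 + r / epsilon). Assumes r >= 0 and epsilon > 0.
  static double logStrength(double rawStrength, double epsilon) {
    return Math.log1p(rawStrength / epsilon);
  }

  // Confidence from Equation 6: c = 1 + alpha * log(1 + r / epsilon).
  static double confidence(double rawStrength, double alpha, double epsilon) {
    return 1.0 + alpha * logStrength(rawStrength, epsilon);
  }

  public static void main(String[] args) {
    double alpha = 1.0;    // assumed value, for illustration only
    double epsilon = 1.0;  // assumed value, for illustration only
    for (double r : new double[] {0.0, 1.0, 10.0, 100.0}) {
      System.out.printf("r=%.1f -> c=%.4f%n", r, confidence(r, alpha, epsilon));
    }
  }
}
```

The logarithm compresses large strength values, so a tenfold increase in raw strength yields a much smaller increase in confidence.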

See Also:

Serialized Form

Field Summary
- Fields inherited from class com.cloudera.oryx.ml.MLUpdate
  MODEL_FILE_NAME

Constructor Summary

Constructors
Constructor and Description

ALSUpdate(com.typesafe.config.Config config)

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| `org.dmg.pmml.PMML` | `buildModel(org.apache.spark.api.java.JavaSparkContext sparkContext, org.apache.spark.api.java.JavaRDD<String> trainData, List<?> hyperParameters, org.apache.hadoop.fs.Path candidatePath)` |
| `boolean` | `canPublishAdditionalModelData()` |
| `double` | `evaluate(org.apache.spark.api.java.JavaSparkContext sparkContext, org.dmg.pmml.PMML model, org.apache.hadoop.fs.Path modelParentPath, org.apache.spark.api.java.JavaRDD<String> testData, org.apache.spark.api.java.JavaRDD<String> trainData)` |
| `List<HyperParamValues<?>>` | `getHyperParameterValues()` |
| `void` | `publishAdditionalModelData(org.apache.spark.api.java.JavaSparkContext sparkContext, org.dmg.pmml.PMML pmml, org.apache.spark.api.java.JavaRDD<String> newData, org.apache.spark.api.java.JavaRDD<String> pastData, org.apache.hadoop.fs.Path modelParentPath, TopicProducer<String,String> modelUpdateTopic)` Optionally, publish additional model-related information to the update topic, after the model has been written. |
| `protected Pair<org.apache.spark.api.java.JavaRDD<String>,org.apache.spark.api.java.JavaRDD<String>>` | `splitNewDataToTrainTest(org.apache.spark.api.java.JavaRDD<String> newData)` Implementation which splits based solely on time. |

Methods inherited from class com.cloudera.oryx.ml.MLUpdate
getTestFraction, runUpdate

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - ALSUpdate
```
public ALSUpdate(com.typesafe.config.Config config)
```
- Method Detail
  - getHyperParameterValues
```
public List<HyperParamValues<?>> getHyperParameterValues()
```
    Overrides:
    
    getHyperParameterValues in class MLUpdate<String>
    
    Returns:
    
    a list of hyperparameter value ranges to try, one HyperParamValues per hyperparameter. Different combinations of the values derived from the list will be passed back into MLUpdate.buildModel(JavaSparkContext,JavaRDD,List,Path)
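
    To make "different combinations" concrete, here is a minimal sketch that enumerates one value per hyperparameter across all candidate lists. It is not Oryx code: the class, the method, and the example ranges are hypothetical, and the framework may well try only a subset of these combinations rather than the full Cartesian product shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Oryx code: enumerates every combination of
// candidate values, taking one value per hyperparameter.
final class HyperParamGridSketch {

  static List<List<Object>> combinations(List<List<?>> candidatesPerParam) {
    List<List<Object>> result = new ArrayList<>();
    result.add(new ArrayList<>()); // start with one empty combination
    for (List<?> candidates : candidatesPerParam) {
      List<List<Object>> extended = new ArrayList<>();
      for (List<Object> partial : result) {
        for (Object value : candidates) {
          List<Object> next = new ArrayList<>(partial);
          next.add(value);
          extended.add(next);
        }
      }
      result = extended;
    }
    return result;
  }

  public static void main(String[] args) {
    // Assumed example ranges for two hyperparameters (values are illustrative).
    List<List<?>> ranges = new ArrayList<>();
    ranges.add(Arrays.asList(10, 20));            // e.g. number of features
    ranges.add(Arrays.asList(0.001, 0.01, 0.1));  // e.g. regularization strength
    for (List<Object> combo : combinations(ranges)) {
      System.out.println(combo);
    }
  }
}
```

    Each inner list keeps the same ordering as the input, which mirrors the "ordered list of hyperparameter values" that a model-building method would receive.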
  - buildModel
```
public org.dmg.pmml.PMML buildModel(org.apache.spark.api.java.JavaSparkContext sparkContext,
                                    org.apache.spark.api.java.JavaRDD<String> trainData,
                                    List<?> hyperParameters,
                                    org.apache.hadoop.fs.Path candidatePath)
```
    Specified by:
    
    buildModel in class MLUpdate<String>
    
    Parameters:
    
    sparkContext - active Spark Context
    
    trainData - training data on which to build a model
    
    hyperParameters - ordered list of hyperparameter values to use in building the model
    
    candidatePath - directory where additional model files can be written
    
    Returns:
    
    a PMML representation of a model trained on the given data
  - evaluate
```
public double evaluate(org.apache.spark.api.java.JavaSparkContext sparkContext,
                       org.dmg.pmml.PMML model,
                       org.apache.hadoop.fs.Path modelParentPath,
                       org.apache.spark.api.java.JavaRDD<String> testData,
                       org.apache.spark.api.java.JavaRDD<String> trainData)
```
    Specified by:
    
    evaluate in class MLUpdate<String>
    
    Parameters:
    
    sparkContext - active Spark Context
    
    model - model to evaluate
    
    modelParentPath - directory containing model files, if applicable
    
    testData - data on which to test the model performance
    
    trainData - data on which model was trained, which can also be useful in evaluating unsupervised learning problems
    
    Returns:
    
    an evaluation of the model on the test data. Higher should mean "better"
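
    The "higher is better" contract matters when the natural metric is an error, where lower is better. One common way to reconcile the two is to negate the error. The sketch below shows this with RMSE; it is a hypothetical illustration, not a statement of which metric this class computes, and all names in it are invented for the example.

```java
// Hypothetical sketch, not Oryx code: turns an error metric (RMSE), where
// lower is better, into a score where higher is better by negating it.
final class HigherIsBetterSketch {

  static double rmse(double[] predicted, double[] actual) {
    if (predicted.length != actual.length || predicted.length == 0) {
      throw new IllegalArgumentException("arrays must be non-empty and equal length");
    }
    double sumSquaredError = 0.0;
    for (int i = 0; i < predicted.length; i++) {
      double diff = predicted[i] - actual[i];
      sumSquaredError += diff * diff;
    }
    return Math.sqrt(sumSquaredError / predicted.length);
  }

  // Score suitable for "higher is better" comparison between candidate models.
  static double score(double[] predicted, double[] actual) {
    return -rmse(predicted, actual);
  }

  public static void main(String[] args) {
    double[] actual = {1.0, 2.0, 3.0};
    double[] closeModel = {1.0, 2.0, 4.0};
    double[] farModel = {3.0, 4.0, 5.0};
    // The model with the smaller error gets the larger score.
    System.out.println(score(closeModel, actual) > score(farModel, actual));
  }
}
```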
  - canPublishAdditionalModelData
```
public boolean canPublishAdditionalModelData()
```
    Overrides:
    
    canPublishAdditionalModelData in class MLUpdate<String>
    
    Returns:
    
    true iff additional updates must be published along with the model, that is, iff MLUpdate.publishAdditionalModelData(JavaSparkContext, PMML, JavaRDD, JavaRDD, Path, TopicProducer) must be called. This is only applicable for special model types.
  - publishAdditionalModelData
```
public void publishAdditionalModelData(org.apache.spark.api.java.JavaSparkContext sparkContext,
                                       org.dmg.pmml.PMML pmml,
                                       org.apache.spark.api.java.JavaRDD<String> newData,
                                       org.apache.spark.api.java.JavaRDD<String> pastData,
                                       org.apache.hadoop.fs.Path modelParentPath,
                                       TopicProducer<String,String> modelUpdateTopic)
```
    Description copied from class: MLUpdate
    
    Optionally, publish additional model-related information to the update topic, after the model has been written. This is needed only in specific cases, like the ALS algorithm, where the model serialization in PMML can't contain all of the info.
    
    Overrides:
    
    publishAdditionalModelData in class MLUpdate<String>
    
    Parameters:
    
    sparkContext - active Spark Context
    
    pmml - model for which extra data should be written
    
    newData - data that has arrived in current interval
    
    pastData - all previously-known data (may be null)
    
    modelParentPath - directory containing model files, if applicable
    
    modelUpdateTopic - message topic to write to
  - splitNewDataToTrainTest
```
protected Pair<org.apache.spark.api.java.JavaRDD<String>,org.apache.spark.api.java.JavaRDD<String>> splitNewDataToTrainTest(org.apache.spark.api.java.JavaRDD<String> newData)
```
    Implementation which splits based solely on time. It returns approximately the earliest 1 - MLUpdate.getTestFraction() of the input, ordered by timestamp, as new training data, and the remaining, most recent MLUpdate.getTestFraction() of the input as test data.
    
    Overrides:
    
    splitNewDataToTrainTest in class MLUpdate<String>
    
    Parameters:
    
    newData - data that has arrived in the current input batch
    
    Returns:
    
    a Pair of train, test RDDs.
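
    A minimal sketch of a purely time-based split, assuming the test set is the most recent test-fraction share of the observed timestamp range and everything earlier is training data. It works on plain in-memory timestamps rather than RDDs, and the class and method names are hypothetical. Because the boundary is computed from the timestamp range, the resulting proportions are only approximate unless timestamps are evenly spread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Oryx code: splits timestamps into an earlier
// "train" part and a later "test" part at a boundary derived from the range.
final class TimeSplitSketch {

  // Returns a two-element list: index 0 is train (earlier), index 1 is test (later).
  static List<List<Long>> split(List<Long> timestamps, double testFraction) {
    List<Long> train = new ArrayList<>();
    List<Long> test = new ArrayList<>();
    List<List<Long>> result = new ArrayList<>();
    result.add(train);
    result.add(test);
    if (timestamps.isEmpty()) {
      return result; // nothing to split
    }
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long t : timestamps) {
      min = Math.min(min, t);
      max = Math.max(max, t);
    }
    // Boundary sits testFraction of the range back from the latest timestamp.
    long boundary = (long) (max - testFraction * (max - min));
    for (long t : timestamps) {
      if (t < boundary) {
        train.add(t);
      } else {
        test.add(t);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Long> timestamps = new ArrayList<>();
    for (long t = 0; t < 10; t++) {
      timestamps.add(t);
    }
    List<List<Long>> parts = split(timestamps, 0.2); // 0.2 is an assumed test fraction
    System.out.println("train=" + parts.get(0));
    System.out.println("test=" + parts.get(1));
  }
}
```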
