MLUpdate (Oryx 2.8.0 API)

java.lang.Object
- com.cloudera.oryx.ml.MLUpdate<M>

Type Parameters:

M - type of message to read from the input topic

All Implemented Interfaces:

BatchLayerUpdate<Object,M,String>, Serializable

Direct Known Subclasses:

ALSUpdate, KMeansUpdate, RDFUpdate
```
public abstract class MLUpdate<M>
extends Object
implements BatchLayerUpdate<Object,M,String>
```
A specialization of BatchLayerUpdate for machine learning-oriented update processes. This implementation contains the framework for tasks such as test/train splitting and parameter optimization. Subclasses instead implement methods like buildModel(JavaSparkContext,JavaRDD,List,Path) to create a PMML model, and evaluate(JavaSparkContext,PMML,Path,JavaRDD,JavaRDD) to evaluate a model against held-out test data.
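
The division of labor described above can be illustrated with a minimal plain-Java sketch. It is not the Oryx implementation: `UpdateFlowSketch`, `Candidate`, `ShrunkenMean` and `chooseBest` are hypothetical names, the "model" is a single double rather than PMML, and in-memory lists stand in for Spark RDDs. It only shows the general shape: a framework loop builds one candidate per hyperparameter value, scores each on held-out data, and keeps the highest score.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java analogue of a build/evaluate flow; no Oryx or Spark types are used. */
public class UpdateFlowSketch {

  /** Hypothetical stand-in for the two methods a subclass would supply. */
  abstract static class Candidate {
    abstract double buildModel(List<Double> trainData, double hyperParameter);
    abstract double evaluate(double model, List<Double> testData);
  }

  /** Toy "model": a shrunken mean, scored by negated mean squared error. */
  static class ShrunkenMean extends Candidate {
    @Override
    double buildModel(List<Double> trainData, double shrinkage) {
      double sum = 0.0;
      for (double d : trainData) {
        sum += d;
      }
      return (1.0 - shrinkage) * sum / trainData.size();
    }

    @Override
    double evaluate(double model, List<Double> testData) {
      double squaredError = 0.0;
      for (double d : testData) {
        squaredError += (d - model) * (d - model);
      }
      return -squaredError / testData.size(); // negated so that higher is better
    }
  }

  /** Builds one model per hyperparameter value and returns the best-scoring value. */
  static double chooseBest(Candidate candidate, List<Double> train, List<Double> test,
                           double[] hyperParameters) {
    double bestScore = Double.NEGATIVE_INFINITY;
    double bestParameter = Double.NaN;
    for (double hyperParameter : hyperParameters) {
      double model = candidate.buildModel(train, hyperParameter);
      double score = candidate.evaluate(model, test);
      if (score > bestScore) {
        bestScore = score;
        bestParameter = hyperParameter;
      }
    }
    return bestParameter;
  }

  public static void main(String[] args) {
    List<Double> train = Arrays.asList(1.0, 2.0, 3.0);
    List<Double> test = Arrays.asList(2.0, 2.0);
    double best = chooseBest(new ShrunkenMean(), train, test, new double[] {0.0, 0.5, 0.9});
    System.out.println("best shrinkage = " + best);
  }
}
```

Because the evaluation is an error measure, the sketch negates it before comparison so that a larger value always indicates the better candidate.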

See Also:

Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`MODEL_FILE_NAME`

Constructor Summary

Constructors
Modifier	Constructor and Description
`protected`	`MLUpdate(com.typesafe.config.Config config)`

Method Summary

Modifier and Type	Method and Description
`abstract org.dmg.pmml.PMML`	`buildModel(org.apache.spark.api.java.JavaSparkContext sparkContext, org.apache.spark.api.java.JavaRDD<M> trainData, List<?> hyperParameters, org.apache.hadoop.fs.Path candidatePath)`
`boolean`	`canPublishAdditionalModelData()`
`abstract double`	`evaluate(org.apache.spark.api.java.JavaSparkContext sparkContext, org.dmg.pmml.PMML model, org.apache.hadoop.fs.Path modelParentPath, org.apache.spark.api.java.JavaRDD<M> testData, org.apache.spark.api.java.JavaRDD<M> trainData)`
`List<HyperParamValues<?>>`	`getHyperParameterValues()`
`protected double`	`getTestFraction()`
`void`	`publishAdditionalModelData(org.apache.spark.api.java.JavaSparkContext sparkContext, org.dmg.pmml.PMML pmml, org.apache.spark.api.java.JavaRDD<M> newData, org.apache.spark.api.java.JavaRDD<M> pastData, org.apache.hadoop.fs.Path modelParentPath, TopicProducer<String,String> modelUpdateTopic)` Optionally, publish additional model-related information to the update topic, after the model has been written.
`void`	`runUpdate(org.apache.spark.api.java.JavaSparkContext sparkContext, long timestamp, org.apache.spark.api.java.JavaPairRDD<Object,M> newKeyMessageData, org.apache.spark.api.java.JavaPairRDD<Object,M> pastKeyMessageData, String modelDirString, TopicProducer<String,String> modelUpdateTopic)`
`protected Pair<org.apache.spark.api.java.JavaRDD<M>,org.apache.spark.api.java.JavaRDD<M>>`	`splitNewDataToTrainTest(org.apache.spark.api.java.JavaRDD<M> newData)` Default implementation which randomly splits new data into train/test sets.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - MODEL_FILE_NAME
```
public static final String MODEL_FILE_NAME
```
    See Also:
    
    Constant Field Values
- Constructor Detail
  - MLUpdate
```
protected MLUpdate(com.typesafe.config.Config config)
```
- Method Detail
  - getTestFraction
```
protected final double getTestFraction()
```
  - getHyperParameterValues
```
public List<HyperParamValues<?>> getHyperParameterValues()
```
    Returns:
    
    a list of hyperparameter value ranges to try, one HyperParamValues per hyperparameter. Different combinations of the values derived from the list will be passed back into buildModel(JavaSparkContext,JavaRDD,List,Path)
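
    One way to picture "different combinations of the values" is a full grid over the ranges. The sketch below is an assumption for illustration only: it uses plain lists instead of HyperParamValues, the names `HyperParamGridSketch` and `combinations` are hypothetical, and the strategy Oryx actually uses to choose combinations is not stated here and may differ (for example, it may sample rather than enumerate).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HyperParamGridSketch {

  /**
   * Returns every combination that takes one value from each range, in range order.
   * Each inner list plays the role of the ordered List<?> passed to buildModel.
   */
  static List<List<Object>> combinations(List<List<?>> ranges) {
    List<List<Object>> combos = new ArrayList<>();
    combos.add(new ArrayList<>()); // start from the single empty combination
    for (List<?> range : ranges) {
      List<List<Object>> extended = new ArrayList<>();
      for (List<Object> prefix : combos) {
        for (Object value : range) {
          List<Object> combo = new ArrayList<>(prefix);
          combo.add(value);
          extended.add(combo);
        }
      }
      combos = extended;
    }
    return combos;
  }

  public static void main(String[] args) {
    List<List<?>> ranges = Arrays.asList(
        Arrays.asList(10, 20),           // values for a first, hypothetical hyperparameter
        Arrays.asList(0.01, 0.1, 1.0));  // values for a second, hypothetical hyperparameter
    for (List<Object> combo : combinations(ranges)) {
      System.out.println(combo);
    }
  }
}
```

    The order of values inside each combination follows the order of the ranges, which matches the description of hyperParameters as an ordered list.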
  - buildModel
```
public abstract org.dmg.pmml.PMML buildModel(org.apache.spark.api.java.JavaSparkContext sparkContext,
                                             org.apache.spark.api.java.JavaRDD<M> trainData,
                                             List<?> hyperParameters,
                                             org.apache.hadoop.fs.Path candidatePath)
```
    Parameters:
    
    sparkContext - active Spark Context
    
    trainData - training data on which to build a model
    
    hyperParameters - ordered list of hyperparameter values to use in building the model
    
    candidatePath - directory where additional model files can be written
    
    Returns:
    
    a PMML representation of a model trained on the given data
  - canPublishAdditionalModelData
```
public boolean canPublishAdditionalModelData()
```
    Returns:
    
    true iff additional updates must be published along with the model, that is, iff publishAdditionalModelData(JavaSparkContext, PMML, JavaRDD, JavaRDD, Path, TopicProducer) must be called. This is only applicable for special model types.
  - publishAdditionalModelData
```
public void publishAdditionalModelData(org.apache.spark.api.java.JavaSparkContext sparkContext,
                                       org.dmg.pmml.PMML pmml,
                                       org.apache.spark.api.java.JavaRDD<M> newData,
                                       org.apache.spark.api.java.JavaRDD<M> pastData,
                                       org.apache.hadoop.fs.Path modelParentPath,
                                       TopicProducer<String,String> modelUpdateTopic)
```
    Optionally, publish additional model-related information to the update topic, after the model has been written. This is needed only in specific cases, like the ALS algorithm, where the model serialization in PMML can't contain all of the info.
    
    Parameters:
    
    sparkContext - active Spark Context
    
    pmml - model for which extra data should be written
    
    newData - data that has arrived in current interval
    
    pastData - all previously-known data (may be null)
    
    modelParentPath - directory containing model files, if applicable
    
    modelUpdateTopic - message topic to write to
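
    The relationship between canPublishAdditionalModelData and this method can be sketched as a simple guard. Everything in the example is a hypothetical stand-in: `Producer` is not the Oryx TopicProducer interface, `Update` is not MLUpdate, and the "EXTRA" key and payload are invented. The null check on the topic is an assumption motivated by the runUpdate documentation, which notes that the update topic may be null.

```java
import java.util.ArrayList;
import java.util.List;

public class AdditionalDataSketch {

  /** Hypothetical stand-in for a topic producer; not the Oryx TopicProducer interface. */
  interface Producer {
    void send(String key, String message);
  }

  /** Hypothetical stand-in for an update that has extra model data to publish. */
  static class Update {
    boolean canPublishAdditionalModelData() {
      return true;
    }

    void publishAdditionalModelData(Producer topic) {
      // Hypothetical key and payload, for illustration only
      topic.send("EXTRA", "data that did not fit in the PMML");
    }
  }

  /** Publishes extra data only when the update asks for it and a topic exists. */
  static boolean publishIfNeeded(Update update, Producer topic) {
    if (topic == null || !update.canPublishAdditionalModelData()) {
      return false;
    }
    update.publishAdditionalModelData(topic);
    return true;
  }

  public static void main(String[] args) {
    List<String> sent = new ArrayList<>();
    Producer topic = (key, message) -> sent.add(key + ":" + message);
    System.out.println(publishIfNeeded(new Update(), topic));
    System.out.println(publishIfNeeded(new Update(), null));
    System.out.println(sent);
  }
}
```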
  - evaluate
```
public abstract double evaluate(org.apache.spark.api.java.JavaSparkContext sparkContext,
                                org.dmg.pmml.PMML model,
                                org.apache.hadoop.fs.Path modelParentPath,
                                org.apache.spark.api.java.JavaRDD<M> testData,
                                org.apache.spark.api.java.JavaRDD<M> trainData)
```
    Parameters:
    
    sparkContext - active Spark Context
    
    model - model to evaluate
    
    modelParentPath - directory containing model files, if applicable
    
    testData - data on which to test the model performance
    
    trainData - data on which model was trained, which can also be useful in evaluating unsupervised learning problems
    
    Returns:
    
    an evaluation of the model on the test data. Higher should mean "better"
  - runUpdate
```
public void runUpdate(org.apache.spark.api.java.JavaSparkContext sparkContext,
                      long timestamp,
                      org.apache.spark.api.java.JavaPairRDD<Object,M> newKeyMessageData,
                      org.apache.spark.api.java.JavaPairRDD<Object,M> pastKeyMessageData,
                      String modelDirString,
                      TopicProducer<String,String> modelUpdateTopic)
               throws IOException,
                      InterruptedException
```
    Specified by:
    
    runUpdate in interface BatchLayerUpdate<Object,M,String>
    
    Parameters:
    
    sparkContext - Spark context
    
    timestamp - timestamp of current interval
    
    newKeyMessageData - data that has arrived in current interval
    
    pastKeyMessageData - all previously-known data (may be null)
    
    modelDirString - String representation of path where models should be output, if desired
    
    modelUpdateTopic - topic to push models onto, if desired. Note that this may be null if the application is configured to not produce updates to a topic
    
    Throws:
    
    IOException - if an error occurs during execution of the update function
    
    InterruptedException - if the caller is interrupted waiting for parallel tasks to complete
  - splitNewDataToTrainTest
```
protected Pair<org.apache.spark.api.java.JavaRDD<M>,org.apache.spark.api.java.JavaRDD<M>> splitNewDataToTrainTest(org.apache.spark.api.java.JavaRDD<M> newData)
```
    Default implementation which randomly splits new data into train/test sets. This handles the case where getTestFraction() is neither 0 nor 1.
    
    Parameters:
    
    newData - data that has arrived in the current input batch
    
    Returns:
    
    a Pair of train, test RDDs.
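
    A random split driven by a test fraction can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the method's actual body: it works on an in-memory list rather than a JavaRDD, returns a two-element list rather than a Pair, and the names `TrainTestSplitSketch` and `split` and the explicit `seed` parameter are hypothetical. With Spark, a comparable effect is commonly obtained with JavaRDD.randomSplit; whether this class does so is not stated here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class TrainTestSplitSketch {

  /** Returns a two-element list: index 0 is the train set, index 1 is the test set. */
  static <T> List<List<T>> split(List<T> newData, double testFraction, long seed) {
    // Mirrors the documented precondition: the fraction is neither 0 nor 1
    if (testFraction <= 0.0 || testFraction >= 1.0) {
      throw new IllegalArgumentException("testFraction must be strictly between 0 and 1");
    }
    Random random = new Random(seed);
    List<T> train = new ArrayList<>();
    List<T> test = new ArrayList<>();
    for (T item : newData) {
      // Each item independently lands in the test set with probability testFraction
      if (random.nextDouble() < testFraction) {
        test.add(item);
      } else {
        train.add(item);
      }
    }
    return Arrays.asList(train, test);
  }

  public static void main(String[] args) {
    List<Integer> data = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
      data.add(i);
    }
    List<List<Integer>> trainTest = split(data, 0.2, 42L);
    System.out.println("train size = " + trainTest.get(0).size());
    System.out.println("test size = " + trainTest.get(1).size());
  }
}
```

    Because each item is assigned independently, the test set size is only approximately testFraction of the input; the two sets are always disjoint and together cover all of the new data.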
