M - type of message to read from the input topicpublic abstract class MLUpdate<M> extends Object implements BatchLayerUpdate<Object,M,String>
BatchLayerUpdate for machine learning-oriented
update processes. This implementation contains the framework for test/train split
for example, parameter optimization, and so on. Subclasses instead implement
methods like buildModel(JavaSparkContext,JavaRDD,List,Path) to create a PMML model and
evaluate(JavaSparkContext,PMML,Path,JavaRDD,JavaRDD) to evaluate a model from
held-out test data.| Modifier and Type | Field and Description |
|---|---|
static String |
MODEL_FILE_NAME |
| Modifier | Constructor and Description |
|---|---|
protected |
MLUpdate(com.typesafe.config.Config config) |
| Modifier and Type | Method and Description |
|---|---|
abstract org.dmg.pmml.PMML |
buildModel(org.apache.spark.api.java.JavaSparkContext sparkContext,
org.apache.spark.api.java.JavaRDD<M> trainData,
List<?> hyperParameters,
org.apache.hadoop.fs.Path candidatePath) |
boolean |
canPublishAdditionalModelData() |
abstract double |
evaluate(org.apache.spark.api.java.JavaSparkContext sparkContext,
org.dmg.pmml.PMML model,
org.apache.hadoop.fs.Path modelParentPath,
org.apache.spark.api.java.JavaRDD<M> testData,
org.apache.spark.api.java.JavaRDD<M> trainData) |
List<HyperParamValues<?>> |
getHyperParameterValues() |
protected double |
getTestFraction() |
void |
publishAdditionalModelData(org.apache.spark.api.java.JavaSparkContext sparkContext,
org.dmg.pmml.PMML pmml,
org.apache.spark.api.java.JavaRDD<M> newData,
org.apache.spark.api.java.JavaRDD<M> pastData,
org.apache.hadoop.fs.Path modelParentPath,
TopicProducer<String,String> modelUpdateTopic)
Optionally, publish additional model-related information to the update topic,
after the model has been written.
|
void |
runUpdate(org.apache.spark.api.java.JavaSparkContext sparkContext,
long timestamp,
org.apache.spark.api.java.JavaPairRDD<Object,M> newKeyMessageData,
org.apache.spark.api.java.JavaPairRDD<Object,M> pastKeyMessageData,
String modelDirString,
TopicProducer<String,String> modelUpdateTopic) |
protected Pair<org.apache.spark.api.java.JavaRDD<M>,org.apache.spark.api.java.JavaRDD<M>> |
splitNewDataToTrainTest(org.apache.spark.api.java.JavaRDD<M> newData)
Default implementation which randomly splits new data into train/test sets.
|
public static final String MODEL_FILE_NAME
protected final double getTestFraction()
public List<HyperParamValues<?>> getHyperParameterValues()
HyperParamValues per
hyperparameter. Different combinations of the values derived from the list will be
passed back into buildModel(JavaSparkContext,JavaRDD,List,Path)public abstract org.dmg.pmml.PMML buildModel(org.apache.spark.api.java.JavaSparkContext sparkContext,
org.apache.spark.api.java.JavaRDD<M> trainData,
List<?> hyperParameters,
org.apache.hadoop.fs.Path candidatePath)
sparkContext - active Spark ContexttrainData - training data on which to build a modelhyperParameters - ordered list of hyper parameter values to use in building modelcandidatePath - directory where additional model files can be writtenPMML representation of a model trained on the given datapublic boolean canPublishAdditionalModelData()
true iff additional updates must be published along with the model; if
publishAdditionalModelData(JavaSparkContext, PMML, JavaRDD, JavaRDD, Path, TopicProducer) must
be called. This is only applicable for special model types.public void publishAdditionalModelData(org.apache.spark.api.java.JavaSparkContext sparkContext,
org.dmg.pmml.PMML pmml,
org.apache.spark.api.java.JavaRDD<M> newData,
org.apache.spark.api.java.JavaRDD<M> pastData,
org.apache.hadoop.fs.Path modelParentPath,
TopicProducer<String,String> modelUpdateTopic)
sparkContext - active Spark Contextpmml - model for which extra data should be writtennewData - data that has arrived in current intervalpastData - all previously-known data (may be null)modelParentPath - directory containing model files, if applicablemodelUpdateTopic - message topic to write topublic abstract double evaluate(org.apache.spark.api.java.JavaSparkContext sparkContext,
org.dmg.pmml.PMML model,
org.apache.hadoop.fs.Path modelParentPath,
org.apache.spark.api.java.JavaRDD<M> testData,
org.apache.spark.api.java.JavaRDD<M> trainData)
sparkContext - active Spark Contextmodel - model to evaluatemodelParentPath - directory containing model files, if applicabletestData - data on which to test the model performancetrainData - data on which model was trained, which can also be useful in evaluating
unsupervised learning problemspublic void runUpdate(org.apache.spark.api.java.JavaSparkContext sparkContext,
long timestamp,
org.apache.spark.api.java.JavaPairRDD<Object,M> newKeyMessageData,
org.apache.spark.api.java.JavaPairRDD<Object,M> pastKeyMessageData,
String modelDirString,
TopicProducer<String,String> modelUpdateTopic)
throws IOException,
InterruptedException
runUpdate in interface BatchLayerUpdate<Object,M,String>sparkContext - Spark contexttimestamp - timestamp of current intervalnewKeyMessageData - data that has arrived in current intervalpastKeyMessageData - all previously-known data (may be null)modelDirString - String representation of path where models should be output, if desiredmodelUpdateTopic - topic to push models onto, if desired. Note that this may be null
if the application is configured to not produce updates to a topicIOException - if an error occurs during execution of the update functionInterruptedException - if the caller is interrupted waiting for parallel tasks
to completeprotected Pair<org.apache.spark.api.java.JavaRDD<M>,org.apache.spark.api.java.JavaRDD<M>> splitNewDataToTrainTest(org.apache.spark.api.java.JavaRDD<M> newData)
getTestFraction() is not 0 or 1.newData - data that has arrived in the current input batchPair of train, test RDDs.Copyright © 2014–2018. All rights reserved.