M
- type of message to read from the input topicpublic abstract class MLUpdate<M> extends Object implements BatchLayerUpdate<Object,M,String>
BatchLayerUpdate
for machine learning-oriented
update processes. This implementation contains the framework for test/train split
for example, parameter optimization, and so on. Subclasses instead implement
methods like buildModel(JavaSparkContext,JavaRDD,List,Path)
to create a PMML model and
evaluate(JavaSparkContext,PMML,Path,JavaRDD,JavaRDD)
to evaluate a model from
held-out test data.Modifier and Type | Field and Description |
---|---|
static String |
MODEL_FILE_NAME |
Modifier | Constructor and Description |
---|---|
protected |
MLUpdate(com.typesafe.config.Config config) |
Modifier and Type | Method and Description |
---|---|
abstract org.dmg.pmml.PMML |
buildModel(org.apache.spark.api.java.JavaSparkContext sparkContext,
org.apache.spark.api.java.JavaRDD<M> trainData,
List<?> hyperParameters,
org.apache.hadoop.fs.Path candidatePath) |
boolean |
canPublishAdditionalModelData() |
abstract double |
evaluate(org.apache.spark.api.java.JavaSparkContext sparkContext,
org.dmg.pmml.PMML model,
org.apache.hadoop.fs.Path modelParentPath,
org.apache.spark.api.java.JavaRDD<M> testData,
org.apache.spark.api.java.JavaRDD<M> trainData) |
List<HyperParamValues<?>> |
getHyperParameterValues() |
protected double |
getTestFraction() |
void |
publishAdditionalModelData(org.apache.spark.api.java.JavaSparkContext sparkContext,
org.dmg.pmml.PMML pmml,
org.apache.spark.api.java.JavaRDD<M> newData,
org.apache.spark.api.java.JavaRDD<M> pastData,
org.apache.hadoop.fs.Path modelParentPath,
TopicProducer<String,String> modelUpdateTopic)
Optionally, publish additional model-related information to the update topic,
after the model has been written.
|
void |
runUpdate(org.apache.spark.api.java.JavaSparkContext sparkContext,
long timestamp,
org.apache.spark.api.java.JavaPairRDD<Object,M> newKeyMessageData,
org.apache.spark.api.java.JavaPairRDD<Object,M> pastKeyMessageData,
String modelDirString,
TopicProducer<String,String> modelUpdateTopic) |
protected Pair<org.apache.spark.api.java.JavaRDD<M>,org.apache.spark.api.java.JavaRDD<M>> |
splitNewDataToTrainTest(org.apache.spark.api.java.JavaRDD<M> newData)
Default implementation which randomly splits new data into train/test sets.
|
public static final String MODEL_FILE_NAME
protected final double getTestFraction()
public List<HyperParamValues<?>> getHyperParameterValues()
HyperParamValues
per
hyperparameter. Different combinations of the values derived from the list will be
passed back into buildModel(JavaSparkContext,JavaRDD,List,Path)
public abstract org.dmg.pmml.PMML buildModel(org.apache.spark.api.java.JavaSparkContext sparkContext, org.apache.spark.api.java.JavaRDD<M> trainData, List<?> hyperParameters, org.apache.hadoop.fs.Path candidatePath)
sparkContext
- active Spark ContexttrainData
- training data on which to build a modelhyperParameters
- ordered list of hyper parameter values to use in building modelcandidatePath
- directory where additional model files can be writtenPMML
representation of a model trained on the given datapublic boolean canPublishAdditionalModelData()
true
iff additional updates must be published along with the model; if
publishAdditionalModelData(JavaSparkContext, PMML, JavaRDD, JavaRDD, Path, TopicProducer)
must
be called. This is only applicable for special model types.public void publishAdditionalModelData(org.apache.spark.api.java.JavaSparkContext sparkContext, org.dmg.pmml.PMML pmml, org.apache.spark.api.java.JavaRDD<M> newData, org.apache.spark.api.java.JavaRDD<M> pastData, org.apache.hadoop.fs.Path modelParentPath, TopicProducer<String,String> modelUpdateTopic)
sparkContext
- active Spark Contextpmml
- model for which extra data should be writtennewData
- data that has arrived in current intervalpastData
- all previously-known data (may be null
)modelParentPath
- directory containing model files, if applicablemodelUpdateTopic
- message topic to write topublic abstract double evaluate(org.apache.spark.api.java.JavaSparkContext sparkContext, org.dmg.pmml.PMML model, org.apache.hadoop.fs.Path modelParentPath, org.apache.spark.api.java.JavaRDD<M> testData, org.apache.spark.api.java.JavaRDD<M> trainData)
sparkContext
- active Spark Contextmodel
- model to evaluatemodelParentPath
- directory containing model files, if applicabletestData
- data on which to test the model performancetrainData
- data on which model was trained, which can also be useful in evaluating
unsupervised learning problemspublic void runUpdate(org.apache.spark.api.java.JavaSparkContext sparkContext, long timestamp, org.apache.spark.api.java.JavaPairRDD<Object,M> newKeyMessageData, org.apache.spark.api.java.JavaPairRDD<Object,M> pastKeyMessageData, String modelDirString, TopicProducer<String,String> modelUpdateTopic) throws IOException, InterruptedException
runUpdate
in interface BatchLayerUpdate<Object,M,String>
sparkContext
- Spark contexttimestamp
- timestamp of current intervalnewKeyMessageData
- data that has arrived in current intervalpastKeyMessageData
- all previously-known data (may be null
)modelDirString
- String representation of path where models should be output, if desiredmodelUpdateTopic
- topic to push models onto, if desired. Note that this may be null
if the application is configured to not produce updates to a topicIOException
- if an error occurs during execution of the update functionInterruptedException
- if the caller is interrupted waiting for parallel tasks
to completeprotected Pair<org.apache.spark.api.java.JavaRDD<M>,org.apache.spark.api.java.JavaRDD<M>> splitNewDataToTrainTest(org.apache.spark.api.java.JavaRDD<M> newData)
getTestFraction()
is not 0 or 1.newData
- data that has arrived in the current input batchPair
of train, test RDD
s.Copyright © 2014–2018. All rights reserved.