pyspark.ml.clustering.LDA

Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology:
“term” = “word”: an element of the vocabulary
“token”: instance of a term appearing in a document
“topic”: multinomial distribution over terms representing some concept
“document”: one piece of text, corresponding to one row in the input data
Blei, Ng, and Jordan. “Latent Dirichlet Allocation.” JMLR, 2003.
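The terminology above can be made concrete with a small sketch of LDA's generative view, in which each topic is a multinomial distribution over terms and each token is drawn by first picking a topic, then a term from that topic. This is illustrative only; the names (`topics`, `doc_topic_mix`, `generate_document`) are hypothetical and not part of the PySpark API.

```python
import random

random.seed(0)

vocab = ["spark", "cluster", "goal", "match"]  # the "terms" (words)

# Each "topic" is a multinomial distribution over the vocabulary.
topics = {
    "computing": [0.5, 0.4, 0.05, 0.05],
    "sports":    [0.05, 0.05, 0.5, 0.4],
}

def generate_document(doc_topic_mix, length):
    """Sample tokens: pick a topic per token, then a term from that topic."""
    names = list(doc_topic_mix.keys())
    weights = list(doc_topic_mix.values())
    tokens = []
    for _ in range(length):
        topic = random.choices(names, weights=weights)[0]
        term = random.choices(vocab, weights=topics[topic])[0]
        tokens.append(term)
    return tokens

# A "document" dominated by the "computing" topic.
doc = generate_document({"computing": 0.9, "sports": 0.1}, length=8)
```

Fitting LDA is the inverse problem: given only the documents, recover the topic distributions and each document's topic mixture.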
Input data (featuresCol): LDA is given a collection of documents as input data, via the featuresCol parameter. Each document is specified as a Vector of length vocabSize, where each entry is the count for the corresponding term (word) in the document. Feature transformers such as pyspark.ml.feature.Tokenizer and pyspark.ml.feature.CountVectorizer can be useful for converting text to word count vectors.
New in version 2.0.0.
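The expected input representation can be sketched without Spark: a document becomes a vector of length vocabSize whose i-th entry counts occurrences of the i-th vocabulary term, which is what CountVectorizer produces for featuresCol. The 4-term vocabulary below is a made-up example.

```python
from collections import Counter

# Hypothetical vocabulary; in Spark, CountVectorizer builds this from the corpus.
vocab = ["spark", "cluster", "goal", "match"]

def to_count_vector(tokens):
    """Turn a tokenized document into a term-count vector of length vocabSize,
    the shape LDA expects in each row of featuresCol."""
    counts = Counter(tokens)
    return [float(counts.get(term, 0)) for term in vocab]

vec = to_count_vector(["spark", "spark", "cluster"])  # -> [2.0, 1.0, 0.0, 0.0]
```

In a real pipeline, Tokenizer splits raw text into tokens and CountVectorizer maps them to such vectors (typically sparse), so this step rarely needs hand-written code.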
Examples
>>> from pyspark.ml.linalg import Vectors, SparseVector
>>> from pyspark.ml.clustering import LDA
>>> df = spark.createDataFrame([[1, Vectors.dense([0.0, 1.0])],
...                             [2, SparseVector(2, {0: 1.0})],], ["id", "features"])
>>> lda = LDA(k=2, seed=1, optimizer="em")
>>> lda.setMaxIter(10)
LDA...
>>> lda.getMaxIter()
10
>>> lda.clear(lda.maxIter)
>>> model = lda.fit(df)
>>> model.setSeed(1)
DistributedLDAModel...
>>> model.getTopicDistributionCol()
'topicDistribution'
>>> model.isDistributed()
True
>>> localModel = model.toLocal()
>>> localModel.isDistributed()
False
>>> model.vocabSize()
2
>>> model.describeTopics().show()
+-----+-----------+--------------------+
|topic|termIndices|         termWeights|
+-----+-----------+--------------------+
|    0|     [1, 0]|[0.50401530077160...|
|    1|     [0, 1]|[0.50401530077160...|
+-----+-----------+--------------------+
...
>>> model.topicsMatrix()
DenseMatrix(2, 2, [0.496, 0.504, 0.504, 0.496], 0)
>>> lda_path = temp_path + "/lda"
>>> lda.save(lda_path)
>>> sameLDA = LDA.load(lda_path)
>>> distributed_model_path = temp_path + "/lda_distributed_model"
>>> model.save(distributed_model_path)
>>> sameModel = DistributedLDAModel.load(distributed_model_path)
>>> local_model_path = temp_path + "/lda_local_model"
>>> localModel.save(local_model_path)
>>> sameLocalModel = LocalLDAModel.load(local_model_path)
>>> model.transform(df).take(1) == sameLocalModel.transform(df).take(1)
True
Methods
clear(param)
    Clears a param from the param map if it has been explicitly set.

copy([extra])
    Creates a copy of this instance with the same uid and some extra params.

explainParam(param)
    Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()
    Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap([extra])
    Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

fit(dataset[, params])
    Fits a model to the input dataset with optional parameters.

fitMultiple(dataset, paramMaps)
    Fits a model to the input dataset for each param map in paramMaps.

getCheckpointInterval()
    Gets the value of checkpointInterval or its default value.

getDocConcentration()
    Gets the value of docConcentration or its default value.

getFeaturesCol()
    Gets the value of featuresCol or its default value.

getK()
    Gets the value of k or its default value.

getKeepLastCheckpoint()
    Gets the value of keepLastCheckpoint or its default value.

getLearningDecay()
    Gets the value of learningDecay or its default value.

getLearningOffset()
    Gets the value of learningOffset or its default value.

getMaxIter()
    Gets the value of maxIter or its default value.

getOptimizeDocConcentration()
    Gets the value of optimizeDocConcentration or its default value.

getOptimizer()
    Gets the value of optimizer or its default value.

getOrDefault(param)
    Gets the value of a param in the user-supplied param map or its default value.

getParam(paramName)
    Gets a param by its name.

getSeed()
    Gets the value of seed or its default value.

getSubsamplingRate()
    Gets the value of subsamplingRate or its default value.

getTopicConcentration()
    Gets the value of topicConcentration or its default value.

getTopicDistributionCol()
    Gets the value of topicDistributionCol or its default value.

hasDefault(param)
    Checks whether a param has a default value.

hasParam(paramName)
    Tests whether this instance contains a param with a given (string) name.

isDefined(param)
    Checks whether a param is explicitly set by user or has a default value.

isSet(param)
    Checks whether a param is explicitly set by user.

load(path)
    Reads an ML instance from the input path, a shortcut of read().load(path).

read()
    Returns an MLReader instance for this class.

save(path)
    Save this ML instance to the given path, a shortcut of write().save(path).

set(param, value)
    Sets a parameter in the embedded param map.

setCheckpointInterval(value)
    Sets the value of checkpointInterval.

setDocConcentration(value)
    Sets the value of docConcentration.

setFeaturesCol(value)
    Sets the value of featuresCol.

setK(value)
    Sets the value of k.

setKeepLastCheckpoint(value)
    Sets the value of keepLastCheckpoint.

setLearningDecay(value)
    Sets the value of learningDecay.

setLearningOffset(value)
    Sets the value of learningOffset.

setMaxIter(value)
    Sets the value of maxIter.

setOptimizeDocConcentration(value)
    Sets the value of optimizeDocConcentration.

setOptimizer(value)
    Sets the value of optimizer.

setParams(self, *[, featuresCol, maxIter, …])
    Sets params for LDA.

setSeed(value)
    Sets the value of seed.

setSubsamplingRate(value)
    Sets the value of subsamplingRate.

setTopicConcentration(value)
    Sets the value of topicConcentration.

setTopicDistributionCol(value)
    Sets the value of topicDistributionCol.

write()
    Returns an MLWriter instance for this ML instance.
Attributes
params
    Returns all params ordered by name.
Methods Documentation
copy([extra])
    Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

    Parameters
        extra : dict, optional
            Extra parameters to copy to the new instance
    Returns
        JavaParams
            Copy of this instance

extractParamMap([extra])
    Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

    Parameters
        extra : dict, optional
            extra param values
    Returns
        dict
            merged param map
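The merge ordering used by extractParamMap (default param values < user-supplied values < extra) can be mimicked with plain dicts. This is an illustrative sketch only; the real Params classes store Param objects, not strings.

```python
# Hypothetical param values standing in for an estimator's param map.
defaults = {"maxIter": 20, "optimizer": "online"}   # default param values
user_supplied = {"maxIter": 10}                     # explicitly set by the user
extra = {"optimizer": "em"}                         # extra map passed at call time

# Later merges win, so precedence is: defaults < user-supplied < extra.
merged = {**defaults, **user_supplied, **extra}
# merged == {"maxIter": 10, "optimizer": "em"}
```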
fit(dataset[, params])
    Fits a model to the input dataset with optional parameters.

    New in version 1.3.0.

    Parameters
        dataset : pyspark.sql.DataFrame
            input dataset.
        params : dict or list or tuple, optional
            an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
    Returns
        Transformer or a list of Transformer
            fitted model(s)

fitMultiple(dataset, paramMaps)
    Fits a model to the input dataset for each param map in paramMaps.

    New in version 2.3.0.

    Parameters
        paramMaps : collections.abc.Sequence
            A Sequence of param maps.
    Returns
        _FitMultipleIterator
            A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
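Because the (index, model) pairs from fitMultiple may arrive out of order, callers typically collect results by index rather than by arrival order. The sketch below uses a hypothetical stand-in generator (fitting a real model requires a Spark session); only the consumption pattern is the point.

```python
param_maps = [{"k": 2}, {"k": 5}]

def fake_fit_multiple(dataset, param_maps):
    """Stand-in for estimator.fitMultiple: yields (index, model) pairs.
    Yields in reverse to mimic non-sequential completion order."""
    for i in reversed(range(len(param_maps))):
        yield i, {"fitted_with": param_maps[i]}

# Collect by index so out-of-order arrival does not scramble the results.
models = [None] * len(param_maps)
for index, model in fake_fit_multiple("dataset", param_maps):
    models[index] = model
```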
getOrDefault(param)
    Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
setDocConcentration(value)
    Sets the value of docConcentration.

    >>> algo = LDA().setDocConcentration([0.1, 0.2])
    >>> algo.getDocConcentration()
    [0.1..., 0.2...]

setK(value)
    Sets the value of k.

    >>> algo = LDA().setK(10)
    >>> algo.getK()
    10

setKeepLastCheckpoint(value)
    Sets the value of keepLastCheckpoint.

    >>> algo = LDA().setKeepLastCheckpoint(False)
    >>> algo.getKeepLastCheckpoint()
    False

setLearningDecay(value)
    Sets the value of learningDecay.

    >>> algo = LDA().setLearningDecay(0.1)
    >>> algo.getLearningDecay()
    0.1...

setLearningOffset(value)
    Sets the value of learningOffset.

    >>> algo = LDA().setLearningOffset(100)
    >>> algo.getLearningOffset()
    100.0

setOptimizeDocConcentration(value)
    Sets the value of optimizeDocConcentration.

    >>> algo = LDA().setOptimizeDocConcentration(True)
    >>> algo.getOptimizeDocConcentration()
    True

setOptimizer(value)
    Sets the value of optimizer. Currently only supports 'em' and 'online'.

    >>> algo = LDA().setOptimizer("em")
    >>> algo.getOptimizer()
    'em'

setSubsamplingRate(value)
    Sets the value of subsamplingRate.

    >>> algo = LDA().setSubsamplingRate(0.1)
    >>> algo.getSubsamplingRate()
    0.1...

setTopicConcentration(value)
    Sets the value of topicConcentration.

    >>> algo = LDA().setTopicConcentration(0.5)
    >>> algo.getTopicConcentration()
    0.5...

setTopicDistributionCol(value)
    Sets the value of topicDistributionCol.

    >>> algo = LDA().setTopicDistributionCol("topicDistributionCol")
    >>> algo.getTopicDistributionCol()
    'topicDistributionCol'
Attributes Documentation
params
    Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.