Package trlda :: Package models :: Class LDA

Class LDA

  object --+    
           |    
Distribution --+
               |
              LDA
Known Subclasses:

Abstract base class.

Instance Methods
 
do_e_step(...)
Alias for update_variables.
float
lower_bound(docs, num_documents=-1, inference_method='VI', max_iter=100, num_samples=1, burn_in=2)
Estimate lower bound, $\mathcal{L}(\boldsymbol{\lambda})$, for the given set of documents.
list
sample(self, num_documents, length)
Samples a specified number of documents from the model.
tuple
update_variables(docs, latents=None, inference_method='VI', max_iter=100, threshold=0.001, num_samples=1, burn_in=2)
Computes beliefs over topic assignments ($z_{di}$) for the given documents.

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __init__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Properties
  alpha
Controls Dirichlet prior over topic weights, $\theta_k$.
  eta
Controls Dirichlet prior over topics, $\beta_{ki}$.
  lambdas
Parameters governing beliefs over topics, $\beta_{ki}$.
  num_topics
Number of topics.
  num_words
Number of words.

Inherited from object: __class__

Method Details

lower_bound(docs, num_documents=-1, inference_method='VI', max_iter=100, num_samples=1, burn_in=2)

Estimate lower bound, $\mathcal{L}(\boldsymbol{\lambda})$, for the given set of documents.

Parameters:
  • docs (list) - a set of documents for which to perform inference
  • num_documents (int) - can be used to target a lower bound with a different number of documents
  • inference_method (str) - either 'VI' or 'GIBBS'
  • max_iter (int) - maximum number of belief updates in variational inference
  • num_samples (int) - number of samples used to estimate expected word/topic occurrences
  • burn_in (int) - number of sampling steps performed before starting to collect samples
Returns: float
estimate of the lower bound

sample(self, num_documents, length)

Samples a specified number of documents from the model.

Topics ($\boldsymbol{\beta}$) are first sampled from the current Dirichlet beliefs over topics. This is done only once per call to sample and all documents are sampled conditioned on these topics. The length of the documents is sampled from a Poisson distribution where the rate (average length) is given by length. Documents of length zero are possible.

Words are represented as tuples of a word ID and a word count. All generated word counts will be 1, but words can occur multiple times in a document, e.g., [(12, 1), (4, 1), (12, 1)].

Parameters:
  • num_documents (int) - number of documents to sample
  • length (int) - average length of the sampled documents
Returns: list
a list of documents, where each document is a list of tuples
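The sampling procedure described above can be sketched in plain NumPy. This is an illustration of the generative process, not the library's implementation. The function name sample_documents, the $K \times W$ layout assumed for lambdas, the scalar alpha, and drawing each document's topic weights from a symmetric Dirichlet are all assumptions made for the example.

```python
import numpy as np

def sample_documents(lambdas, alpha, num_documents, length, seed=0):
    """Hypothetical re-creation of the generative process described for sample()."""
    rng = np.random.default_rng(seed)
    num_topics, num_words = lambdas.shape  # assumed layout: topics x words

    # Topics are drawn once per call from the Dirichlet beliefs over topics.
    beta = np.vstack([rng.dirichlet(row) for row in lambdas])

    docs = []
    for _ in range(num_documents):
        # Assumption: per-document topic weights come from a symmetric Dirichlet(alpha).
        theta = rng.dirichlet(np.full(num_topics, alpha))
        # Document length is Poisson distributed, so it can be zero.
        n = rng.poisson(length)
        topics = rng.choice(num_topics, size=n, p=theta)
        # Every generated word is a (word ID, 1) tuple; IDs may repeat.
        doc = [(int(rng.choice(num_words, p=beta[k])), 1) for k in topics]
        docs.append(doc)
    return docs

docs = sample_documents(np.ones((3, 20)), alpha=0.5, num_documents=4, length=10)
```

Because the topics are drawn once outside the loop, all documents returned by a single call share the same topics, which matches the behavior described above.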

update_variables(docs, latents=None, inference_method='VI', max_iter=100, threshold=0.001, num_samples=1, burn_in=2)

Computes beliefs over topic assignments ($z_{di}$) for the given documents.

The beliefs can be estimated via mean-field variational inference ('VI') or collapsed Gibbs sampling ('GIBBS'). For $N$ documents, the method returns a tuple of a $K \times N$ matrix and a $W \times K$ matrix of sufficient statistics, where $K$ is the number of topics and $W$ is the number of words.

With variational inference, each column of the $K \times N$ matrix represents the Dirichlet beliefs over that document's distribution of topics ($\boldsymbol{\theta}$). With Gibbs sampling, each column is a sample of $\boldsymbol{\theta}$ conditioned on the sampled topic assignments $\mathbf{z}$. This matrix can be passed as latents to initialize the algorithm in a later call to update_variables.

The matrix of sufficient statistics gives the expected number of times each word occurs with each topic in the given set of documents.

Each document should be represented as a list of words, where each word is a tuple of a word ID and a word count.
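A minimal sketch of building a document in this format from a sequence of word IDs. The helper name to_word_counts is hypothetical. The sketch also assumes that counts greater than 1 are accepted, which the (word ID, word count) format suggests but which is not stated explicitly here.

```python
from collections import Counter

def to_word_counts(token_ids):
    """Collapse a sequence of word IDs into sorted (word ID, word count) tuples."""
    return sorted(Counter(token_ids).items())

# Word 12 occurs twice, word 4 once.
doc = to_word_counts([12, 4, 12])
docs = [doc]  # a set of documents is a list of such lists
```

If counts above 1 turn out not to be supported, the same document can instead be written with repeated entries of count 1, as in the output of sample.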

Parameters:
  • docs (list) - a set of documents for which to perform inference
  • latents (ndarray) - can be used to initialize beliefs over $\boldsymbol{\theta}$
  • inference_method (str) - either 'VI' or 'GIBBS'
  • max_iter (int) - maximum number of belief updates in variational inference
  • threshold (float) - if the average change in beliefs over $\boldsymbol{\theta}$ is smaller than this, stop iterations
  • num_samples (int) - number of samples used to estimate expected word/topic occurrences
  • burn_in (int) - number of MCMC updates performed before starting to collect samples
Returns: tuple
a tuple of beliefs over $\boldsymbol{\theta}$ and sufficient statistics
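For variational inference, one way to read the returned $K \times N$ matrix is to turn each column into expected topic proportions. The sketch below assumes that each column holds the parameters of a Dirichlet distribution, in which case the mean of $\boldsymbol{\theta}$ is the column divided by its sum. The function name expected_topic_proportions and the example values are hypothetical.

```python
import numpy as np

def expected_topic_proportions(beliefs):
    """Mean of a Dirichlet per column: parameters divided by their column sum."""
    beliefs = np.asarray(beliefs, dtype=float)
    return beliefs / beliefs.sum(axis=0, keepdims=True)

# Made-up beliefs for K = 3 topics and N = 2 documents.
beliefs = np.array([[2.0, 1.0],
                    [1.0, 1.0],
                    [1.0, 2.0]])
theta_mean = expected_topic_proportions(beliefs)
```

With 'GIBBS' the columns are described as samples of $\boldsymbol{\theta}$, so this normalization should not be needed in that case.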