trlda.models.LDA

lower_bound(docs, num_documents=-1, inference_method='VI', max_iter=100, num_samples=1, burn_in=2)


Estimate lower bound, $\mathcal{L}(\boldsymbol{\lambda})$, for the given set of documents.

Parameters:

docs (list) - a set of documents for which to perform inference
num_documents (int) - can be used to target a lower bound with a different number of documents
inference_method (str) - either 'VI' or 'GIBBS'
max_iter (int) - maximum number of belief updates in variational inference
num_samples (int) - number of samples used to estimate expected word/topic occurrences
burn_in (int) - number of sampling steps performed before starting to collect samples

Returns: float

estimate of the lower bound
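
As a usage sketch, the two inference methods can be compared on the same documents. The helper name compare_lower_bounds and its default values are hypothetical; only the lower_bound keyword arguments are taken from the signature above, and model is assumed to be an already constructed LDA instance.

```python
def compare_lower_bounds(model, docs, num_samples=10, burn_in=5):
    """Estimate the lower bound for docs with both inference methods.

    `model` is assumed to expose lower_bound() with the signature
    documented above; this helper itself is not part of trlda.
    """
    # Mean-field variational inference with at most 100 belief updates.
    vi = model.lower_bound(docs, inference_method='VI', max_iter=100)

    # Collapsed Gibbs sampling, discarding `burn_in` steps before
    # collecting `num_samples` samples.
    gibbs = model.lower_bound(
        docs,
        inference_method='GIBBS',
        num_samples=num_samples,
        burn_in=burn_in)

    return {'VI': vi, 'GIBBS': gibbs}
```

Both values are estimates, so small differences between the two methods are to be expected.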

sample(self, num_documents, length)


Samples a specified number of documents from the model.

Topics ($\boldsymbol{\beta}$) are first sampled from the current Dirichlet beliefs over topics. This is done once per call to sample, and all documents are then sampled conditioned on these topics. The length of each document is drawn from a Poisson distribution whose rate (the average length) is given by length, so documents of length zero are possible.

Words are represented as tuples of a word ID and a word count. All generated word counts will be 1, but words can occur multiple times in a document, e.g., [(12, 1), (4, 1), (12, 1)].

Parameters:

num_documents (int) - number of documents to sample
length (int) - average length of the sampled documents

Returns: list

a list of documents, where each document is a list of tuples
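
The generative process described above can be sketched in NumPy. This is an illustration, not trlda's implementation: the function name sample_documents, the $K \times W$ layout assumed for lambdas, and the choice to draw per-document topic weights from a Dirichlet with parameter alpha (as in standard LDA) are all assumptions.

```python
import numpy as np

def sample_documents(lambdas, alpha, num_documents, length, seed=None):
    """Illustrative LDA sampler (not trlda's code).

    lambdas: (K, W) Dirichlet parameters, one row per topic (assumed layout).
    alpha:   (K,) Dirichlet parameters over topic weights (assumed).
    """
    rng = np.random.default_rng(seed)
    lambdas = np.asarray(lambdas, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    num_topics, num_words = lambdas.shape

    # Topics are drawn once per call and shared by all documents.
    beta = np.stack([rng.dirichlet(row) for row in lambdas])

    docs = []
    for _ in range(num_documents):
        n = rng.poisson(length)       # document length; may be zero
        theta = rng.dirichlet(alpha)  # per-document topic weights
        topics = rng.choice(num_topics, size=n, p=theta)
        # One (word ID, 1) tuple per token, so IDs can repeat.
        doc = [(int(rng.choice(num_words, p=beta[k])), 1) for k in topics]
        docs.append(doc)
    return docs
```

Every word count in the result is 1, matching the output format described for sample.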

update_variables(docs, latents=None, inference_method='VI', max_iter=100, threshold=0.001, num_samples=1, burn_in=2)


Computes beliefs over topic assignments ($z_{di}$) for the given documents.

The beliefs may be estimated via mean-field variational inference ('VI') or collapsed Gibbs sampling ('GIBBS'). For $N$ documents, the method returns a tuple of a $K \times N$ matrix and a $W \times K$ matrix of sufficient statistics.

With variational inference, each column of the $K \times N$ matrix holds the parameters of the Dirichlet belief over a document's topic distribution ($\boldsymbol{\theta}$). With Gibbs sampling, each column is a sample of $\boldsymbol{\theta}$ conditioned on the sampled topic assignments $\mathbf{z}$. In either case, the $K \times N$ matrix can be passed as latents to initialize the algorithm in a later call to update_variables.

The matrix of sufficient statistics gives the expected number of times each word occurs with each topic in the given set of documents.

Each document should be represented as a list of words, where each word is a tuple of a word ID and a word count.
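
A minimal sketch of building a document in this format from a list of tokens. The helper to_bag_of_words and the vocabulary mapping are hypothetical and not part of trlda.

```python
from collections import Counter

def to_bag_of_words(tokens, vocabulary):
    """Convert tokens to a list of (word ID, word count) tuples.

    Tokens missing from `vocabulary` are skipped.
    """
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    return sorted(counts.items())

# Hypothetical vocabulary mapping words to integer IDs.
vocabulary = {"topic": 0, "model": 1, "word": 2}

doc = to_bag_of_words(["topic", "model", "topic", "unseen"], vocabulary)
# doc == [(0, 2), (1, 1)]
```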

Parameters:

docs (list) - a set of documents for which to perform inference
latents (ndarray) - can be used to initialize beliefs over $\boldsymbol{\theta}$
inference_method (str) - either 'VI' or 'GIBBS'
max_iter (int) - maximum number of belief updates in variational inference
threshold (float) - if the average change in beliefs over $\boldsymbol{\theta}$ is smaller than this, stop iterations
num_samples (int) - number of samples used to estimate expected word/topic occurrences
burn_in (int) - number of MCMC updates performed before starting to collect samples

Returns: tuple

a tuple of beliefs over $\boldsymbol{\theta}$ and sufficient statistics
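
Because the returned beliefs can be fed back in through latents, inference can be continued across calls. The sketch below assumes model is an already constructed LDA instance; the helper name infer_with_warm_start and its default values are hypothetical.

```python
def infer_with_warm_start(model, docs, rounds=3, max_iter=20):
    """Run update_variables several times, reusing beliefs as latents.

    `model` is assumed to expose update_variables() with the signature
    documented above; this helper itself is not part of trlda.
    """
    latents = None  # first call starts from the default initialization
    sufficient_stats = None
    for _ in range(rounds):
        # Each call resumes from the beliefs returned by the previous one.
        latents, sufficient_stats = model.update_variables(
            docs,
            latents=latents,
            inference_method='VI',
            max_iter=max_iter)
    return latents, sufficient_stats
```

Splitting inference into short rounds like this makes it possible to inspect intermediate beliefs between calls.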

Properties
	alpha - controls the Dirichlet prior over topic weights, $\theta_k$
	eta - controls the Dirichlet prior over topics, $\beta_{ki}$
	lambdas - parameters governing beliefs over topics, $\beta_{ki}$
	num_topics - number of topics
	num_words - number of words
Inherited from `object`: `__class__`
