Hidden Grammars for text document categorization
Amin Mantrach
IRIDIA
amantrac@ulb.ac.be

Abstract

Stochastic context-free grammars (SCFGs) are well studied by the computational linguistics community. We describe an extension of the SCFG concept to the case where the observation is a probabilistic function of the terminal. The resulting model, which we call a hidden stochastic context-free grammar by analogy with the hidden Markov model, is a doubly embedded stochastic process: the underlying stochastic process is not observable (it is hidden) and can only be observed through another set of stochastic processes that produce the sequence of observations. After introducing the concept, we present the three basic problems that must be solved for the model to be useful in real-world applications: i) What is the probability that a given string x is generated by a grammar G? ii) What is the single most likely parse for x? iii) How should the parameters of G (e.g., the rule probabilities and the symbol probabilities attached to each terminal) be chosen to maximize the probability of a training set? Finally, we present an application of the model to the classification of documents using both their content and their structure.
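
As an illustration of problem i) above, the following is a minimal sketch of an inside (CYK-style) computation adapted to the hidden setting. It is not taken from the talk: it assumes the grammar is in Chomsky normal form, and the names `inside_probability`, `binary_rules`, `lexical_rules`, and `emissions`, as well as the toy grammar, are hypothetical choices made only for this example. The only change with respect to a plain SCFG is the base case, where each terminal emits the observed symbol with some probability instead of matching it exactly.

```python
from collections import defaultdict


def inside_probability(obs, binary_rules, lexical_rules, emissions, start="S"):
    """Return P(obs | G) for a hidden SCFG assumed to be in Chomsky normal form.

    binary_rules:  {(A, B, C): p} for rules A -> B C
    lexical_rules: {(A, t): p}    for rules A -> t (t is a hidden terminal)
    emissions:     {t: {o: p}}    probability that terminal t emits observation o
    """
    n = len(obs)
    # chart[(i, j)][A] = probability that nonterminal A generates obs[i:j]
    chart = defaultdict(lambda: defaultdict(float))

    # Base case: sum over the hidden terminals that could have emitted obs[i].
    for i, o in enumerate(obs):
        for (a, t), p in lexical_rules.items():
            chart[(i, i + 1)][a] += p * emissions.get(t, {}).get(o, 0.0)

    # Recursive case: combine two adjacent sub-spans with a binary rule.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (a, b, c), p in binary_rules.items():
                    left = chart[(i, k)].get(b, 0.0)
                    right = chart[(k, j)].get(c, 0.0)
                    if left and right:
                        chart[(i, j)][a] += p * left * right

    return chart[(0, n)].get(start, 0.0)


# Hypothetical toy grammar: S -> S S | a | b, with noisy terminals a and b.
binary_rules = {("S", "S", "S"): 0.4}
lexical_rules = {("S", "a"): 0.3, ("S", "b"): 0.3}
emissions = {"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.2, "y": 0.8}}

p_xy = inside_probability(["x", "y"], binary_rules, lexical_rules, emissions)
```

Under the same assumptions, replacing the sums with maximizations (and storing back-pointers) would give a Viterbi-style answer to problem ii), and the chart quantities are the ones an inside-outside re-estimation procedure would use for problem iii).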

Keywords

hidden Markov models, context-free grammar, stochastic models, stochastic grammar, text categorization