Adaptive Information Disclosure (AID)

Participating in the VL-e project


Learning

AIDA includes several components that enable information extraction from text data. These components are referred to as learning tools. The large community working on information extraction has already produced numerous data sets and tools for working with them. To make use of these existing solutions, we incorporated several models trained on large corpora into the named entity recognition web service NERecognizerService. These models are provided by LingPipe \cite{lingp} and range from very general named entity recognition (detecting locations and person and organization names) to specific biomedical models created to recognize protein names and other bio-entities. The service supports several input/output options, so it can work with either plain text data or the output of the search engine Lucene. The latter scenario benefits a user who first wants to retrieve documents of interest and then zoom in on more specific pieces of text. Output can be presented as a list of named entities or as annotated sentences.
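The two output forms mentioned above can be illustrated with a minimal sketch. It assumes that recognized entities are available as character-offset spans with a label; the function names, the span representation and the inline tag format are hypothetical and are not the actual NERecognizerService or LingPipe output format.

```python
def entity_list(sentence, spans):
    """Return (surface text, label) pairs for each recognized entity.

    spans: list of (start, end, label) character offsets, end exclusive,
    assumed to be non-overlapping (an assumption of this sketch).
    """
    return [(sentence[start:end], label) for start, end, label in sorted(spans)]


def annotate(sentence, spans):
    """Return the sentence with each entity wrapped in an inline label tag."""
    pieces = []
    position = 0
    for start, end, label in sorted(spans):
        pieces.append(sentence[position:start])  # text before the entity
        pieces.append(f"<{label}>{sentence[start:end]}</{label}>")
        position = end
    pieces.append(sentence[position:])  # text after the last entity
    return "".join(pieces)
```

Both functions read the same span list, so a client could switch between the entity list and the annotated sentence without running recognition again.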

However, such solutions may not meet the needs of users who want to detect named entities in domains other than the biomedical domain. To address this problem, we offer the LearnModel web service, which produces a model from annotated text data. The model is based on contextual information and uses learning methods provided by the Weka \cite{witten} libraries. Once such a model is created, it can be used by the TestModel web service to annotate texts in the same domain. Splitting the process into two parts is useful from several perspectives. First of all, a user does not need to apply his own model in order to annotate texts (i.e. to use TestModel): given a large collection of already created models, the user can compare them based on their 10-fold cross-validation performance. Another attractive option for creating models is to use sequential models, such as conditional random fields (CRFs), which have gained increasing popularity in the past few years. Although hidden Markov models (HMMs) have often been used for labeling sequences, CRFs have an advantage over them: by defining a conditional probability distribution over label sequences given an observation sequence, they relax the independence assumptions that HMMs make. We used CRFs to detect named entities in several domains, such as acids of various lengths in the food informatics field and protein names in the biomedical field \cite{katr2}.
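The difference between the two sequence models can be made concrete with the standard textbook formulations. The notation is an assumption of this note and is not taken from the AIDA implementation: $x = (x_1, \dots, x_T)$ is the observation sequence, $y = (y_1, \dots, y_T)$ the label sequence, $f_k$ are feature functions and $\lambda_k$ their weights. An HMM models the joint distribution and factorizes it so that each observation depends only on its own label:
\[
p(x, y) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t).
\]
A linear-chain CRF models the conditional distribution directly:
\[
p(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right).
\]
Because each $f_k$ may inspect the whole observation sequence $x$, features such as neighboring words or capitalization patterns can overlap freely, without requiring the observations to be independent given the labels.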

Named entity recognition constitutes only one subtask in information extraction. Relation extraction can naturally be viewed as the next step after named entity recognition has been carried out \cite{katr1}. This task can be decomposed into the detection of named entities and the verification of a given relation among them. For example, given extracted protein names, it should be possible to infer whether there is any interaction between two proteins. This task is accomplished by the RelationLearner web service. It uses an annotated corpus of relations to induce a model, which can subsequently be applied to test data with already detected named entities. The RelationLearner focuses on the extraction of binary relations given the sentential context. Its output is a list of named entity pairs for which the given relation holds.
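The decomposition described above can be sketched as follows. Given one sentence with entities already detected, every pair of entities becomes a candidate for the binary relation, and the words between them serve as sentential context. This is only an illustration under stated assumptions: the function name, the token-span representation and the choice of context are hypothetical and do not describe how the RelationLearner is implemented. The step that decides whether the relation holds for a candidate is left out.

```python
from itertools import combinations


def candidate_pairs(tokens, entities):
    """Enumerate entity pairs in one sentence as binary-relation candidates.

    tokens:   list of word tokens of the sentence.
    entities: list of (start, end, label) token spans, end exclusive,
              assumed to be non-overlapping (an assumption of this sketch).
    """
    candidates = []
    # Sorting puts entities in sentence order, so arg1 always precedes arg2.
    for (s1, e1, l1), (s2, e2, l2) in combinations(sorted(entities), 2):
        candidates.append({
            "arg1": " ".join(tokens[s1:e1]),
            "arg1_label": l1,
            "arg2": " ".join(tokens[s2:e2]),
            "arg2_label": l2,
            "between": tokens[e1:s2],  # context words between the two entities
        })
    return candidates
```

A trained model would then label each candidate as holding or not holding the relation, and only the positive pairs would be returned.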

Another relevant area of information extraction is the detection of collocations (or n-grams in the broader sense). This functionality is provided by the CollocationService which, given a folder of text documents, outputs the n-grams of the desired frequency and length.
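A minimal sketch of this kind of n-gram extraction is given below. It assumes that "desired frequency" means a minimum occurrence count over the whole collection, that documents are plain `.txt` files, and that tokens are lowercased word characters. These choices and all function names are assumptions for illustration, not the behavior of the CollocationService.

```python
import re
from collections import Counter
from pathlib import Path


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def frequent_ngrams(texts, n, min_count):
    """Return the n-grams of length n occurring at least min_count times."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Slide a window of n tokens over the document.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


def read_folder(folder):
    """Read every .txt file in a folder (the file layout is an assumption)."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob("*.txt"))]
```

For example, `frequent_ngrams(read_folder("docs"), n=2, min_count=5)` would return the bigrams that occur at least five times in a hypothetical `docs` folder. N-grams are counted within each document and never span two files.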