Summary:
There are generally two ways to recognize entities in text articles when using Ontotext KIM. Example entities are people, organizations, and locations.
Both methods have their strengths and weaknesses. Things that cannot be detected by humans cannot be detected by computers either.
Using strict rules (better known as Knowledge Engineering)
These rules are written in a regular-expression language. In this case it is JAPE.
The more you customize the rules to detect what you need, the better results you get.
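To make the idea of strict rules concrete, here is a minimal sketch in Python. It is not JAPE and does not use the KIM or GATE APIs. The rule names, the patterns, and the `annotate` helper are all made up for illustration, and plain regular expressions stand in for a real rule language.

```python
import re

# Hypothetical rules: one regular expression per entity type.
# Real JAPE grammars match over annotations, not raw characters.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
RULES = {
    "Date": re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b"),
    "Money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "Organization": re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Inc|Ltd|Corp)\b"),
}

def annotate(text):
    """Return (entity_type, matched_text, start, end) tuples sorted by position."""
    found = []
    for entity_type, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group(0), match.start(), match.end()))
    return sorted(found, key=lambda annotation: annotation[2])

sample = "Acme Widgets Ltd paid $1,500 on 31 May 2009."
annotations = annotate(sample)
for annotation in annotations:
    print(annotation)
```

Every new case you want to catch (another company suffix, another date format) means another pattern or a longer one, which is where both the precision and the maintenance cost of this approach come from.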
Advantages:
- if you already have some rules available (e.g. for dates, money ...), it can be faster to create the new rules you need and get the job done
- in some cases a smaller sample corpus is required than when using machine learning
- in general, effectiveness is bound to the amount of effort put into producing better rules
Weaknesses:
- in practice the rules can become quite complicated and hard to maintain. Imagine a 20KB file that describes only one entity. Instead of reading the existing rules and modifying one of them, you end up adding the specific case that was missing at the end of the file, which increases both the length of the file and the total complexity of the rules. This is especially true when different people are modifying these rules.
Machine learning
In order to use machine learning you need a framework that implements several machine learning algorithms. You as an expert can define features which will be taken into consideration when the framework processes the example data:
- consider the length of the word
- consider the letter case (capitalization) of the word
- consider the letter case of the previous word
- consider prefixes and suffixes
The idea is not to set the exact rules, but rather to let the framework build them itself from the specific parts of the text you told it to pay attention to. Then you need to supply the machine-learning framework with enough example articles.
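The feature hints listed above can be sketched as a small Python function. This is only an illustration, not the MALLET or GATE API. The function name, the feature names, and the length threshold of 6 are assumptions. Features are kept as true/false values, which matches what this post says about MALLET.

```python
def token_features(tokens, i):
    """Build true/false features for tokens[i] from the hints listed above."""
    word = tokens[i]
    prev = tokens[i - 1] if i > 0 else ""
    return {
        "length>=6": len(word) >= 6,              # length of the word (threshold is arbitrary)
        "capitalized": word[:1].isupper(),        # letter case of the word
        "all_caps": word.isupper(),
        "prev_capitalized": prev[:1].isupper(),   # letter case of the previous word
        "prefix=" + word[:2].lower(): True,       # two-letter prefix
        "suffix=" + word[-2:].lower(): True,      # two-letter suffix
    }

tokens = "Yesterday Ontotext released KIM".split()
features = token_features(tokens, 1)
print(features)
```

Note that nothing here says what makes a word an organization. The function only points the learner at parts of the text, and the learning algorithm decides from the example articles which combinations of features matter.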
Advantages:
- it can give better results than strict rules
Often the effort needed to reach 80% effectiveness is about as much as the effort needed to get from 80% to 85%.
Weaknesses:
- needs parameter and algorithm testing (that's actually not such a problem, it just needs some work hours)
- needs more example articles than strict rules, by a factor of 10 (assumption)
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. Features in the case of MALLET are either true or false. MALLET provides several algorithms and each of them has its own configuration parameters. MALLET is implemented as a plug-in for KIM/GATE.
There is also a second machine learning framework called OpenNLP that is also implemented as a GATE plug-in in Ontotext KIM, and might soon be released as part of the standard KIM/GATE release.
Conclusions/final thoughts:
Both the rule-based and the machine learning approach are supported by the custom GATE pipeline for semantic annotation developed by Ontotext for the KIM platform.
In the short term it is better to use strict regular expression rules (like JAPE), as they give you results almost immediately, but in the long term (from both a complexity and an effectiveness point of view) it is definitely better to use a machine learning system like MALLET. Of course a combination of the two should work best, where rules are applied first (some of them negative) and then machine learning is applied.
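The combined approach can be sketched roughly as follows. This is a hypothetical Python illustration, not how the KIM pipeline is actually wired. The rule lists, the `classify` helper, and the `toy_model` stand-in for a trained classifier are all assumptions made for the example.

```python
import re

# Hypothetical rules deciding whether a candidate phrase is an organization.
NEGATIVE_RULES = [
    re.compile(r"^(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)$"),
]
POSITIVE_RULES = [
    re.compile(r"\b(?:Inc|Ltd|Corp)$"),
]

def classify(candidate, model):
    """Apply the rules first; ask the learned model only when no rule fires."""
    if any(rule.search(candidate) for rule in NEGATIVE_RULES):
        return False, "negative rule"
    if any(rule.search(candidate) for rule in POSITIVE_RULES):
        return True, "positive rule"
    return model(candidate), "model"

def toy_model(candidate):
    # Stand-in for a trained classifier: accepts multi-word capitalized phrases.
    words = candidate.split()
    return len(words) > 1 and all(word[:1].isupper() for word in words)

candidates = ["Sunday", "Acme Ltd", "Sofia University", "yesterday"]
results = {c: classify(c, toy_model) for c in candidates}
for candidate, decision in results.items():
    print(candidate, decision)
```

The negative rules cheaply filter out cases you already know are wrong, the positive rules handle the obvious cases, and the model is left with the ambiguous remainder.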
Another machine learning project is: edlin.sourceforge.net
This is a short introduction; contact Ontotext for more detailed information.
Sunday, May 31. 2009
Strict Rules vs Machine Learning - KIM part 6






