Summary
This short article is about the process of semantic annotation rather than about what semantic annotation is. If you want to learn more about semantic annotation, check here. The example shown is just one way of doing things; it is the one used in the Ontotext KIM platform. The article started as a way to show the importance of the "Instance URI".
Let's define an entity as something that is worth identifying. In the case of "named entities", an entity is, for example, a person or a location that appears in a text in a form such as "John Brown" or "Europe".
When a text contains an entity that we want to use, we usually obtain a URI that uniquely identifies it. We are going to call this URI the "Instance URI". The "Instance URI" lets us recognize when we are talking about the same thing while processing thousands of documents.
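To make the idea concrete, here is a minimal sketch of how a shared "Instance URI" lets us tie documents together. The document ids and the example.org URIs are invented for illustration; they are not produced by KIM.

```python
from collections import defaultdict

# Hypothetical annotation output: document id -> Instance URIs found in that document.
doc_annotations = {
    "doc1": ["http://example.org/instance/John_Brown", "http://example.org/instance/Europe"],
    "doc2": ["http://example.org/instance/John_Brown"],
    "doc3": ["http://example.org/instance/Europe"],
}

def index_by_uri(annotations):
    """Build an inverted index: Instance URI -> set of documents that mention it."""
    index = defaultdict(set)
    for doc_id, uris in annotations.items():
        for uri in uris:
            index[uri].add(doc_id)
    return index

index = index_by_uri(doc_annotations)
print(sorted(index["http://example.org/instance/John_Brown"]))  # ['doc1', 'doc2']
```

Because both documents carry the same URI, a simple lookup is enough to find every document that talks about the same person.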
Another useful clarification: a "Semantic Database" is software for storing RDF. It may or may not have a reasoner enabled, and it provides an API for SeRQL and/or SPARQL queries.
1. First, a Gazetteer is used to locate previously known named entities. A Gazetteer is a GATE processing resource that locates named entities by looking them up in a database. The database is usually a set of well-organized text files or RDF. If our entity is found in the Gazetteer, we are done, because the Gazetteer has enough information about the entity, including its "Instance URI".
2. Obviously, we cannot match everything against a database. So next we use rules (in this case GATE JAPE rules) to identify named entities. Once tuned, rules work fine in many cases. Machine learning can do better, but it has shortcomings of its own.
3. Next is the Orthomatcher, which is also a processing resource in GATE. The job of the Orthomatcher is complex: it tries to work out when "Mr. Brown" and "Brown" refer to the same thing, in this case the same person. The Orthomatcher finds the different representations of the same entity and marks the most informative one as the main one. If we have "Sun", "Sun Microsystems" and "Sun Microsystems Inc.", we can say that the last one is the most informative. Often the most informative representation is the longest one, as it is supposed to carry more information. So we now have the "most informative representation" of the entity, and we know where in the text the author meant the same thing (the same entity). We will call it the "MIR" (Most Informative Representation).
4. The Instance Generator is the GATE processing resource that generates the URI we are looking for.
- Default algorithms use the MIR to generate a unique URI, which is the one we call the "Instance URI".
- Then RDF is generated that describes the new entity using the "Instance URI".
- Then the RDF is sent to a semantic database. The semantic database does the job of adding or updating the RDF for this entity.
- In the end, the Instance Generator returns the "Instance URI" that was generated from the MIR to the user (you).
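The four steps above can be sketched in a few lines of plain Python. This is only a toy illustration of the flow, not the GATE or KIM API: the gazetteer dictionary, the regular expression standing in for JAPE rules, the prefix/suffix matching standing in for the Orthomatcher, the example.org namespace, and the set used as a "semantic database" are all assumptions made for the example.

```python
import re
from urllib.parse import quote

BASE = "http://example.org/instance/"  # hypothetical namespace
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
TITLES = {"mr", "mrs", "ms", "dr"}

# Step 1: toy gazetteer, surface form -> already known Instance URI.
GAZETTEER = {"Europe": BASE + "Europe"}

# Step 2: one toy rule (plain regex, not JAPE): a title plus one or two capitalized words.
PERSON_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?")

def tokens(mention):
    """Lowercased words without trailing dots and without titles."""
    words = [w.rstrip(".").lower() for w in mention.split()]
    return [w for w in words if w not in TITLES]

def same_entity(a, b):
    """Toy test: the shorter token list is a prefix or suffix of the longer one."""
    short, long_ = sorted((tokens(a), tokens(b)), key=len)
    return bool(short) and (long_[:len(short)] == short or long_[-len(short):] == short)

def group_mentions(mentions):
    """Step 3: cluster mentions and key each cluster by its longest member (the MIR)."""
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            if any(same_entity(mention, other) for other in cluster):
                if mention not in cluster:
                    cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return {max(cluster, key=len): cluster for cluster in clusters}

def generate_instance(store, mir):
    """Step 4: derive the URI from the MIR, store one describing triple, return the URI."""
    uri = BASE + quote("_".join(mir.split()), safe="._")
    label = mir.replace("\\", "\\\\").replace('"', '\\"')
    store.add(f'<{uri}> <{RDFS_LABEL}> "{label}" .')  # adding to a set is idempotent
    return uri

def annotate(text, store):
    """Return {mention: Instance URI} for every mention found in the text."""
    result = {}
    for surface, uri in GAZETTEER.items():            # step 1: known entities
        if re.search(r"\b" + re.escape(surface) + r"\b", text):
            result[surface] = uri
    mentions = [m.group(0) for m in PERSON_RULE.finditer(text)]  # step 2: rules
    for mir, cluster in group_mentions(mentions).items():        # step 3: MIR
        uri = generate_instance(store, mir)                      # step 4: URI + RDF
        for mention in cluster:
            result[mention] = uri
    return result

store = set()  # stand-in for the semantic database
uris = annotate("Mr. John Brown flew to Europe. Later Mr. Brown returned.", store)
for mention, uri in uris.items():
    print(mention, "->", uri)
```

In this run "Mr. John Brown" and "Mr. Brown" end up with the same URI, because the longer form is chosen as the MIR and the URI is derived from it, while "Europe" gets its URI straight from the gazetteer.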
Conclusion
It is important to note that every time we have the same MIR, the same URI is generated. This is the default and logical behavior for named entities.
Generating URIs is actually an interesting topic. For example, the URI may include the current timestamp. If the timestamp is precise to the second, then, because time only moves forward, every time we encounter this entity we get a different timestamp and therefore a different URI. In that case we cannot detect entity co-occurrence, because different URIs mean different entities in RDF. In some cases, though, having different instances for the same thing is exactly what we need.
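The difference between the two strategies is easy to see in a small sketch. The URI format and the example.org namespace below are assumptions for illustration only, and the timestamps are passed in by hand so that the example always behaves the same way.

```python
from urllib.parse import quote

BASE = "http://example.org/instance/"  # hypothetical namespace

def stable_uri(mir):
    """Default behavior: the URI depends only on the MIR."""
    return BASE + quote("_".join(mir.split()), safe="._")

def timestamped_uri(mir, timestamp):
    """Alternative: append a timestamp (seconds since the epoch) to the URI."""
    return f"{stable_uri(mir)}.{int(timestamp)}"

# The same entity seen in two documents processed one minute apart.
sightings = [("doc1", "John Brown", 1000), ("doc2", "John Brown", 1060)]

stable = {stable_uri(mir) for _, mir, _ in sightings}
stamped = {timestamped_uri(mir, ts) for _, mir, ts in sightings}

print(len(stable))   # 1: both documents point to the same entity
print(len(stamped))  # 2: RDF now sees two different entities
```

With the stable scheme both sightings collapse into one instance; with the timestamped scheme each sighting becomes its own instance, which is only useful when that is really the intent.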
Remarks:
OWLIM, which is classified as a semantic database, is a product developed by Ontotext.
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Gazetteers - KIM/GATE part 7
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0





