I am investigating whether I can run Linux and Ontotext KIM on the PlayStation 3. As far as I can see, I need the IBM Java VM.
It looks like Fedora 11 installs on the PS3 just fine. More here.
Officially, KIM supports only the Sun JVM. I have tried the KIM platform under the IBM Java VM on 32-bit Fedora 11 on an Intel machine. I downloaded the VM from the IBM web site; the archive name is "ibm-java-sdk-6.0-5.0-linux-i386.tgz". Using the IBM JVM also helped me fix a potential problem in my Relations framework in the development version of KIM 3.0. I ran into two problems:
1. There seems to be a bug caused by an incompatibility between older versions of com.thoughtworks.xstream.XStream and the IBM Java VM: http://jira.codehaus.org/browse/XSTR-379
The resolution is to replace XStream 1.2 with the current XStream 1.3.1.
We did that today in the main-stream KIM 3.0 dev version.
2. The line
KIM_OPTS="$KIM_OPTS -Xshare:off"
in KIM_control.sh prevents KIM from starting, because "-Xshare:off" is not a parameter supported by the IBM Java VM.
The change that allowed KIM to start is: KIM_OPTS="$KIM_OPTS"
I have modified the KIM bash scripts to resolve this problem.
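Instead of dropping the flag for every VM, a launcher could add it only where it is understood. The following is a minimal sketch in Java, assuming that the vendor string reported by the standard java.vm.vendor system property is enough to tell the VMs apart. The class and method names are hypothetical and not part of KIM, and the vendor test is an assumption, not the change made in the KIM scripts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of KIM: picks extra JVM options by VM vendor. */
final class JvmOptions {

    /** Returns the extra options to pass to a JVM from the given vendor. */
    static List<String> forVendor(String vmVendor) {
        List<String> opts = new ArrayList<>();
        // Assumption: any vendor string containing "IBM" means the IBM VM.
        boolean ibm = vmVendor != null
                && vmVendor.toUpperCase(Locale.ROOT).contains("IBM");
        if (!ibm) {
            // -Xshare:off is a HotSpot option; the IBM VM does not accept it.
            opts.add("-Xshare:off");
        }
        return opts;
    }

    public static void main(String[] args) {
        String vendor = System.getProperty("java.vm.vendor");
        System.out.println(vendor + " -> " + forVendor(vendor));
    }
}
```

Running the class on the target machine prints the vendor string of the VM in use, which is a quick way to see what a start script would have to test for.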
Fixing these two problems allows KIM to start normally with the IBM JVM, and it should also work with the PowerPC version of the IBM JVM (32-bit iSeries/pSeries). The archive for the PlayStation 3 should be "ibm-java-sdk-6.0-5.0-linux-ppc.tgz".
OpenJDK also starts KIM just fine and seems to work for now.
Also keep in mind that officially Ontotext KIM does not support non-Sun VMs.
Now the only missing piece is the PlayStation 3 itself.
Monday, September 7. 2009
Semantic annotation now looks even sexier
Posted by Anton Andreev in Techno-talk at 16:49
Last modified on 2009-10-17 10:23
Semantic Technology Specialist
Last week we had some presentations/lectures here at Ontotext from the University of Sheffield. We learned more about GATE and GATE Teamware. I am going to dedicate a whole post to GATE Teamware soon. It was interesting to see some of the internals of GATE and the new features in versions 5.0/5.1. The lectures were presented by Kalina Bontcheva, Senior Researcher in the Natural Language Processing Group, Department of Computer Science, University of Sheffield.
I also got my certificate as a semantic technology specialist from Semsphere. The course was good in my opinion.
Photos are coming.
Posted by Anton Andreev in Techno-talk at 11:07
Last modified on 2009-09-08 22:13
Thursday, September 3. 2009
KIM Multi-threaded Clustered Client Application - KIM part 9
Today we are going to talk about performance optimizations in the next version of KIM, which will be released by the end of this year. Its version number is 3.0 and it will most likely appear in October, but if needed the release will be postponed.
We are going to talk about both clustering (the use of more than one KIM server instance) and multi-threading. Threads are used to run the KIM semantic annotator, which returns annotated documents, in parallel.
One of the most important settings to remember is configured in \config\nerc.properties:
# Maximum number of annotation processes that can run at the same time.
# If set to more than 1, KIM will load multiple copies of the pipelines listed in the IE_APP parameter above
# during initialization. Multiple copies of the pipeline allow for parallel annotation of up to that number of documents
# Default: 1 (parallel annotation disabled)
com.ontotext.kim.semanticannotation.PARALLEL_NERCS=6
As you can see, with this new parameter in KIM 3.0 you get 6 instances of the pipeline, so that 6 documents can be processed (annotated) at the same time.
Now take a look at the KIMProcessor I have written. The code is here. Keep in mind that this code was created with a development build of KIM 3.0.
Threads are used to speed up the supply of documents to the KIM server. The problem is that this won't speed up your work much. If you supply KIM with too many documents and there is no free pipeline, then your documents will probably be queued and will only take up memory.
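One way to avoid piling up documents is to cap the number of documents in flight at the number of pipelines. Here is a minimal sketch in Java, assuming a hypothetical Annotator interface that stands in for the remote annotation call (it is not the KIM API) and a maxInFlight value that mirrors PARALLEL_NERCS. The thread pool alone would not be enough, because its internal work queue is unbounded; the semaphore is what makes the reading thread wait.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: never keep more documents in flight than there are pipelines. */
final class BoundedSubmitter {

    /** Stand-in for the remote annotation call; not the real KIM API. */
    interface Annotator {
        void annotate(String document) throws Exception;
    }

    /** Annotates all documents, at most maxInFlight at a time; returns how many succeeded. */
    static int annotateAll(List<String> documents, final Annotator annotator, int maxInFlight)
            throws InterruptedException {
        final Semaphore freePipelines = new Semaphore(maxInFlight);
        final AtomicInteger succeeded = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(maxInFlight);
        try {
            for (final String doc : documents) {
                // Blocks the supplying thread while every pipeline is busy.
                freePipelines.acquire();
                pool.execute(() -> {
                    try {
                        annotator.annotate(doc);
                        succeeded.incrementAndGet();
                    } catch (Exception e) {
                        System.err.println("annotation failed: " + e.getMessage());
                    } finally {
                        freePipelines.release();
                    }
                });
            }
        } finally {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
        return succeeded.get();
    }
}
```

With maxInFlight set to the pipeline count of the server, the client holds at most that many documents for sending, and the rest stay wherever they are read from.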
You may set:
com.ontotext.kim.semanticannotation.PARALLEL_NERCS=auto
and the number of pipelines will be equal to the number of processor cores reported by the OS (on Windows cmd: echo %NUMBER_OF_PROCESSORS%).
The threading functionality needs to be extended and would be useful in two cases:
1. When using the KIMProcessor with multiple KIM servers. You could set up, for example, 5 physical machines with 1 KIM server each. The machine running the KIMProcessor is the one that reads the documents (from PostgreSQL 8.4 in this example). If you are reading the documents from a single standard hard drive, you may need to supply the articles faster, because you now have 5 servers with, let's say, 6 pipelines each, which gives 30 pipelines. In this case threading is definitely useful. Of course, threads won't help once you reach the I/O limit of your hard drive.
2. If you have big documents, they are read more slowly and at the same time take longer to process. Supplying them with threads might again be too fast, as all the pipelines might be busy. A good case for threads is when you load documents from a web service and the documents are of normal news-article size (not too big).
Note that in the KIMProcessor all the articles are first loaded into memory and then supplied to the KIM server asynchronously. The right way to code this is to use some kind of asynchronous calls to the database and a synchronized blocking queue, so that a document is sent to the KIM server the moment it is read.
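The queue-based design described above could look roughly like this. It is a minimal sketch in Java built on the standard java.util.concurrent.ArrayBlockingQueue; the class name, the Consumer that stands in for "send to the KIM server" and the Iterable that stands in for the database cursor are all hypothetical, not code from the KIMProcessor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical sketch: stream documents to sender threads through a bounded queue. */
final class QueueFeeder {

    // Marker object telling a sender thread that no more documents will come.
    private static final String END = new String("<end-of-input>");

    /**
     * Reads documents on the calling thread and hands each one to one of
     * senderCount threads. Returns the number of documents sent successfully.
     */
    static int feed(Iterable<String> source, Consumer<String> sender,
                    int senderCount, int queueCapacity) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicInteger sent = new AtomicInteger();
        List<Thread> senders = new ArrayList<>();
        for (int i = 0; i < senderCount; i++) {
            Thread t = new Thread(() -> {
                try {
                    // Identity comparison on purpose: only the marker object ends the loop.
                    for (String doc = queue.take(); doc != END; doc = queue.take()) {
                        try {
                            sender.accept(doc);
                            sent.incrementAndGet();
                        } catch (RuntimeException e) {
                            System.err.println("send failed: " + e.getMessage());
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            senders.add(t);
        }
        for (String doc : source) {
            queue.put(doc); // blocks when the queue is full, so memory use stays bounded
        }
        for (int i = 0; i < senderCount; i++) {
            queue.put(END); // one marker per sender thread
        }
        for (Thread t : senders) {
            t.join();
        }
        return sent.get();
    }
}
```

The queue capacity is the only place where documents wait in memory, so a small capacity keeps the footprint low even when the source holds far more documents than the servers can take at once.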
Another cool feature to add to the KIMProcessor would be fail-safe support. When one of the servers is down, the documents that were sent to it should be pulled back into the queue, so that another node in the cluster can process them. An automatic check should also be done once in a while, so that the server can be used again when it is back on-line.
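A rough idea of how such fail-safe support might work, as a single-threaded Java sketch. The Node interface, its isAlive probe and the round-robin order are assumptions made for illustration; none of this exists in the KIMProcessor or the KIM API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of fail-over: failed documents go back to the queue. */
final class FailoverDispatcher {

    /** Stand-in for one KIM server in the cluster; not the real KIM API. */
    interface Node {
        boolean isAlive();                               // cheap health probe
        void annotate(String document) throws Exception; // may fail if the node is down
    }

    /** Sends every document to some live node; returns the documents left unprocessed. */
    static List<String> dispatch(List<String> documents, List<Node> nodes) {
        Deque<String> queue = new ArrayDeque<>(documents);
        Set<Node> down = new LinkedHashSet<>();
        int next = 0;
        int failuresInARow = 0;
        while (!queue.isEmpty()) {
            // Re-check nodes marked as down so they can rejoin when back on-line.
            down.removeIf(Node::isAlive);
            if (down.size() == nodes.size()) {
                break; // no node left to try
            }
            Node node = nodes.get(next++ % nodes.size());
            if (down.contains(node)) {
                continue;
            }
            String doc = queue.poll();
            try {
                node.annotate(doc);
                failuresInARow = 0;
            } catch (Exception e) {
                down.add(node);
                queue.addFirst(doc); // pull the document back for another node
                if (++failuresInARow >= nodes.size()) {
                    break; // every node failed in turn, give up for now
                }
            }
        }
        return new ArrayList<>(queue);
    }
}
```

A real client would run the health probe on a timer instead of before every document, and would dispatch from several threads; the sketch only shows the requeue-and-rejoin logic.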
The best solution is to implement KIM with Hadoop, but that will take time.
At Ontotext we have a different, tested application which is used to process more than 100 000 documents. The one provided here (the KIMProcessor) is maintained only by me for now.
Disclaimer:
Keep in mind that this post represents only my personal view of the topic. You should try different configurations and see how they work for you, or use our tested tool.
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Gazetteers - KIM/GATE part 7
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0
Posted by Anton Andreev in Techno-talk at 13:54
Last modified on 2009-11-04 15:34
Tuesday, August 4. 2009
.NET for the iPhone
MonoTouch is a platform for developing .NET applications for the iPhone. This time I may really consider buying an iPhone. The uncertainties around the Apple store and their software kit were preventing me from doing that. Not to mention that in my country it is at least three times more expensive than in the US. The iPhone might be cool, but Apple has some weird policies and developers are not as respected as on the Android platform.
Posted by Anton Andreev in Techno-talk at 19:56
Last modified on 2009-08-30 00:49
Wednesday, July 22. 2009
KIM Tips and Tricks - KIM PART 5
While working with Ontotext KIM you may run into some difficulties, or you may want to do some simple common tasks, so I have created this FAQ:
1. Problems starting KIM
KIM incorporates a semantic repository: storage and inference layer software called OWLIM (see below). It is not a relational database; we could call it a knowledge database, but it is still a database. It has locks and transactions, so when you forcibly shut down the database, you will get some complaints that something went wrong.
To avoid this you should always use startKIM.bat and stopKIM.bat to start/stop the KIM server and the KIM GATE UI (\bin\startKIMGate.bat). Problems are usually related to the lock file (or stack file) in the \context\default\populated folder, and you need to delete this file manually. This file tells OWLIM whether the previous instance of OWLIM shut down correctly. The populated folder is where all data and index files are stored. If you do not keep anything important in the database, you may delete all files in the \context\default\populated folder. This will force OWLIM to reload all nt and owl files and create a new cache, and thus fix some problems. It is something like a Windows reinstall. Keep in mind that after clearing your entire populated folder KIM will start much more slowly, because it has to rebuild everything.
2. I do not see my annotations
Suppose you have a JAPE file, you use it in the KIM GATE UI (\bin\startKIMGate.bat), and you do not see your annotation in the "Annotation Sets" in GATE. If you are sure that this annotation should be visible, the problem might be in the KIM GATE pipeline. The pipeline has two resources, "Instance Generator" and "Annotation Cleaner". The "Instance Generator" is a very important resource in KIM, and part of its job is to clear unrecognized and temporary annotations. The "Annotation Cleaner" does a similar job. To make the "Instance Generator" aware of your annotation you need to add it to a white list in \config\nerc.properties, in com.ontotext.kim.KIMConstants.IE_ANN_TYPES. The second solution is to temporarily disable these two GATE resources in the pipeline. This can be done when you click on the KIM application and then click a resource in "Selected processing resources".
The "Instance Generator" is the one that makes the connection between your annotation and the entities in the semantic repository (stored in the OWLIM database).
3. What is the difference between "inst" and "class"?
Generally the "inst" feature holds the URI of an entity. If the entity has been recognized by the semantic gazetteer (KIM Gazetteer), it is an existing entity in the semantic repository with a specific URI. If, on the other hand, it has been extracted with the help of some JAPE rules, for example, the entity does not exist yet, so only a "class" feature should be provided. Later, if the Instance Generator PR is a part of the pipeline, it will create a unique instance URI for this entity and put it in the semantic repository.
4. How to add a new OWL file?
1. You can use a tool in the \bin\tools called "toolRdfImport.bat(.sh)" which requires a folder as a parameter. The tool will import all files in the specified folder.
d:\kim\bin\tools>toolrdfimport c:\test\owl
Every time you add a new property you can just run this tool over the same file. Check the documentation here.
There is also another tool called "toolRdfUpdate.bat", which can make a diff between two files and then apply the changes - both additions and removals. With "toolRdfImport.bat" you can only add. Unfortunately "toolRdfUpdate.bat" is not that easy to use.
2. Usually all knowledge-base files reside in "\kim\context\default\kb" and OWL files are in the subfolder "owl". The second option is to edit the file \config\owlim.ttl. As the name suggests, this file configures the OWLIM database. You need to add a new line in both "imports" and "defaultNS". The line in "imports" points to your OWL file, and the line in "defaultNS" must be at the same row number as the one in "imports". Every line in "imports" has a corresponding line in "defaultNS", so both sections must have the same number of rows. Then you need to delete the contents of your \context\default\populated folder and restart KIM. Keep in mind that by deleting \context\default\populated you will lose all data loaded through the KIM GATE UI or "\bin\tools\toolConsolePopulate". You will also need to wait longer than usual for KIM to start, while the OWLIM indexes and cache are built again.
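The row-for-row correspondence between "imports" and "defaultNS" is easy to get wrong, so a small check before restarting KIM can save a long reload. Below is a minimal Java sketch, assuming the two sections have already been read into lists of strings; the class name and the example values mentioned afterwards are hypothetical, and the sketch does not parse owlim.ttl itself.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical check: every "imports" row needs a "defaultNS" row at the same position. */
final class ImportsCheck {

    /** Pairs each imported file with its namespace, or fails if the row counts differ. */
    static Map<String, String> pair(List<String> imports, List<String> defaultNs) {
        if (imports.size() != defaultNs.size()) {
            throw new IllegalArgumentException("imports has " + imports.size()
                    + " rows but defaultNS has " + defaultNs.size());
        }
        // LinkedHashMap keeps the rows in file order, which makes the pairing easy to review.
        Map<String, String> pairs = new LinkedHashMap<>();
        for (int i = 0; i < imports.size(); i++) {
            pairs.put(imports.get(i), defaultNs.get(i));
        }
        return pairs;
    }
}
```

For example, pairing a made-up file name such as "kb/owl/my.owl" with a made-up namespace such as "http://example.org/my#" returns a one-entry map, while passing a file list that is one row longer than the namespace list throws an IllegalArgumentException naming both counts.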
5. Troubleshooting the KIM server communication
Usually communication is done through RMI or Web Services.
The most important config properties when using RMI are:
5.1. Changing the Java Policy
The JVM that KIM Server uses must be configured so that it allows calls on the ports KIM Server exposes. This is done by editing permissions in:
(JRE) /$JRE/lib/security/java.policy
(JDK) /$JDK/jre/lib/security/java.policy
The following line in the file should be modified so it allows connecting to the ports specified from an outer machine:
// allows anyone to listen on un-privileged ports
to
// allows anyone to listen, connect or accept connections on un-privileged ports
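Only the comment lines of java.policy are shown above, not the permission entries under them. In the Sun Java 6 policy file the entry under the first comment is permission java.net.SocketPermission "localhost:1024-", "listen";. The wider entry used in the sketch below, "*:1024-" with the actions listen,connect,accept, is an assumption based on the wording of the second comment and is not quoted from the KIM documentation. The Java sketch uses the standard java.net.SocketPermission class to show what each entry covers; the class and method names are hypothetical, and the port is the one from the RMI warning in 5.2.

```java
import java.net.SocketPermission;

/** Illustrates what a policy entry covers; the permission strings are assumptions. */
final class PolicyCheck {

    /** True if an entry with the given target and actions covers the requested access. */
    static boolean allows(String grantedTarget, String grantedActions,
                          String requestedTarget, String requestedAction) {
        SocketPermission granted = new SocketPermission(grantedTarget, grantedActions);
        SocketPermission requested = new SocketPermission(requestedTarget, requestedAction);
        return granted.implies(requested);
    }

    public static void main(String[] args) {
        String client = "192.168.121.139:40695"; // example address and port, not a KIM default
        // Listen-only entry: accepting a connection from another machine is not covered.
        System.out.println("listen only: "
                + allows("localhost:1024-", "listen", client, "accept"));
        // Wider entry: any host, un-privileged ports, listen/connect/accept.
        System.out.println("wider entry: "
                + allows("*:1024-", "listen,connect,accept", client, "accept"));
    }
}
```

Note that the wider entry still leaves privileged ports (below 1024) outside the granted range, which matches the "un-privileged ports" wording of the comment.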
5.2. Setting up your external IP address
You get messages like these:
WARN: It appears that the remote KIM RMI server, expected at XXX.XXX.XXX.XXX reported the wrong endpoint socket: ChannelIfaceImpl_Stub[UnicastRef [liveRef: [endpoint:[127.0.0.1:40695](remote),objID:
WARN: The connection is likely to fail. If it does, alert the server administrators and attach this message.
This means that RMI does not know on which IP to listen for your incoming connections. This usually happens on machines with more than one IP address, and it depends on the RMI implementation of each JVM. You need to explicitly state the IP address on which you expect to find the KIM server in config/install.properties:
com.ontotext.kim.KIMConstants.RMI_HOST=192.168.121.139
This specifies a visible/outer IP for the server. Java RMI clients will try to find the KIM Server at that IP address, so it must be the same as the address used in GetService.from(IP, port) in the client code.
Usually on Linux systems with more than one IP address you need to set the address that you are going to use to access the KIM server in \config\install.properties.
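The 127.0.0.1 endpoint in the warning above is the typical symptom of a missing or loopback RMI_HOST value. A small sanity check could catch this before clients try to connect. The following Java sketch assumes the content of install.properties is available as text; the class and method names are hypothetical and not part of KIM, and only the property key is taken from the post.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.InetAddress;
import java.util.Properties;

/** Hypothetical sanity check for the RMI_HOST setting; not part of KIM. */
final class RmiHostCheck {

    static final String KEY = "com.ontotext.kim.KIMConstants.RMI_HOST";

    /** Returns a warning for the given properties text, or null if the setting looks usable. */
    static String check(String propertiesText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        String host = props.getProperty(KEY);
        if (host == null || host.trim().isEmpty()) {
            return KEY + " is not set";
        }
        // For a literal IP address this does not perform a DNS lookup.
        InetAddress address = InetAddress.getByName(host.trim());
        if (address.isLoopbackAddress() || address.isAnyLocalAddress()) {
            return KEY + "=" + host.trim() + " is not reachable from other machines";
        }
        return null;
    }
}
```

The check only looks at the value itself; it cannot tell whether the address really belongs to the server or whether it matches the one passed to GetService.from(IP, port) on the client side.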
The next version KIM 3.0 will provide services through JMS. You can find a .NET demo utilizing the Web Services here.
6. How to develop Jape rules easier
A JAPE Eclipse plug-in with code completion and syntax highlighting is definitely a good idea, but it is not available!
Each JAPE rule is converted to pure Java code.
If you make a mistake in the Java part of your JAPE rule (a rule does not have to contain Java code), you will get an exception and you will see the generated Java code. You can copy and paste this code into Eclipse and complete it comfortably. You may need to cause an exception on purpose at the very beginning to get that code. When done, you can paste the needed part back into the Java part of your JAPE rule. Once again: JAPE rules may contain Java code, thanks to the GATE API, or they may not. Let's illustrate what I mean:
Phase: Name
Input: Token Lookup TempDate
Options: control = appelt
Rule: EarlyDate
// early in 2002
// in early 2002
(
  ({Token.string == early}|
   {Token.string == late}
  )
  ({Token.string == "in"}
  )?
  ({TempDate}|
   (
    {Lookup.majorType == time_modifier}
    {Lookup.majorType == date_unit}
   )
  )
)
:date
-->
{
  //removes TempDate annotation, gets the rule feature and adds a new TempDate annotation
  gate.AnnotationSet date = (gate.AnnotationSet)bindings.get("date");
  gate.Annotation dateAnn = (gate.Annotation)date.iterator().next();
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("rule", dateAnn.getFeatures().get("rule"));
  features.put("rule2", "EarlyDate");
  annotations.add(date.firstNode(), date.lastNode(), "TempDate",
                  features);
  annotations.removeAll(date);
}
And here is the Java code I got by deliberately causing a compiler exception with some random symbols (removed here):
package japeactionclasses;
import java.io.*;
import java.util.*;
import gate.*;
import gate.jape.*;
import gate.creole.ontology.Ontology;
import gate.annotation.*;
import gate.util.*;
public class NameEarlyDateActionClass314
    implements java.io.Serializable, RhsAction {
  public void doit(Document doc, java.util.Map bindings,
                   AnnotationSet annotations,
                   AnnotationSet inputAS, AnnotationSet outputAS,
                   Ontology ontology) {
    {//removes TempDate annotation, gets the rule feature and adds a new TempDate annotation
      gate.AnnotationSet date = (gate.AnnotationSet)bindings.get("date");
      gate.Annotation dateAnn = (gate.Annotation)date.iterator().next();
      gate.FeatureMap features = Factory.newFeatureMap();
      features.put("rule", dateAnn.getFeatures().get("rule"));
      features.put("rule2", "EarlyDate");
      annotations.add(date.firstNode(), date.lastNode(), "TempDate",
                      features);
      annotations.removeAll(date);
    }
  }
}
This Java code sample should compile in Eclipse without any problem (after you add the GATE libraries). You can download these two files from: http://code.google.com/p/kimnetdemos/source/browse/#svn/trunk/Jape-template. Also keep in mind that you do not control the import section and thus you should use non-GATE classes with their fully qualified names.
OWLIM: It not only stores knowledge, but also expands it based on the knowledge already stored. OWLIM also stores the rules used to infer new knowledge from the old. OWLIM strives to be the fastest storage and inference layer database on the market, and it will back up its claims if you download it for a test drive.
I will be adding more tips while I dig deeper into the Ontotext KIM platform and Sheffield GATE.
This FAQ is not officially provided/approved/supported by Ontotext or the GATE team.
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Gazetteers - KIM/GATE part 7
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0
Posted by Anton Andreev
in Techno-talk
at
16:22
| Comments (0)
| Trackbacks (0)
Last modified on 2009-11-17 11:22
Monday, June 15. 2009
Text-based Query Interfaces to the Semantic Web
A list of text-based query interfaces to the Semantic Web:
This is a brief list of software projects and articles related to transforming a human-typed query into languages that computers can process.
1. QuestIO
QuestIO white paper by Danica Damljanovic, Valentin Tablan, Kalina Bontcheva (University of Sheffield)
It contains a short description of the projects covered in this post.
A good video presentation of QuestIO by Valentin Tablan
2. Orakel (Cimiano et al., 2007)
ORAKEL home page (University of Karlsruhe)
Introduction (google html version, scroll down a bit)
The project seems to be developed in Java.
3. Librarian (Serge Linckels, 2007)
Home page
4. Querix (Kaufmann et al., 2006)
Introduction by Esther Kaufmann, Abraham Bernstein, and Renato Zumstein (University of Zurich)
5. Aqualog: A portable question-answering system (Knowledge Media Institute, The Open University)
Introduction by Vanessa Lopez, Michele Pasin, and Enrico Motta
6. SemSearch: A Search Engine for the Semantic Web (Knowledge Media Institute, The Open University)
Introduction by Yuangui Lei, Victoria Uren, and Enrico Motta
An interesting search engine: libra.msra.cn
It gives access to many academic resources. It looks good, check out the about page.
A list of commercial implementations:
True Knowledge
You can install a Firefox plugin to enhance your search experience and ask questions like: "What is the distance between the earth and the moon?".
Posted by Anton Andreev
in Techno-talk
at
10:40
| Comment (1)
| Trackbacks (0)
Last modified on 2009-06-16 14:59
My interests list as of June 2009
Semantic Web:
- text based query interface for the semantic web (Think ask.com). Check my post on the topic.
Bio-devices:
- BCI (Brain Computer Interface). We have an EEG(bg)(en) which we do not use (wanna buy it?).
- Artificial Neuron in Medicine (the one in Mathematics is more famous). Check out this article.
Green energy:
- Producing energy from warm air convection or from Atmospheric Pressure Differences over Geographically-Spaced Sites. You can check this youtube video for a cheap simple way to produce energy from hot air. That's what I want to build, it seems that it really works even in cloudy and rainy days. And another video.
- A better kind of photovoltaics is now available, called Focused Photovoltaics.
Posted by Anton Andreev
in Techno-talk
at
10:12
| Comment (1)
| Trackbacks (0)
Last modified on 2009-06-26 14:57
Thursday, June 4. 2009
.NET tools for the Semantic Web
I have found two projects:
The first project is SemWeb (Semantic Web/RDF Library) for C#/.NET, developed by Joshua Tauberer. It can read and write RDF and run queries. It is written in C# and the full source is provided under a license. A persistent store such as SQLite, SQL Server, or MySQL can also be plugged in. That sounds good, until you read that Joshua is taking a break from the project and won't be supporting SemWeb that much, which is discouraging and does not make software companies comfortable adopting this technology.
The second project is LINQtoRDF, which is a great idea: an ontology is treated like an SQL schema. A tool called RDFMetal is provided, much like SQLMetal, so that one can represent an ontology in the form of C# classes. This is fantastic because it lets you easily write type-checked semantic queries via LINQ. It would also be great to use the TPL (Task Parallel Library) to speed up queries; it could be called PLinqToRDF, like the PLINQ project.
Now let's talk about the bad stuff:
- I could not use RDFMetal against the DBpedia endpoint http://DBpedia.org/sparql the way it is supposed to work (so far).
- The Google mailing list seems dead and my posts are not being published (so far).
- LinqToRdf uses the Semantic Web/RDF Library for C#/.NET, so that library's uncertain future affects LinqToRdf directly.
So it seems that the Java world has a definite advantage here. .NET has the potential to create something even better than what is available in Java, but this needs more work and, most of all, support.
What an open source project needs most in order to be successful:
- working code examples. Demos in html or pdf are often hard to reproduce. Examples should be part of the release process, so one can be sure they really work.
Posted by Anton Andreev
in Techno-talk
at
12:13
| Comments (0)
| Trackbacks (0)
Last modified on 2009-06-05 18:32
Sunday, May 31. 2009
Strict Rules vs Machine Learning - KIM part 6
Summary:
There are generally two ways to recognize entities in text articles when using Ontotext KIM. Example entities are people, organizations, and locations.
Both methods have their strengths and weaknesses. Things that cannot be detected by humans cannot be detected by computers either.
Using strict rules (better known as Knowledge Engineering)
These rules are written in a language similar to regular expressions. In this case it is Jape.
The more you customize the rules to detect what you need, the better results you get.
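To make the idea of a strict rule concrete, here is a minimal sketch in plain Java that uses java.util.regex instead of Jape. The class name, the pattern, and the list of clue suffixes are assumptions made for this illustration; a real Jape grammar matches over GATE annotations rather than raw characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of KIM or GATE.
class CompanySuffixRule {

    // One or more capitalized words followed by a clue suffix such as "Inc.".
    // The suffix list is an assumption chosen for this sketch.
    private static final Pattern COMPANY =
        Pattern.compile("\\b(?:[A-Z][A-Za-z0-9]*\\s)+(?:Inc\\.|Ltd\\.|Corp\\.)");

    // Returns every span of text that satisfies the rule, in order of appearance.
    static List<String> findCompanies(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = COMPANY.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findCompanies("He joined MicroZoftRR Inc. last year."));
    }
}
```

Every new case (a new suffix, lower-case company names, abbreviations) means touching the pattern again, which is exactly how rule files grow over time.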
Advantages:
- if you already have some rules available (e.g. for dates, money ...), it might be faster to create the new rules you need and get the job done quickly
- a smaller sample corpus might be required in some cases than when using machine learning
- in general, effectiveness depends on the amount of effort put into producing better rules
Weaknesses:
- in practice the rules might become quite complicated and hard to maintain. Imagine a 20KB file that describes only one entity. You end up not reading the previous rules and modifying one of them, but rather adding the specific case that was missing at the end of the file, thus increasing the total length of the file and the total complexity of the rules. This is especially true when different people are modifying these rules.
Machine learning
In order to use machine learning you need a framework that implements several machine learning algorithms. You as an expert can define features which will be taken into consideration when the framework is processing the example data:
- consider the length of the word
- consider the case-sensitivity
- consider the case-sensitivity of the previous word
- consider prefixes and suffixes
The idea is not to set the exact rules, but rather make the framework build them itself from specific parts of the text you told the framework to pay attention to. Then you need to supply the machine-learning framework with enough test articles.
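The kinds of features listed above can be sketched as a small feature extractor. This is only an illustration in plain Java: the class name, the feature names, and the length threshold are invented here and do not come from MALLET, GATE, or KIM. The features are kept as true/false values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical feature extractor; all feature names are invented for this sketch.
class TokenFeatures {

    // Computes boolean features for the token at position i.
    // Assumes every token is a non-empty string.
    static Map<String, Boolean> extract(String[] tokens, int i) {
        String word = tokens[i];
        Map<String, Boolean> features = new LinkedHashMap<>();
        // length of the word (the threshold of 6 is arbitrary)
        features.put("isLong", word.length() > 6);
        // capitalization of the word itself
        features.put("startsUpper", Character.isUpperCase(word.charAt(0)));
        // capitalization of the previous word, if there is one
        features.put("prevStartsUpper",
            i > 0 && Character.isUpperCase(tokens[i - 1].charAt(0)));
        // one example suffix
        features.put("suffixInc", word.endsWith("Inc."));
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"bought", "MicroZoftRR", "Inc."};
        System.out.println(extract(tokens, 1));
        System.out.println(extract(tokens, 2));
    }
}
```

Note that nothing here says what a company is; the learning algorithm is expected to work out which combinations of these features matter from the example articles.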
Advantages:
- it can give better results than strict rules
Often the effort needed to reach 80% effectiveness is about the same as the effort needed to go from 80% to 85%.
Weaknesses:
- needs parameter and algorithm testing (that's actually not such a problem, it just needs some work hours)
- needs about 10 times more example articles (my assumption) than strict rules
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. Features in the case of MALLET are either true or false. MALLET provides several algorithms and each of them has its own configuration parameters. MALLET is available as a plug-in for KIM/GATE.
There is also a second machine learning framework called openNLP that is implemented as a GATE plug-in in Ontotext KIM, and might soon be released as part of the standard KIM/GATE release.
Conclusions/final thoughts:
Both the rule-based and the machine learning approach are supported by the custom GATE pipeline for semantic annotation developed by Ontotext for the KIM platform.
In the short term it is better to use strict regular-expression-like rules (such as Jape), as they give you results almost immediately, but in the long term (from both a complexity and an effectiveness point of view) it is definitely better to use a machine learning system like MALLET. Of course, a combination of the two should work best, where rules are applied first (some of them negative) and then machine learning is applied.
Another machine learning project is: edlin.sourceforge.net
This is a short introduction; contact Ontotext for more detailed information.
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Gazetteers - KIM/GATE part 7
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0
Posted by Anton Andreev
in Techno-talk
at
10:32
| Comments (0)
| Trackbacks (0)
Last modified on 2009-11-26 14:34
Thursday, May 28. 2009
Using a Gate application - KIM part 4
Summary:
This short article shows you how to integrate a GATE module (application) into Ontotext KIM and consume it from your own Java product.
1. Configuration:
KIM provides an API through RMI on default port 1099. This page provides everything you need to configure RMI and KIM.
Eclipse->Build Path->Add external libraries:
kim-api.jar
sesame-1.2.7-ONTO.jar
2. It is recommended that you use GATE provided by the KIM distribution. Use "startKIMGate.bat" in \kim-platform-2.4-SNAPSHOT\bin.
You need to create a "Conditional Corpus Pipeline" application in GATE so that KIM can use it successfully. ANNIE is not that type of GATE application, so you will get a type mismatch if you use ANNIE or a modified version of it. The trick is to create a new "Conditional Corpus Pipeline" application and add all of ANNIE's processing resources, plus your own, to the newly created application. Then you need to make sure these resources are in the same order as they were in the ANNIE application! This problem has been fixed in version 3.0 and above, so you can now use ANNIE or a modified version of it from KIM.
3. Save your application to \kim-platform-2.4-SNAPSHOT\context\default\resources\mycondapp.gapp. To do that: right-click on a GATE application and select "Save application state".
4. Edit the file \kim-platform-2.4-SNAPSHOT\config\nerc.properties and modify the line:
com.ontotext.kim.KIMConstants.IE_APP=IE.gapp
to
com.ontotext.kim.KIMConstants.IE_APP=IE.gapp,mycondapp.gapp
Applications are separated by commas.
5. Executing our GATE application from KIM:
import com.ontotext.*;
import com.ontotext.kim.client.GetService;
import com.ontotext.kim.client.KIMService;
import com.ontotext.kim.client.semanticannotation.SemanticAnnotationAPI;

public class KIM {
    public static final String RMI_HOST = "localhost"; // not used
    public static final int RMI_PORT = 1099; // not used

    public static void main(String[] args) {
        try {
            KIMService serviceKim = GetService.from();
            System.out.println("KIM Platform : " + serviceKim.getPlatformVersion());
            System.out.println("KIM Server : " + serviceKim.getServerVersion());
            System.out.println("KB Version : " + serviceKim.getKBVersion());
            // obtain the SemanticAnnotationAPI component for our GATE application
            SemanticAnnotationAPI apiSemAnn1 = serviceKim.getSemanticAnnotationAPI("mycondapp.gapp");
            String content =
                "Blair and Bush ? are they doing the right thing for Iraq, America," +
                " Europe, the Earth... for civilization... " +
                "or just guided by their blinded eyes are in favor of the big coporations:" +
                "enter here new unrecognized corporations with a clue suffix:" +
                "MicroZoftRR Inc.";
            apiSemAnn1.execute(content);
        } catch (Exception ex) {
            System.out.println(ex.getMessage());
        }
        System.out.println("Done!");
    }
}
You can download this working sample from here.
Software versions: KIM 2.4, GATE 4.0 (integrated with KIM), Eclipse 3.2
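Step 4 above registers applications as a comma-separated value in nerc.properties. The sketch below shows how such a value can be read with java.util.Properties. It is only an illustration: the class name is invented, and how KIM itself parses this property (for example, whether it trims whitespace around names) is an assumption, not something taken from the KIM sources.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of the KIM API.
class IeAppList {

    static final String KEY = "com.ontotext.kim.KIMConstants.IE_APP";

    // Parses properties text and returns the application names listed under KEY.
    static List<String> parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> apps = new ArrayList<>();
        for (String name : props.getProperty(KEY, "").split(",")) {
            String trimmed = name.trim(); // trimming is an assumption of this sketch
            if (!trimmed.isEmpty()) {
                apps.add(trimmed);
            }
        }
        return apps;
    }

    public static void main(String[] args) {
        System.out.println(parse(KEY + "=IE.gapp,mycondapp.gapp"));
    }
}
```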
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0
Posted by Anton Andreev
in Techno-talk
at
21:46
| Comments (2)
| Trackbacks (0)
Last modified on 2009-11-17 13:47
Wednesday, May 20. 2009
Nexenta vs Debian GNU / kFreeBSD
What is an OS:
- some kernel
- libc - a layer between the kernel and the user programs
- packages
Nexenta uses a Solaris kernel and Debian/Ubuntu packages. Debian GNU / kFreeBSD uses the FreeBSD kernel, but on the surface it should not be much different from a standard Debian edition. Nexenta uses the Solaris libc, while Debian GNU / kFreeBSD uses a port of GNU libc (glibc) to the FreeBSD kernel. Linux is only a kernel, not a fully operational OS. If we look at the Debian policy we will see that using another kernel is not a problem; what matters is the use of GNU libc. Unfortunately there is no port of GNU libc to Solaris. Nevertheless Nexenta is doing fine, although porting packages to Nexenta is probably harder than porting packages to Debian GNU / kFreeBSD because of the missing GNU libc on Solaris. Both operating systems already have a considerable number of packages.
Update: In build 107 of OpenSolaris the Solaris libc has been released with some compatibility functions for Linux/BSD, which will help with porting Debian packages to Nexenta.
My opinion is that these hybrids are quite welcome, as I have doubts about the quality of the Linux kernel.
FreeBSD is a pure OS, meaning that the FreeBSD community produces its own kernel, libc and some of the programs that come with the OS. This makes everything in the OS much more consistent.
As of 20.05.2009:
1. Debian GNU / kFreeBSD is an official Debian port, which means that all packages and translations should be available and tested in the next Debian release - squeeze, at least I think so.
2. Luca Favatella is working on a port of the Debian installer to Debian GNU / kFreeBSD. You can track his status here.
3. Gnome and xfce4 are available on both ... but do not expect things to work right out of the box.
4. StormOS is a distro based on Nexenta that provides an xfce4 desktop.
Install Sun JDK on Nexenta or StormOS:
I am using Nexenta Core 2 RC3; Nexenta Core 2.0 has been released.
Issue the following commands:
#apt-get update
#apt-get install sun-java6-jdk
#apt-get install sunwlibc
Now you should have the jdk in /usr/lib/jvm
Install Tomcat
You need to download the Tomcat binaries from the Tomcat website. There is no deb package available at the time of writing this post. Tomcat is supposed to be pure Java, so it should work on Nexenta and Debian GNU / kFreeBSD assuming the Java JDK is installed correctly.
Install Ontotext KIM on Nexenta or StormOS
#apt-get install unzip
#unzip kim-platform-2.x
#cd kim-platform-2.x/bin
#nano config_machine.sh
set JAVA_HOME="/usr/lib/jvm/java-6-sun"
set _TOMCAT_HOME="your tomcat location"
#chmod +x startKIM.sh
#./startKIM.sh
In order to use KIM clients (Web services or the Demo website) you need Tomcat(see above).
Copy the wars from \kim-platform-2.x\KIM Clients\ to \Tomcat\webapps. You can check my Install Ontotext KIM post for more information on how to see KIM in action.
Posted by Anton Andreev
in Techno-talk
at
02:13
| Comments (0)
| Trackbacks (0)
Last modified on 2009-07-10 17:50
Monday, May 18. 2009
GATE tutorial - KIM part 3
Summary:
This is a beginner's GATE tutorial.
GATE is a tool for Natural Language Processing (NLP). GATE helps you extract data from text articles, which you can turn into machine-readable knowledge. It provides a development IDE that helps you create and test an application. Once you are done, you can have your application executed from Java the same way you did from the IDE. GATE applications can be incorporated into Ontotext KIM.
First you should read the user guide. Also I am using GATE-5.0-beta1 build 3048, Eclipse 3.4.2(used in the Java sample) on Windows XP SP3.
Let's say we want to find the relation when company A acquires company B .
Gather some example articles. Create corpus. "Corpus" is just a funny name for a group of articles. My articles are here.
We are going to focus on English articles. GATE gives you the ability to create your own text processing applications. There may already be GATE applications that are good and can be used for general-purpose text processing. "Application" is a GATE term; we are not talking about applications in general. The point is that we could create an application from scratch, which is not that hard, but it is always better and, most of all, easier to improve upon something that exists.
We choose a GATE application called "ANNIE". In fact ANNIE is not just some application; it is an integral part of GATE itself. You should try to process some articles with ANNIE, look at the Annotation Sets and the Annotation List, and get used to them. Keep in mind that ANNIE is primarily oriented towards English.
Next we need to add functionality to ANNIE. We could add many different things, but here we add a "Jape Transducer", which points to a file describing what should be detected in our articles. That file is a "Jape file". Don't worry about what Jape is right now. Next, click on ANNIE. You will see some processing resources (on the right). The Jape transducer we have just created is such a processing resource, and we need to add it to the list on the right. These processing resources work at different levels, and each can depend on the output of others. That is the meaning of "pipe" in GATE. So it is best to leave our new processing resource last (at the bottom) of the list.
Yes, next is Jape. Jape is a language similar to regular expressions. We are going to use acquire.jape. It has two rules. I put it in \GATE-5.0-beta1\plugins\ANNIE\resources\NE\grammar\acquire.jape.
I have also modified ANNIE Gazetteer in \GATE-5.0-beta1\plugins\ANNIE\resources\gazetteer\company.lst by adding two lines:
MySQL
MySql
to make sure MySql is recognized as a company.
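Why add two spellings? The sketch below models a gazetteer list as a plain set of strings with exact, case-sensitive lookup. This is an assumption about how the list is matched (the real ANNIE gazetteer is configurable and more sophisticated), and the class is invented for this illustration, but it shows why each spelling variant needs its own line under exact matching.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a gazetteer list; not the GATE gazetteer implementation.
class CompanyGazetteerSketch {

    private final Set<String> entries;

    // Each argument stands for one line of a list file such as company.lst.
    CompanyGazetteerSketch(String... lines) {
        this.entries = new HashSet<>(Arrays.asList(lines));
    }

    // Exact, case-sensitive lookup (an assumption of this sketch).
    boolean isCompany(String token) {
        return entries.contains(token);
    }

    public static void main(String[] args) {
        CompanyGazetteerSketch gazetteer = new CompanyGazetteerSketch("MySQL", "MySql");
        System.out.println(gazetteer.isCompany("MySql"));
        System.out.println(gazetteer.isCompany("mysql"));
    }
}
```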
The whole idea is to make your enhanced ANNIE work by supplying a correct Jape grammar and testing it. Then you save your application to a file. You do that by right-clicking on a GATE application and selecting "Save application state".
1. You should save your application with the "gapp" extension (no problem if you do not).
2. It is better to remove the corpus from your application before saving, because that corpus will become one more dependency of your application.
Gapp files are simply XML files which describe where everything you use in your GATE application is located. This means you can edit them. Why would you? You won't need to as long as you keep using the gapp file from the location where you saved it and you have not moved your GATE installation. If you move myapp.gapp from c:\gatetest to d:\work\gatetest, things will probably go wrong. Fixing the paths in the file is easy.
Next we create a normal Java console application. We add all the jars in gate/lib. Make sure you have added ALL the jars! I had problems because I had left some out.
And then what?
Thank you for this question. Then we use this Java code:
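Fixing the paths can be done by hand in a text editor, or with a few lines of Java. The sketch below is a hedged example, not a GATE tool: the class GappPathRewriter and its methods are hypothetical names, it assumes the file is UTF-8, and it assumes the old location appears literally in the file. Open your own gapp first and check how the paths are actually written (slashes, a file: prefix and so on) before choosing the two strings.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: rewrites a base path inside a saved gapp file.
final class GappPathRewriter {

    // Plain text replacement: every occurrence of oldBase becomes newBase.
    static String relocate(String gappXml, String oldBase, String newBase) {
        return gappXml.replace(oldBase, newBase);
    }

    // Reads the source gapp, rewrites the paths and writes the result to
    // a separate target file, so the original is left untouched.
    static void relocateFile(Path source, Path target, String oldBase,
                             String newBase, Charset charset) throws IOException {
        String xml = new String(Files.readAllBytes(source), charset);
        Files.write(target, relocate(xml, oldBase, newBase).getBytes(charset));
    }

    public static void main(String[] args) throws IOException {
        // Example arguments (assumed, adjust to your setup):
        //   myapp.gapp myapp-moved.gapp c:/gatetest d:/work/gatetest
        relocateFile(Paths.get(args[0]), Paths.get(args[1]),
                args[2], args[3], StandardCharsets.UTF_8);
    }
}
```

Writing to a new file rather than overwriting keeps a working copy around in case the replacement strings were wrong.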
import gate.Annotation;
import gate.AnnotationSet;
import gate.Corpus;
import gate.CorpusController;
import gate.Document;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.util.persistence.PersistenceManager;

import java.io.File;

public class BatchProcessApp {

    // Document encoding; null means no encoding was specified.
    private static String encoding = null;

    public static void main(String[] args) throws Exception {
        // initialise GATE - this must be done before calling any GATE APIs
        Gate.init();

        File[] files = getFilesFromDir("F:/Temp/articles");
        File gappFile = new File("g:/annie_acquire_nocorpus.gapp");

        // load the saved application
        CorpusController application = (CorpusController) PersistenceManager
                .loadObjectFromFile(gappFile);

        // Create a Corpus to use. We recycle the same Corpus object for each
        // iteration. The string parameter to newCorpus() is simply the
        // GATE-internal name to use for the corpus. It has no particular
        // significance.
        Corpus corpus = Factory.newCorpus("BatchProcessApp Corpus");
        application.setCorpus(corpus);

        // process the files one by one, skipping anything that is not .txt
        for (File docFile : files) {
            if (!docFile.getName().endsWith(".txt")) {
                continue;
            }

            // load the document (using the specified encoding if one was given)
            System.out.print("Processing document " + docFile + "...");
            Document doc = Factory.newDocument(docFile.toURI().toURL(), encoding);

            // put the document in the corpus
            corpus.add(doc);

            // run the application
            application.execute();

            // remove the document from the corpus again
            corpus.clear();

            // we only extract annotations from the default (unnamed)
            // AnnotationSet in this example
            AnnotationSet defaultAnnots = doc.getAnnotations();
            System.out.println();
            for (Annotation ann : defaultAnnots) {
                FeatureMap map = ann.getFeatures();
                // print only the annotations that carry a "relationType" feature
                if (map.get("relationType") != null) {
                    System.out.println("## " + map.get("relationType")
                            + " #CompanyA=" + map.get("companyA")
                            + " #CompanyB=" + map.get("companyB"));
                }
            }

            // release the document so it does not stay in memory
            Factory.deleteResource(doc);
            System.out.println("done");
        } // for each file

        System.out.println("All done");
    }

    // Lists the direct children of the given directory (no recursion).
    // Returns an empty array if the path is not a readable directory.
    private static File[] getFilesFromDir(String path) {
        File[] files = new File(path).listFiles();
        return files != null ? files : new File[0];
    }
}
You can view/download the source from code.google.com
Gate.init() should be called only once! To run this code you need to set the path to your gapp file and the location of your articles (folders are not scanned recursively). Note that we could load all documents into one corpus, but instead the code keeps only one document in the corpus at a time, which makes better use of system resources. Also, this code sample displays only annotations that have a "relationType" feature. You should make it display everything.
You can see that the code instantiates the GATE application (a modified ANNIE) in the form of a "CorpusController" named "application". It is something like Java/.NET remoting, but you point it to a gapp file, which has all the meta information needed to construct the object/the application.
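If your articles are spread over sub-folders, the flat listing in BatchProcessApp will miss them. Here is one possible way to collect .txt files recursively using only the standard library. The class RecursiveTextFileFinder and its method names are my own (hypothetical); they are not part of GATE or of the original sample.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: walks a folder tree and collects the .txt files.
final class RecursiveTextFileFinder {

    // Returns every .txt file under dir, including those in sub-folders.
    static List<File> findTextFiles(File dir) {
        List<File> found = new ArrayList<File>();
        collect(dir, found);
        return found;
    }

    private static void collect(File dir, List<File> found) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or it cannot be read
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                collect(entry, found); // descend into the sub-folder
            } else if (entry.getName().endsWith(".txt")) {
                found.add(entry);
            }
        }
    }

    public static void main(String[] args) {
        for (File f : findTextFiles(new File(args[0]))) {
            System.out.println(f);
        }
    }
}
```

The returned list can be looped over in place of the File[] array in BatchProcessApp; the rest of the processing stays the same.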
Conclusion:
Data extraction with GATE can be done; it just takes reading through the documentation and posting questions to the GATE mailing list.
Maybe it is a good idea to create a Linux (or FreeBSD, OpenSolaris, Nexenta) VMware image (or Xen, VirtualBox) which has GATE, Eclipse and GATE's samples installed and working properly.
Disclaimer:
This Java sample is based on this sample.
I am a GATE newbie, so for now do not expect me to be able to answer your questions.
Credits:
Special thanks go to: everyone from the GATE mailing list, Marin Nozhchev (Ontotext), Stanislav Zlatinov.
Todo:
Explain the Jape code.
Add my articles.
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Gazetteers - KIM/GATE part 7
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0
Posted by Anton Andreev
in Techno-talk
at
13:02
| Comment (1)
| Trackbacks (0)
Last modified on 2009-11-17 13:47