This post shows that the GATE 5.1 NLP framework runs on OpenSolaris 2009.06 without any hassle. OpenSolaris uses bash as its default shell.
1. Download the GATE archive. Mine is "gate-5.1-build3431-BIN.zip".
2. Unzip the archive:
# unzip gate-5.1-build3431-BIN.zip
3. Install SUNWj6dev in order to have the Sun JDK 1.6 and not only the JRE.
# pkg install SUNWj6dev
You can learn more about pkg, the OpenSolaris Image Packaging System, in its documentation.
4. Issue the following commands in bash:
# export JAVA_HOME=/usr/jdk/instances/jdk1.6.0
# echo $JAVA_HOME
Make sure that your JDK is really in /usr/jdk/instances/jdk1.6.0. The variable is exported so that gate.sh, which runs as a child process, can see it.
5. Go to the bin folder of GATE and simply type:
# ./gate.sh
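The check from step 4 can also be automated. The following is only a small sketch, not part of GATE: the class name JdkCheck and its method are made up for this example. It assumes that a JDK, unlike a plain JRE, contains the compiler at bin/javac under JAVA_HOME.

```java
import java.io.File;

class JdkCheck {
    /** Returns true if the directory contains bin/javac, i.e. it looks like a JDK and not only a JRE. */
    static boolean looksLikeJdk(String javaHome) {
        if (javaHome == null || javaHome.isEmpty()) {
            return false;
        }
        // A JRE ships bin/java only; the compiler bin/javac is part of the JDK.
        return new File(javaHome, "bin/javac").isFile();
    }

    public static void main(String[] args) {
        String javaHome = System.getenv("JAVA_HOME");
        System.out.println("JAVA_HOME=" + javaHome + " -> JDK found: " + looksLikeJdk(javaHome));
    }
}
```

Compile and run it with javac JdkCheck.java and java JdkCheck in the same shell where JAVA_HOME was exported.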
Friday, February 12. 2010
Running GATE on OpenSolaris
Thursday, December 17. 2009
Compiling GATE
This post explains how to download and set up the GATE source code. You need this when you would like to change or improve something in GATE itself.
SVN checkout: https://gate.svn.sourceforge.net/svnroot/gate
Open Eclipse in a new workspace (recommended). Use File->Import->Existing Projects into workspace->select your gate source dir->Finish.
Use the "Java Element Filters" to hide all "Non-Java elements" to make your project more compact.
Update:
My problems were caused by an error that prevented me from downloading all GATE files during the svn checkout. A filename that is allowed on Linux is not allowed on Windows: the file in question was ".cow:no-iframe", and ":" is not a legal character in Windows filenames. This halts the entire svn checkout, which made me try all sorts of tweaks and patches. The GATE source is over 500 MB and 13 000 files, so make sure you have the complete checkout before trying to fix anything like I did. If you run into similar problems, try a checkout on Linux to see whether the issue is OS dependent. If you copy the source from Linux to Windows, check in Eclipse that Properties->Resource->Text File Encoding->Other is set to "UTF-8".
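If you have a file listing from a Linux checkout, you can look for names that would break a Windows checkout before you start. The class below is a minimal sketch written for this post (the name WindowsNameCheck is made up). It only checks for the characters that Windows forbids in file names; it does not cover the other Windows rules such as reserved names (CON, NUL, ...) or trailing dots and spaces.

```java
import java.util.ArrayList;
import java.util.List;

class WindowsNameCheck {
    // Characters that Windows forbids in a file name (control characters are forbidden too).
    private static final String FORBIDDEN = "<>:\"/\\|?*";

    /** Checks one file name (a single path component, not a whole path). */
    static boolean isValidOnWindows(String fileName) {
        if (fileName.isEmpty()) {
            return false;
        }
        for (int i = 0; i < fileName.length(); i++) {
            char c = fileName.charAt(i);
            if (c < 32 || FORBIDDEN.indexOf(c) >= 0) {
                return false;
            }
        }
        return true;
    }

    /** Returns the names from the list that would be rejected by Windows. */
    static List<String> findProblems(List<String> fileNames) {
        List<String> bad = new ArrayList<>();
        for (String name : fileNames) {
            if (!isValidOnWindows(name)) {
                bad.add(name);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println(isValidOnWindows(".cow:no-iframe")); // false, because of the ":"
        System.out.println(isValidOnWindows("build.xml"));      // true
    }
}
```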
I am interested in refactoring the code of the OrthoMatcher. This is the processing resource that links annotations that refer to the same entity, so that they are recognized as the same thing. This is needed when a person or an organization is mentioned in different forms in the same document.
Wednesday, September 30. 2009
KIM/GATE Getting Started - part 1
Once you have installed KIM (see the installation post, KIM part 0), you have five options in general:
Hint: Always stop KIM with \bin\stopKIM.bat to prevent possible data corruption (in the OWLIM semantic database).
Hint: On Linux use the ".sh" extension, not ".bat" to run the scripts.
1. Use KIM-GATE UI
This will start both the Sheffield GATE (gate.ac.uk) developer and the Ontotext KIM server. GATE developer is a Java desktop application that communicates with the KIM server automatically.
Use \bin\startKIMGate.bat(sh) to start it. It will load KIM's default information extraction pipeline, which has no name of its own; you will see it as "Conditional Corpus Pipeline".

With the KIM/GATE developer you can start annotating documents. Put simply, annotating documents means that you will see your documents marked with different colors. Every color stands for an annotation type, such as: personal name, location, time, job position, etc.
You need to:
- read the GATE documentation and watch the GATE Flash movies.
- know that "Corpus" is a funky name for a bunch of documents. You populate a corpus with documents and then you annotate the corpus - actually the documents in it.
- explore the GATE architecture for concepts like "processing resource"/"GATE plug-in" and "GATE application" (there are not that many). Applications can be used from both the UI and the Java API.
- know that the GATE developer (from the KIM package) starts with a default application, but when you download GATE from the GATE website, you need to load GATE's default application (pipeline) from the menus with "Load ANNIE with defaults".
2. Import documents through the KIM populator tool
You need to start the KIM server first with \bin\startKIM.bat(sh) (not with startKIMGate.bat!).

Then start \bin\tools\toolPopulate.cmd which on Windows starts the populator with a classic Windows UI.
Point to a folder that contains text or html documents. Press "Start".
This tool will call the "AddDocument" method from the KIM Document Repository API. You can usually configure which operations this method performs when it is called. Its default behavior is to:
- create a full-text search index over the document
- extract entities and add them to the OWLIM semantic database
Technical remarks:
Both the standalone KIM server (startKIM.bat) and the GATE developer from the KIM installation (startKIMGate.bat) can run at the same time, but the populator tool needs a running KIM instance - it does not start one for you. You cannot start only the GATE developer (from the KIM installation) and expect the populator tool to work.
To see your results, use either the KIM web UI (see point 4) or the Sesame web UI (see point 5) to make queries. "Results" means entities, triples, etc.
3. Write a JAVA program that connects to KIM
You can perform semantic annotation and make queries (SeRQL; SPARQL only in 3.0) through a Java API. You can also search through the full-text search index of the KIM Document Repository.
Please check:
Gate tutorial - KIM part 3
Using a Gate application - KIM part 4
To see your results, you can again do that programmatically, or use the KIM web UI (see point 4) or the Sesame web UI (see point 5) to make queries.
4. Explore KIM standard front-end UI
You need Tomcat to run the default web UI. I have to admit that a lot of effort has gone into making this interface really good.
Copy KIM.war from KIM Clients to Tomcat's \webapps.
Start KIM with \bin\startKIM.bat(sh).
Start Tomcat.
With your browser open: http://localhost:8080/KIM/

To try the latest version of the user-interface - visit: latest_news.semanticannotation.com
Many of the features will be disabled because KIM 2.4 requires CORE (CORE = Co-Occurring and Ranking of Entities), which runs over a relational database, so a version of Oracle is required. Please check the documentation on how to enable the CORE DB with Oracle. Future versions of KIM will be less dependent on Oracle.
5. Use the Aduna Sesame Web UI to write semantic queries
Start Sesame: \bin\startSesame.bat. You do not need to have the KIM server running - see below.
Copy sesame-web-ui.war from \KIM Clients\ to your Tomcat's \webapps folder.
Start Tomcat.
You may experience a problem with Tomcat if you have KIM.war deployed. When Tomcat starts, it will try to run KIM.war, and if no KIM instance is running this results in some error messages. You can fix this either by starting KIM (startKIM.bat) or by removing KIM.war and the KIM folder from Tomcat's webapps. If you remove them, do not forget to add them again the next time you want to use the KIM UI.
Start your browser and point it to http://localhost:8080/sesame-web-ui.
Now you probably need to press "Go >>" in the UI to confirm the KIM Sesame server.
Next enter the default username/password: admin/admin.

Click on "SeRQL-S" at the top to enter a new select query. Keep in mind that this is a SeRQL query, not SPARQL.
SELECT company, name FROM
{company} <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> {<http://proton.semanticweb.org/2006/05/protonu#Company>},
{company} <http://proton.semanticweb.org/2006/05/protons#hasMainAlias> {alias},
{alias} <http://www.w3.org/2000/01/rdf-schema#label> {name}
The first column of the result is the company URI that uniquely identifies each company in the semantic database, and the second column is the main name that this company is known by. More than 7000 companies should be displayed from KIM's default knowledge base.
Note: When you use startSesame.bat you start a new OWLIM instance. When you have started a KIM server, you also have an OWLIM server running inside KIM. So you might end up with two OWLIM servers running, but that's OK as they are configured on different ports.
Conclusion:
Next step is the KIM documentation itself.
If you have any problems, please use the Google Ontotext search engine and/or subscribe to the kim-discussion mailing list.
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Gazetteers - KIM/GATE part 7
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0
Thursday, September 3. 2009
KIM Multi-threaded Clustered Client Application - KIM part 9
Summary:
Today we are going to talk about performance optimizations in the next version of KIM, which will be released by the end of this year. Its version number is 3.0 and most likely it will appear in October, but if needed the release will be postponed.
We are going to talk about both clustering (the use of more than one KIM server instance) and multi-threading. Threads are used to run the KIM semantic annotator, which returns annotated documents, in parallel.
One of the most important settings to remember is configured in \config\nerc.properties:
# Maximum number of annotation processes that can run at the same time.
# If set to more than 1, KIM will load multiple copies of the pipelines listed in the IE_APP parameter above
# during initialization. Multiple copies of the pipeline allow for parallel annotation of up to that number of documents
# Default: 1 (parallel annotation disabled)
com.ontotext.kim.semanticannotation.PARALLEL_NERCS=6
As you can see, with this new parameter in KIM 3.0 you get 6 instances of the pipeline, so that 6 documents can be processed (annotated) at the same time.
Now you need to take a look at the KIMProcessor I have written. The code is here. Keep in mind that this code was created with a development build of KIM 3.0.
Threads are used to speed up the supply of documents to the KIM server. The problem is that this won't speed up your work much. If you supply KIM with too many documents and there is no free pipeline, then your documents will probably be queued and will only take up memory.
You may set:
com.ontotext.kim.semanticannotation.PARALLEL_NERCS=auto
and the number of pipelines will be equal to the number of processor cores reported by the OS (on Windows, in cmd: echo %NUMBER_OF_PROCESSORS%).
The threading functionality needs to be extended and would be useful in two cases:
1. When using the KIMProcessor with multiple KIM servers. You could set up, for example, 5 physical machines with 1 KIM server each. The machine that runs the KIMProcessor is the one that reads the documents (from PostgreSQL 8.4 in this example). If you are reading the documents from a single standard hard drive, you may now need to supply the articles faster, because you have 5 servers with, let's say, 6 pipelines each, which results in 30 pipelines. In this case the use of threading is definitely useful. Of course the threads won't help once you reach the I/O limit of your hard drive.
2. Big documents are read more slowly and at the same time take more time to process. Using threads to supply them might again be too fast, as all the pipelines might already be busy. A good example of when you should use threads is when you load documents from a web service and these documents are of normal news article size (not too big).
Note that in the KIMProcessor all the articles are first loaded into memory and then supplied to the KIM server asynchronously. The right way to code this is to use some kind of asynchronous calls to the database together with a synchronized blocking queue, so that the moment a document is read it is sent to the KIM server.
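The blocking-queue idea could look roughly like the sketch below. This is not the KIMProcessor code and it does not use the KIM API: the class DocumentFeeder and the annotate method are placeholders invented for this example, and documents are plain strings. The point is the bounded queue: the reader blocks as soon as the queue is full, so documents are never piled up in memory faster than the pipelines can take them. It needs Java 8 or later.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class DocumentFeeder {
    // Unique marker object that tells a worker to stop; compared by identity on purpose.
    private static final String STOP = new String("STOP");

    /** Feeds the documents through a bounded queue to one worker per pipeline; returns how many were processed. */
    static int feed(List<String> documents, int pipelines, int queueCapacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicInteger processed = new AtomicInteger();
        Thread[] workers = new Thread[pipelines];
        for (int i = 0; i < pipelines; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        String doc = queue.take();   // waits until a document is available
                        if (doc == STOP) {
                            break;
                        }
                        annotate(doc);
                        processed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        try {
            for (String doc : documents) {
                queue.put(doc);                      // blocks while the queue is full
            }
            for (int i = 0; i < pipelines; i++) {
                queue.put(STOP);                     // one stop marker per worker
            }
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while feeding documents", e);
        }
        return processed.get();
    }

    /** Placeholder: a real client would send the document to a KIM server here. */
    static void annotate(String doc) {
        // intentionally empty in this sketch
    }
}
```

In a real client the loop over documents would be replaced by the database reader, so that each document is put on the queue as soon as it has been read.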
Another cool feature to add to the KIMProcessor would be fail-safe support. When one of the servers is down, the documents that were sent to it should be put back on the queue, so that another node in the cluster can process them. An automatic check should also be done once in a while, so that the server can be used again when it is back online.
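A very reduced sketch of this fail-safe idea is shown below. Again, everything here is hypothetical: the Server interface is a stand-in for a remote KIM server, a failure is modeled as a RuntimeException, and the dispatching is single-threaded round robin. The periodic check that brings a recovered server back is left out. It needs Java 8 or later.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FailoverDispatcher {
    /** Hypothetical stand-in for a remote KIM server; throws a RuntimeException when it is down. */
    interface Server {
        void annotate(String doc);
    }

    /** Sends every document to some server and returns which server index handled which document. */
    static Map<String, Integer> dispatch(List<String> docs, List<Server> servers) {
        Deque<String> queue = new ArrayDeque<>(docs);
        boolean[] down = new boolean[servers.size()];
        int alive = servers.size();
        Map<String, Integer> handledBy = new LinkedHashMap<>();
        int next = 0;
        while (!queue.isEmpty()) {
            if (alive == 0) {
                throw new IllegalStateException("no server available");
            }
            int s = next % servers.size();   // round robin over the servers
            next++;
            if (down[s]) {
                continue;                    // skip servers that already failed
            }
            String doc = queue.poll();
            try {
                servers.get(s).annotate(doc);
                handledBy.put(doc, s);
            } catch (RuntimeException e) {
                down[s] = true;              // remember the failure
                alive--;
                queue.addFirst(doc);         // put the document back for another server
            }
        }
        return handledBy;
    }
}
```

With two servers where the second one always fails, all documents end up on the first server and none is lost.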
The best solution is to implement KIM with Hadoop, but that will take time.
At Ontotext we have a different, tested application which is used to process more than 100 000 documents. The one provided here (the KIMProcessor) is maintained only by me for now.
Disclaimer:
Keep in mind that this post represents only my personal view of the topic. You should try different configurations and see how it works for yourself or probably use our tested tool.
Posted by Anton Andreev in Techno-talk at 13:54
Last modified on 2009-11-04 15:34
Sunday, May 31. 2009
Strict Rules vs Machine Learning - KIM part 6
Summary:
There are generally two ways to recognize entities in text articles when using Ontotext KIM. Example entities are: people, organizations, locations.
Both methods have their strengths and weaknesses. Things that cannot be detected by humans cannot be detected by computers either.
Using strict rules (better known as Knowledge Engineering)
These rules are implemented in some regular expression language. In this case it is JAPE.
The more you customize the rules to detect what you need, the better results you get.
Advantages:
- if you already have some rules available (e.g. for dates, money ...) then it might be faster to create the new rules you need and get the job done quickly
- a smaller sample corpus might be required in some cases than when using machine learning
- in general, effectiveness depends directly on the amount of effort you put into producing better rules
Weaknesses:
- in practice the rules might become quite complicated and hard to maintain. Imagine a 20KB file that describes only one entity. Instead of reading the existing rules and modifying one of them, you end up adding the specific case that was missing at the end of the file, thus increasing the total length of the file and the total complexity of the rules. This is especially true when different people are modifying these rules.
Machine learning
In order to use machine learning you need a framework that implements several machine learning algorithms. You as an expert can define features which will be taken into consideration when the framework is processing the example data:
- consider the length of the word
- consider the capitalization of the word
- consider the capitalization of the previous word
- consider prefixes and suffixes
The idea is not to set the exact rules, but rather to let the framework build them itself from the specific parts of the text you told it to pay attention to. Then you need to supply the machine-learning framework with enough example articles.
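To make the feature list above more concrete, here is a small sketch of how such features could be computed for one token. It is plain Java and does not use MALLET or any other framework; the class name, the feature names and the thresholds (a word is "long" above 6 characters, prefixes and suffixes are 3 characters) are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TokenFeatures {
    /** Computes simple true/false features for the token at the given position. */
    static Map<String, Boolean> extract(String[] tokens, int index) {
        String word = tokens[index];
        Map<String, Boolean> features = new LinkedHashMap<>();
        // length of the word
        features.put("isLong", word.length() > 6);
        // capitalization of the word and of the previous word
        features.put("isCapitalized", startsUpper(word));
        features.put("previousIsCapitalized", index > 0 && startsUpper(tokens[index - 1]));
        // prefix and suffix of up to 3 characters, encoded as features that are simply present
        features.put("prefix=" + word.substring(0, Math.min(3, word.length())), true);
        features.put("suffix=" + word.substring(Math.max(0, word.length() - 3)), true);
        return features;
    }

    private static boolean startsUpper(String word) {
        return !word.isEmpty() && Character.isUpperCase(word.charAt(0));
    }

    public static void main(String[] args) {
        String[] tokens = {"Mr", "Smith", "works"};
        System.out.println(extract(tokens, 1));
    }
}
```

The learning algorithm then decides from the example articles which combinations of these features indicate, for instance, a person name.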
Advantages:
- it can give better results than strict rules
Often the effort needed to reach 80% effectiveness is about the same as the effort needed to go from 80% to 85%.
Weaknesses:
- needs parameter and algorithm testing (that's actually not such a problem, it just needs some work hours)
- needs more example articles than strict rules, by a factor of 10 (this is my assumption)
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. Features in the case of MALLET are either true or false. MALLET provides several algorithms and each of them has its own configuration parameters. MALLET is implemented as a plug-in for KIM/GATE.
There is also a second machine learning framework called OpenNLP that is also implemented as a GATE plug-in in Ontotext KIM, and might soon be released as part of the standard KIM/GATE release.
Conclusions/final thoughts:
Both rule-based and machine learning are supported by the custom GATE pipeline for semantic annotation developed by Ontotext for the KIM platform.
In the short term it is better to use strict regular expression rules (like JAPE) as they give you results almost immediately, but in the long term (from both a complexity and an effectiveness point of view) it is definitely better to use a machine learning system like MALLET. Of course a combination of the two should work best, where rules are used first (some of them negative) and then machine learning is applied.
Another machine learning project is: edlin.sourceforge.net
This is a short introduction; contact Ontotext for more detailed information.
There are generally two ways to recognize entities from text articles when using Ontotext Kim. Example entities are: people, organizations, locations.
Both methods have their strengths and weaknesses. Things that can not detected by humans can not also be detected by computers.
Using strict rules(better known as Knowledge Engineering)
These rules are implemented by some regular expression language. In this case it is Jape.
The more you customize the rules to detect what you need, the better results you get.
Advantages:
- it you have some rules already available (e.x. for date, money ...) then it might be faster to create the new rules you need and get the job done fast
- a smaller sample corpus might be required in some cases than when using machine learning
- in general effectiveness is bound to the amount of efforts that are needed to produce better rules
Weaknesses:
- in practice the rules might become quite complicated and hard to support. Imagine a 20KB file that describes only one entity. You end up not reading the previous rules and modifying one of them, but rather adding the specific case that was missing in the end of the file and thus increasing the total length of the file and the total complexity of the rules. This is especially true when different people are modifying these rules.
Machine learning
In order to use machine learning you need a framework that implement several machine learning algorithms. You as an expert can define features which will be taken in consideration when the framework is processing the example data:
- consider the length of the word
- consider the case-sensitivity
- consider the case-sensitivity of the previous word
- consider prefixes and suffixes
The idea is not to set the exact rules, but rather make the framework build them itself from specific parts of the text you told the framework to pay attention to. Then you need to supply the machine-learning framework with enough test articles.
Advantages:
- it can give better results than strict rules
Often efforts needed to achieve 80% effectiveness are as much as from 80% to 85%.
Weaknesses:
- needs parameter and algorithm testing (that's actually not such a problem, it just needs some work hours)
- needs more example articles by a factor of 10(assumption) than using strict rules
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. Features in the case of MALLET are either true or false. MALLET provides several algorithms and each of them has its own configuration parameters. MALLET is implemented as plug-in for KIM/GATE.
There is also a second machine learning framework called openNLP that is also implemented as GATE plug-in in Ontotext KIM, and might soon be released as part of the standard KIM/GATE release.
Conclusions/final thoughts:
Both rule-based and machine learning are supported by the custom GATE pipeline for semantic annotation developed by Ontotext for the KIM platform.
In short term it is better to use strict regular expression rules(like Jape) as it gives you results almost momentarily, but in a long term (from both complexity and effectiveness point of view) is definitely better to use a machine learning system like MALLET. Of course a combination of the two should work best, where rules are used first(some of them negative) and then machine learning is applied.
Another machine learning project is: edlin.sourceforge.net
This is a short introduction; contact Ontotext for more detailed information.
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Gazetteers - KIM/GATE part 7
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0
Posted by Anton Andreev in Techno-talk at 10:32
Last modified on 2009-11-26 14:34
Wednesday, May 20. 2009
Nexenta vs Debian GNU / kFreeBSD
What is an OS:
- some kernel
- libc - a layer between the kernel and the user programs
- packages
Nexenta uses a Solaris kernel and Debian/Ubuntu packages. Debian GNU / kFreeBSD uses the FreeBSD kernel, but on the surface it should not be much different from a standard Debian edition. Nexenta uses the Solaris libc, while Debian GNU / kFreeBSD uses a port of GNU libc (glibc) to the FreeBSD kernel. Linux is only a kernel, not a fully operational OS. If we look at the Debian policy, we see that using another kernel is not a problem; what matters is the use of GNU libc. Unfortunately there is no port of GNU libc to Solaris. Nevertheless Nexenta is doing fine, although porting packages to Nexenta is probably harder than porting packages to Debian GNU / kFreeBSD because GNU libc is missing on Solaris. Both operating systems already have a considerable number of packages.
Update: In build 107 of OpenSolaris the Solaris libc has been released with some compatibility functions for Linux/BSD, which will help with porting Debian packages to Nexenta.
My opinion is that these hybrids are quite welcome as I have doubts about the quality of the Linux kernel.
FreeBSD is a pure OS, meaning that the FreeBSD community produces its own kernel, libc and some of the programs that come with the OS. This makes everything in the OS much more consistent.
As of 20.05.2009:
1. Debian GNU / kFreeBSD is an official Debian port, which means that all packages and translations should be available and tested in the next Debian release - squeeze, at least I think so.
2. Luca Favatella is working on a port of the Debian installer to Debian GNU / kFreeBSD. You could track his status here.
3. Gnome and xfce4 are available on both ... but do not expect things to work right out of the box.
4. StormOS is a distro based on Nexenta and provides an xfce4 desktop.
Install Sun JDK on Nexenta or StormOS:
I am using Nexenta Core 2 RC3, although Nexenta Core 2.0 has been released.
Issue the following commands:
# apt-get update
# apt-get install sun-java6-jdk
# apt-get install sunwlibc
Now you should have the JDK in /usr/lib/jvm
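It can be worth confirming that a full JDK, and not only a JRE, ended up under /usr/lib/jvm. The following is a small sketch under stated assumptions: the helper name is_jdk_home is made up for this example, and the directory /usr/lib/jvm/java-6-sun is the one used in the KIM configuration in this post, so adjust it if your system differs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the given directory looks like a JDK,
# i.e. it contains an executable bin/javac (a JRE alone does not ship javac).
is_jdk_home() {
  [ -x "$1/bin/javac" ]
}

# Assumed path; check the actual contents of /usr/lib/jvm on your machine.
candidate="/usr/lib/jvm/java-6-sun"
if is_jdk_home "$candidate"; then
  echo "JDK found in $candidate"
else
  echo "No javac in $candidate, check the contents of /usr/lib/jvm"
fi
```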
Install Tomcat
You need to download the Tomcat binaries from the Tomcat website. There is no deb package available at the time of writing this post. Tomcat is supposed to be pure Java, so it should work on Nexenta and Debian GNU / kFreeBSD provided the JDK is installed correctly.
Install Ontotext KIM on Nexenta or StormOS
# apt-get install unzip
# unzip kim-platform-2.x
# cd kim-platform-2.x/bin
# nano config_machine.sh
In config_machine.sh set:
JAVA_HOME="/usr/lib/jvm/java-6-sun"
_TOMCAT_HOME="your tomcat location"
# chmod +x startKIM.sh
# ./startKIM.sh
In order to use the KIM clients (Web services or the Demo website) you need Tomcat (see above).
Copy the wars from kim-platform-2.x/KIM Clients/ to the webapps directory of Tomcat. You can check my Install Ontotext KIM post for more information on how to see KIM in action.
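The copy step can be scripted. Below is a minimal sketch that assumes the .war files sit directly inside the "KIM Clients" directory; the function name deploy_wars and both example paths are placeholders, not something shipped with KIM. Note the quoting, which is needed because "KIM Clients" contains a space.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copies every .war file from one directory to another.
deploy_wars() {
  local src="$1" dest="$2" war
  for war in "$src"/*.war; do
    [ -e "$war" ] || continue   # skip the unexpanded pattern when no .war exists
    cp "$war" "$dest/"
  done
}

# Example call; both paths are placeholders to adjust to your layout:
# deploy_wars "kim-platform-2.x/KIM Clients" "/path/to/tomcat/webapps"
```

Tomcat normally picks up new wars in webapps when it starts, so restart it after copying if the applications do not appear.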
Posted by Anton Andreev in Techno-talk at 02:13
Last modified on 2009-07-10 17:50