Today we are going to talk about the performance optimizations in the next version of KIM, which will be released by the end of this year. Its version number is 3.0 and it will most likely appear in October, but the release will be postponed if needed.
We are going to cover both clustering (the use of more than one KIM server instance) and multi-threading. Threads are used to run the KIM semantic annotator in parallel; the annotator returns annotated documents.
One of the most important settings to remember is configured in \config\nerc.properties:
# Maximum number of annotation processes that can run at the same time.
# If set to more than 1, KIM will load multiple copies of the pipelines listed in the IE_APP parameter above
# during initialization. Multiple copies of the pipeline allow for parallel annotation of up to that number of documents
# Default: 1 (parallel annotation disabled)
com.ontotext.kim.semanticannotation.PARALLEL_NERCS=6
As you can see, with this new parameter KIM 3.0 will load 6 instances of the pipeline, so that 6 documents can be processed (annotated) at the same time.
Now take a look at the KIMProcessor I have written. The code is here. Keep in mind that this code was created against a development build of KIM 3.0.
Threads are used to speed up the supply of documents to the KIM server. The problem is that this won't speed up your work much. If you supply KIM with too many documents and there is no free pipeline, your documents will probably be queued and will only take up memory.
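One way to avoid flooding the server is to cap the number of documents in flight at the number of pipelines. Here is a minimal Java sketch of that idea. ThrottledSupplier, supply and the annotate callback are hypothetical names made up for illustration; they are not part of KIM or of the KIMProcessor, and the callback stands in for whatever call actually sends a document to the server.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical helper, not part of KIM or the KIMProcessor.
class ThrottledSupplier {

    // Passes every document to `annotate`, never allowing more than `pipelines`
    // calls to run at once. Returns the highest concurrency that was observed.
    static int supply(List<String> documents, int pipelines, Consumer<String> annotate)
            throws InterruptedException {
        Semaphore free = new Semaphore(pipelines);   // one permit per pipeline
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (String doc : documents) {
                free.acquire();                      // the reader waits while all pipelines are busy
                pool.execute(() -> {
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        annotate.accept(doc);        // stand-in for the real annotation call
                    } finally {
                        inFlight.decrementAndGet();
                        free.release();              // the pipeline is free again
                    }
                });
            }
        } finally {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> docs = List.of("a", "b", "c", "d", "e", "f", "g", "h");
        int peak = supply(docs, 3, doc -> { /* pretend to annotate */ });
        System.out.println("peak concurrent calls: " + peak);
    }
}
```

With this shape the reading side slows down by itself when all pipelines are busy, instead of piling documents up in memory.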
You may set:
com.ontotext.kim.semanticannotation.PARALLEL_NERCS=auto
and the number of pipelines will be equal to the number of processor cores reported by the OS (on Windows cmd: echo %NUMBER_OF_PROCESSORS%).
The threading functionality needs to be extended and would be useful in two cases:
1. When using the KIMProcessor with multiple KIM servers. You could set up, for example, 5 physical machines with 1 KIM server each. The machine running the KIMProcessor is the one that reads the documents (from PostgreSQL 8.4 in this example). If you are reading the documents from a single standard hard drive, you may now need to supply the articles faster, because you have 5 servers with, let's say, 6 pipelines each, which gives 30 pipelines. In this case threading is definitely useful. Of course, threads won't help once you reach the I/O limit of your hard drive.
2. If you have big documents, they will be read more slowly and at the same time will take longer to process. Using threads to supply the documents might again be too fast, as all the pipelines might be busy. A good example of when you should use threads is when you load documents from a web service and these documents are of normal news article size (not too big).
Note that in the KIMProcessor all the articles are first loaded into memory and then supplied to the KIM server asynchronously. The right way to code this is to use some kind of asynchronous calls to the database together with a synchronized blocking queue, so that a document is sent to the KIM server the moment it is read.
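The blocking queue idea can be sketched in Java roughly like this. It is only an illustration under assumptions: StreamingFeeder, the reader iterator and the send callback are hypothetical names, the iterator stands in for the database cursor, and send stands in for the real call to a KIM server. It is not the actual KIMProcessor code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch, not the actual KIMProcessor code.
class StreamingFeeder {

    // Sentinel object that tells a sender thread there is nothing more to read.
    private static final String END = new String("<end-of-input>");

    static void run(Iterator<String> reader, int senders, int capacity, Consumer<String> send)
            throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < senders; i++) {
            Thread t = new Thread(() -> {
                try {
                    // take() blocks until the reader has produced a document
                    for (String doc = queue.take(); doc != END; doc = queue.take()) {
                        send.accept(doc);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }
        // put() blocks when the queue is full, so at most `capacity` documents wait in memory
        while (reader.hasNext()) {
            queue.put(reader.next());
        }
        for (int i = 0; i < senders; i++) {
            queue.put(END);   // one sentinel per sender thread
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> sent = Collections.synchronizedList(new ArrayList<>());
        run(List.of("doc-1", "doc-2", "doc-3", "doc-4").iterator(), 2, 2, sent::add);
        System.out.println(sent.size() + " documents sent");
    }
}
```

The bounded queue is the important design choice here: reading and sending overlap, but the reader can never get more than `capacity` documents ahead of the senders. A production version would also need to handle a send call that throws, which this sketch does not.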
Another cool feature to add to the KIMProcessor would be fail-safe support. When one of the servers goes down, the documents that were sent to it should be pulled back into the queue, so that another node in the cluster can process them. An automatic check should also run once in a while, so that the server can be used again when it is back online.
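To make the fail-safe idea a bit more concrete, here is a small single-threaded Java sketch of the bookkeeping. Everything in it is an assumption made for illustration: FailoverDispatcher and the Node interface are hypothetical, Node stands in for whatever client talks to one KIM server, and a real version would run the dispatch and the health check on separate threads.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch; nothing here is a real KIM API.
class FailoverDispatcher {

    interface Node {
        void annotate(String document) throws Exception; // throws when the server is unreachable
        boolean isAlive();                               // cheap health check
    }

    private final Deque<String> pending;
    private final List<Node> active;
    private final List<Node> down = new ArrayList<>();

    FailoverDispatcher(List<Node> nodes, List<String> documents) {
        this.active = new ArrayList<>(nodes);
        this.pending = new ArrayDeque<>(documents);
    }

    // Hands pending documents to the active nodes in round-robin order.
    // Returns how many documents were annotated successfully.
    int drain() {
        int done = 0;
        int next = 0;
        while (!pending.isEmpty() && !active.isEmpty()) {
            Node node = active.get(next % active.size());
            String doc = pending.poll();
            try {
                node.annotate(doc);
                done++;
                next++;
            } catch (Exception e) {
                pending.addFirst(doc);   // pull the document back so another node can take it
                active.remove(node);     // stop using the failed server
                down.add(node);
            }
        }
        return done;
    }

    // Meant to be called once in a while: puts recovered servers back into rotation.
    void recheck() {
        for (Iterator<Node> it = down.iterator(); it.hasNext();) {
            Node node = it.next();
            if (node.isAlive()) {
                it.remove();
                active.add(node);
            }
        }
    }

    int activeCount() { return active.size(); }

    int pendingCount() { return pending.size(); }

    public static void main(String[] args) {
        Node alwaysUp = new Node() {
            public void annotate(String document) { System.out.println("annotated " + document); }
            public boolean isAlive() { return true; }
        };
        FailoverDispatcher dispatcher =
                new FailoverDispatcher(List.of(alwaysUp), List.of("doc-1", "doc-2"));
        System.out.println(dispatcher.drain() + " documents processed");
    }
}
```

If every node fails, drain() simply stops and the remaining documents stay in the pending queue until recheck() finds a server that is back.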
The best solution is to implement KIM with Hadoop, but that will take time.
At Ontotext we have a different, tested application that is used to process more than 100 000 documents. The one provided here (the KIMProcessor) is maintained only by me for now.
Disclaimer:
Keep in mind that this post represents only my personal view of the topic. You should try different configurations and see how they work for you, or perhaps use our tested tool.
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Gazetteers - KIM/GATE part 7
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0






