The "Corpus Quality Assurance" tool and Annotation Transfer processing resource make it easy to compare the results of two pipelines.
You won't find this tool in the main GATE "Tools" menu; it appears at the bottom of the GATE interface when you click on a corpus. The tool has been available since GATE 5.1 (I think). To use the Corpus Quality Assurance tool we need two annotation sets that we can compare.
1. Create a datastore
Create a corpus. Then right-click it, choose the "Save to datastore" option and point it to your datastore. Next populate the corpus and notice that the documents do not appear as usual under the "Language resources" tab in your corpus, but in the datastore. Datastores are quite useful because they do not keep all the documents in memory. This way you can annotate a large number of documents before you run into a problem in GATE or Java.
2. Create Annotation Set 1
Warning: Check which version of GATE you are using and which plug-ins are loaded. GATE can load plug-ins from other GATE versions and instances that you have installed.
You might, for example, want to test against two versions of GATE. But the second GATE will probably load the same list of plug-ins as the one you have just closed. Yes, the core of the GATE framework will be different, but because you load the same plug-ins, you will probably get the same results. So you might end up comparing identical results or wondering why the results are not the ones you expect. Go to File->Manage CREOLE Plugins and select your resources explicitly, checking the paths to your plug-ins.
Add a new "Annotation Set Transfer" processing resource to the end of your pipeline:
inputASName: should be blank to use the default annotation set.
outputASName: will be in our case "Annotation Set 1".
For more information see the documentation.
Now run the pipeline. You should find the new annotation set below the default one. If needed, collapse the default one to see the new one below it.
This new annotation set "Annotation Set 1" has been saved in the datastore automatically.
3. Make changes
Replace plug-in versions or load another instance of GATE. Please check the plug-in versions by inspecting their paths in File->Manage CREOLE Plugins.
4. Create Annotation Set 2
Open the datastore created in step 1 (if it is not already available).
Open the corpus contained in that datastore (if it is not already available) by clicking the datastore, scrolling down until you find it, then right-clicking and selecting "Load". It should now be available in its usual place under "Language resources".
Check if "Annotation Set 1" is available.
Create "Annotation Set 2" by again adding an "Annotation Set Transfer" processing resource and setting the correct parameters. Set the "setsToKeep" parameter of the "Document Reset PR" to "Annotation Set 1" in order to preserve the results in "Annotation Set 1".
Warning: If you get a strange exception it could be due to the Annotation Set Transfer resource. The version in GATE 5.2 trunk requires you to fill in the "annotationTypes" list, although the specification does not mark this parameter as required.
Run the pipeline!
Check if "Annotation Set 2" is available.
5. Inspect
Click on the corpus and then on "Corpus Quality Assurance" at the bottom. Wait until it initializes; there is a progress bar on the right. This initialization is needed so that the tool knows your annotation types and features and can fill the lists on the right.
Select "Annotation Set 1" and "Annotation Set 2" as A and B in the first list. Then select the annotation type that you would like to compare. I suggest that you select a single annotation type per comparison. Then select the features that must be equal for two annotations to count as identical. Select the measures and then press the "Compare" button.
Warning: The feature "matches" contains the IDs of the annotations that refer to the same entity. A problem arises because these IDs are different in "Annotation Set 1" and "Annotation Set 2", since GATE keeps assigning bigger and bigger IDs. If you check the document you will see that they refer to the same entities, and the IDs differ only by a constant, but they are still different. We need to solve this, because a difference in "matches" can also be a real problem and not simply a difference in IDs. One way to go is not to use "matches", but another feature that, for example, contains the start and end offsets of an annotation in the document. These should match when the same document is processed and the encoding is set correctly.
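To illustrate the idea of comparing by position instead of by ID, here is a minimal sketch in plain Java. It does not use the GATE API; the class name OffsetDiff, the key format and the sample offsets are my own assumptions for the example. In a real pipeline the type and offsets would come from the annotations themselves.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class OffsetDiff {

    /** Builds a comparison key from type and offsets, deliberately ignoring the annotation ID. */
    static String key(String type, long start, long end) {
        return type + "@" + start + "-" + end;
    }

    /** Returns the keys that are present in 'a' but missing from 'b', sorted for stable output. */
    static Set<String> onlyIn(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<String>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical annotations from two runs over the same document
        Set<String> setA = new HashSet<String>(Arrays.asList(
                key("Person", 0, 5), key("Organization", 10, 18)));
        Set<String> setB = new HashSet<String>(Arrays.asList(
                key("Person", 0, 5), key("Organization", 10, 20)));

        System.out.println("Only A: " + onlyIn(setA, setB));
        System.out.println("Only B: " + onlyIn(setB, setA));
    }
}
```

Because the key contains no ID, the two "Person" annotations count as the same even if GATE gave them different IDs in each run, while the "Organization" annotations differ because their end offsets differ.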
Click on "Document statistics". Then click on a document where "Only A", "Only B" or "Overlap" is greater than 0. This means we have a difference. To see the actual mismatch, click on the document and then click on the "Annotation Diff" button, which is currently the second one at the top right of the GATE UI.
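If you want to sanity-check the numbers by hand, the counts can be turned into precision, recall and F1. The sketch below is only my reading of a "strict" measure. It assumes that set A is treated as the reference and set B as the response, so that "Only A" counts as missing, "Only B" as spurious and "Overlap" as partially correct (and therefore wrong under strict matching). The class name and the sample counts are made up, and the exact formulas used by the tool may differ.

```java
public class StrictFMeasure {

    /** Strict precision: exact matches divided by everything set B produced. */
    static double precision(int correct, int partial, int spurious) {
        int responses = correct + partial + spurious;
        return responses == 0 ? 0.0 : (double) correct / responses;
    }

    /** Strict recall: exact matches divided by everything set A contains. */
    static double recall(int correct, int partial, int missing) {
        int keys = correct + partial + missing;
        return keys == 0 ? 0.0 : (double) correct / keys;
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Hypothetical counts: 8 exact matches, 1 overlap, 1 only in A, 1 only in B
        double p = precision(8, 1, 1);
        double r = recall(8, 1, 1);
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n", p, r, f1(p, r));
    }
}
```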
Keep in mind that there might be some bugs, such as this one.
Friday, March 12. 2010
GATE Corpus Quality Assurance Tool
Friday, February 12. 2010
Running GATE on OpenSolaris
This post simply shows that the GATE 5.1 NLP framework runs on OpenSolaris 2009.06 without any hassle. OpenSolaris uses bash as its default shell.
1. Download the GATE archive. Mine is "gate-5.1-build3431-BIN.zip".
2. Unzip the archive:
# unzip gate-5.1-build3431-BIN.zip
3. Install SUNWj6dev in order to have the Sun JDK 1.6 and not only the JRE.
# pkg install SUNWj6dev
You can learn more about pkg, the OpenSolaris Image Packaging System, from here.
4. Issue the following commands in bash:
# export JAVA_HOME=/usr/jdk/instances/jdk1.6.0
# echo $JAVA_HOME
Make sure that your JDK is really in /usr/jdk/instances/jdk1.6.0
5. Go to the bin folder of GATE and simply type:
# ./gate.sh
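If GATE does not start, it may be worth double-checking that JAVA_HOME really points to a JDK and not just a JRE. Here is a small sketch that does this check from Java. The class name JdkCheck is made up, and the test for bin/javac is just a simple heuristic I am assuming is good enough.

```java
import java.io.File;

public class JdkCheck {

    /** Returns true if the directory looks like a JDK, i.e. it contains bin/javac. */
    static boolean looksLikeJdk(File home) {
        return new File(home, "bin/javac").isFile()
                || new File(home, "bin/javac.exe").isFile();
    }

    public static void main(String[] args) {
        String javaHome = System.getenv("JAVA_HOME");
        if (javaHome == null) {
            System.out.println("JAVA_HOME is not set");
            return;
        }
        System.out.println("JAVA_HOME = " + javaHome);
        System.out.println("Contains javac: " + looksLikeJdk(new File(javaHome)));
        System.out.println("Running on Java " + System.getProperty("java.version"));
    }
}
```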
Wednesday, February 10. 2010
Re-locating/porting a GATE application
This article covers the basics of moving a GATE application from one GATE instance to another.
Each GATE application is stored as a .gapp file.
For example, KIM's default pipeline is stored in the "IE.gapp" file. A gapp file is in fact an XML file.
1. Gapp file considerations
1.1 Best place to save a gapp file:
Usually this should be somewhere in the tree of a GATE/KIM deployment. When you save your GATE application (also called a pipeline), the resulting gapp file is generated with paths relative to your GATE/KIM installation.
1.2 File-system paths:
Usually a path in a gapp file looks like this:
$relpath$../../../C:/Program%20Files/GATE-5.0/plugins/Ontology_Tools
It is not recommended to use absolute paths, but if you do then you need to use the following syntax:
file:///G:/Work/gate-binary/gate-5.1-stable-3431/plugins/Ontology_Tools
1.3 Rename components:
The version of GATE on the destination machine might be different. Many plug-in names changed from GATE 5.0 to 5.1, so you might find yourself porting an application from an older version of GATE to a newer one. For example, the plug-in "NP_Chunking" was renamed to "Tagger_NP_Chunking".
So you need to change this line from:
$relpath$../../../C:/Program%20Files/GATE-5.0/plugins/NP_Chunking
to:
$relpath$../../../C:/Program%20Files/GATE-5.0/plugins/Tagger_NP_Chunking
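If you have many such lines, the rename can be scripted. The following is a rough sketch, not a GATE utility: the class GappPluginRename and its method are made up for this example, and it assumes the plug-in name always appears as a complete path segment right after "/plugins/". Back up your gapp file before trying anything like this.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GappPluginRename {

    /**
     * Replaces a plug-in directory name wherever it appears directly after "/plugins/".
     * The lookahead makes sure only a whole path segment is matched, so
     * "NP_Chunking_Extra" would not be touched when renaming "NP_Chunking".
     */
    static String renamePlugin(String gappXml, String oldName, String newName) {
        String pattern = "(/plugins/)" + Pattern.quote(oldName) + "(?=[/<\"\\s]|$)";
        return gappXml.replaceAll(pattern, "$1" + Matcher.quoteReplacement(newName));
    }

    public static void main(String[] args) {
        String line = "$relpath$../../../C:/Program%20Files/GATE-5.0/plugins/NP_Chunking";
        System.out.println(renamePlugin(line, "NP_Chunking", "Tagger_NP_Chunking"));
    }
}
```

Running it twice is harmless, because "/plugins/Tagger_NP_Chunking" no longer matches the old name.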
1.4 Remove components:
Sometimes you just need to remove components that you do not need. The following block describes a datastore:
<corpus class="gate.util.persistence.CorpusPersistence">
<dsData class="gate.util.persistence.DSPersistence">
<className>gate.persist.LuceneDataStoreImpl</className>
<storageUrlString>file:/D:/Ontologies/DataStore1/</storageUrlString>
</dsData>
<persistenceID class="string">Corpus for CP-HU___1255364292687___8444</persistenceID>
<resourceType>gate.corpora.SerialCorpusImpl</resourceType>
<resourceName>Corpus for CP-HU</resourceName>
<initParams class="gate.util.persistence.MapPersistence">
<mapType>gate.util.SimpleFeatureMapImpl</mapType>
<localMap/>
</initParams>
</corpus>
This block can be safely removed altogether.
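You can delete the block in a text editor, but it can also be done in code. Below is a minimal sketch using the standard Java XML APIs. The class name GappCorpusRemover is made up, and the sketch assumes that every element named "corpus" in the file should go, which you should verify against your own gapp file first. The tiny XML string in main is a hypothetical stand-in, not a real gapp file.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class GappCorpusRemover {

    /** Parses the XML, drops every <corpus> element and returns the result as a string. */
    static String removeCorpus(String gappXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(gappXml)));

        // The NodeList is live, so we remove from the end to keep the indexes valid
        NodeList corpora = doc.getElementsByTagName("corpus");
        for (int i = corpora.getLength() - 1; i >= 0; i--) {
            Node corpus = corpora.item(i);
            corpus.getParentNode().removeChild(corpus);
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, heavily shortened stand-in for a gapp file
        String xml = "<application>"
                + "<corpus class=\"gate.util.persistence.CorpusPersistence\">"
                + "<resourceName>Corpus for CP-HU</resourceName></corpus>"
                + "<resourceName>My pipeline</resourceName>"
                + "</application>";
        System.out.println(removeCorpus(xml));
    }
}
```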
2. Other resources:
Other resources include, for example, your Jape files. If you are using a standard GATE gazetteer, your gapp file refers to a def file, which you also need to relocate. If you are using an OntoGazetteer, your gapp file will also refer to an ontology file and a map file that you need to package too.
Thursday, September 3. 2009
KIM Multi-threaded Clustered Client Application - KIM part 9
Summary:
Today we are going to talk about performance optimizations in the next version of KIM, which will be released by the end of this year. Its version number is 3.0 and most likely it will appear in October, but if needed the release will be postponed.
We are going to talk about both clustering (the use of more than one KIM server instance) and multi-threading. Threads are used for executing the KIM semantic annotator in parallel, which returns annotated documents.
One of the most important settings to remember is configured in \config\nerc.properties:
# Maximum number of annotation processes that can run at the same time.
# If set to more than 1, KIM will load multiple copies of the pipelines listed in the IE_APP parameter above
# during initialization. Multiple copies of the pipeline allow for parallel annotation of up to that number of documents
# Default: 1 (parallel annotation disabled)
com.ontotext.kim.semanticannotation.PARALLEL_NERCS=6
As you can see, with this new parameter in KIM 3.0 you get 6 instances of the pipeline, so that 6 documents can be processed (annotated) at the same time.
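On the client side it can be handy to read the same setting, for example to decide how many sender threads to start. The sketch below shows one way this could look, including the "auto" value described further down in this post. This is not KIM code: the class ParallelNercSetting and the way the value is interpreted are my own assumptions, and only the property name is taken from nerc.properties.

```java
import java.io.StringReader;
import java.util.Properties;

public class ParallelNercSetting {

    static final String KEY = "com.ontotext.kim.semanticannotation.PARALLEL_NERCS";

    /**
     * Resolves the configured value to a pipeline count.
     * A missing property falls back to 1, "auto" uses the given core count,
     * and anything that is not a number throws a NumberFormatException.
     */
    static int resolve(Properties props, int cores) {
        String raw = props.getProperty(KEY, "1").trim();
        if (raw.equalsIgnoreCase("auto")) {
            return Math.max(1, cores);
        }
        return Math.max(1, Integer.parseInt(raw));
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // In practice this would be loaded from the nerc.properties file
        props.load(new StringReader(KEY + "=6\n"));
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("Pipelines: " + resolve(props, cores));
    }
}
```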
Now take a look at the KIMProcessor I have written. The code is here. Keep in mind that this code was created with a development build of KIM 3.0.
Threads are used to speed up the supply of documents to the KIM server. The problem is that this won't speed up your work much. If you supply KIM with too many documents and there is no free pipeline, then your documents will probably be queued and will only take up memory.
You may set:
com.ontotext.kim.semanticannotation.PARALLEL_NERCS=auto
and the number of pipelines will be equal to the number of processor cores reported by the OS (on Windows cmd: echo %NUMBER_OF_PROCESSORS%).
The threading functionality needs to be extended and would be useful in two cases:
1. When using the KIMProcessor with multiple KIM servers. You could set up, for example, 5 physical machines with 1 KIM server each. The machine running the KIMProcessor is the one that reads the documents (from PostgreSQL 8.4 in this example). If you are reading the documents from a single standard hard drive, you may now need to supply the articles faster, as you have 5 servers with, let's say, 6 pipelines each, which results in 30 pipelines. In this case the use of threading is definitely useful. Of course the threads won't help once you reach the I/O limit of your hard drive.
2. If you have big documents, they will be read more slowly and at the same time they will take more time to process. Using threads to supply the documents might again be too fast, as all the pipelines might be busy. A good example of when you should use threads is when you load documents from a web service and these documents are of normal news-article size (not too big).
Note that in the KIMProcessor all the articles are first loaded in memory and then supplied to the KIM server asynchronously. The right way to code this is to use some kind of asynchronous calls to the database and a synchronized blocking queue, so that the moment a document is read it is sent to the KIM server.
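The blocking-queue idea can be sketched with the standard java.util.concurrent classes. This is not the KIMProcessor code and it does not call KIM at all. The class QueueFeeder, the annotate() placeholder and the sentinel value are assumptions made for the example. The important part is the bounded queue: the reader blocks when the queue is full, so documents do not pile up in memory while all pipelines are busy.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueFeeder {

    // Unique sentinel object that tells a worker to stop (compared by reference)
    private static final String STOP = new String("STOP");

    /** Feeds the documents through a bounded queue to the workers and returns how many were handled. */
    static int process(List<String> documents, int workers, int capacity) throws InterruptedException {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(capacity);
        final AtomicInteger handled = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int w = 0; w < workers; w++) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            String doc = queue.take();   // waits until a document is available
                            if (doc == STOP) {
                                break;
                            }
                            annotate(doc);
                            handled.incrementAndGet();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }

        // The reader side: put() blocks whenever the queue is full
        for (String doc : documents) {
            queue.put(doc);
        }
        for (int w = 0; w < workers; w++) {
            queue.put(STOP);                             // one stop signal per worker
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return handled.get();
    }

    /** Placeholder for the remote call that would send the document to a KIM server. */
    static void annotate(String doc) {
        // intentionally empty in this sketch
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> docs = Arrays.asList("doc1", "doc2", "doc3", "doc4", "doc5");
        System.out.println("Handled: " + process(docs, 2, 2));
    }
}
```

A reasonable starting point for the number of workers is the total number of pipelines across all servers, but that is something to measure rather than assume.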
Another cool feature to add to the KIMProcessor would be fail-safe support. When one of the servers is down, the documents that were sent to it should be put back in the queue, so that another node in the cluster can process them. An automatic check should also be done once in a while so that the server can be used again when it is back online.
The best solution is to implement KIM with Hadoop, but that will take time.
At Ontotext we have a different, tested application which is used to process more than 100 000 documents. The one provided here (the KIMProcessor) is maintained only by me for now.
Disclaimer:
Keep in mind that this post represents only my personal view of the topic. You should try different configurations and see how they work for you, or perhaps use our tested tool.
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Gazetteers - KIM/GATE part 7
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0
Posted by Anton Andreev in Techno-talk at 13:54
Last modified on 2009-11-04 15:34
Monday, May 18. 2009
GATE tutorial - KIM part 3
Summary:
This is a beginner's GATE tutorial.
GATE is a tool for Natural Language Processing (NLP). GATE helps you extract data from text articles, which you can turn into computer knowledge. It provides a development IDE that helps you create and test an application. Once you are done, you can have your application executed from Java the same way you did from the IDE. GATE applications can be incorporated in Ontotext KIM.
First you should read the user guide. I am using GATE-5.0-beta1 build 3048 and Eclipse 3.4.2 (used in the Java sample) on Windows XP SP3.
Let's say we want to find the relation where company A acquires company B.
Gather some example articles. Create corpus. "Corpus" is just a funny name for a group of articles. My articles are here.
We are going to focus on English articles. GATE gives you the ability to create your own text processing applications. Maybe there are already GATE applications that are good and can be used for general-purpose text processing. "Application" is a GATE term; we are not talking about applications in general. The point is that we could create an application from scratch, it is not that hard, but it is always better and, most of all, easier to improve upon something.
We choose a GATE application called "ANNIE". Indeed ANNIE is not just some application, it is an integral part of GATE itself. You should try to process some articles with ANNIE, look at the Annotation Sets and Annotation List and get used to them. Keep in mind that ANNIE is primarily oriented towards English.
Next we need to add additional functionality to ANNIE. We could add a lot of different stuff, but we add a "Jape Transducer" which points to a file where we describe what should be detected in our articles. That file is a "Jape file". Don't worry about what Jape is right now. Next click on ANNIE. You will see some processing resources (on the right). The Jape transducer we've just created is such a processing resource. We need to add it to the right. You need to know that these processing resources work at different levels and each can depend on the output of others. That's the meaning of "pipe" in GATE. So it is best to leave our new processing resource last (at the bottom) of the list.
Yes, next is Jape. Jape is a language similar to regular expressions. We are going to use acquire.jape. It has two rules. I put it in \GATE-5.0-beta1\plugins\ANNIE\resources\NE\grammar\acquire.jape.
I have also modified the ANNIE Gazetteer in \GATE-5.0-beta1\plugins\ANNIE\resources\gazetteer\company.lst by adding two lines:
MySQL
MySql
to make sure MySql is recognized as a company.
The whole idea is to make your enhanced ANNIE work by supplying a correct Jape grammar and testing it. Then you save your application to a file. You do that by right-clicking on a GATE application and selecting "Save application state".
1. You should save your application with the "gapp" extension (no problem if you do not).
2. It is better to remove the corpus from your application before saving, because that corpus will become one more dependency of your application.
Gapp files are simply XML files which describe where everything you use in your GATE application is located. This means you can change them. What for? You won't need to modify them as long as you continue to use the gapp file/application from the location where you saved it and you do not change the location of your GATE installation. If you move myapp.gapp from c:\gatetest to d:\work\gatetest you will see that things will probably go wrong. Modifying the paths is easy.
Next we create a normal Java console application. We add all the jars in gate/lib. Make sure you have added ALL the jars! I had problems because I had left some out.
And then what?
Thank you for the question. Then we use this Java code:
import gate.Annotation;
import gate.AnnotationSet;
import gate.Corpus;
import gate.CorpusController;
import gate.Document;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.util.persistence.PersistenceManager;
import java.io.File;

public class BatchProcessApp {

    // document encoding; null means no encoding was specified
    private static String encoding = null;

    public static void main(String[] args) throws Exception {
        // initialise GATE - this must be done before calling any GATE APIs
        Gate.init();

        File[] files = getFilesFromDir("F:/Temp/articles");
        // File gappFile = new File("g:/ModifiedAnnie.gapp");
        File gappFile = new File("g:/annie_acquire_nocorpus.gapp");

        // load the saved application
        CorpusController application = (CorpusController) PersistenceManager
                .loadObjectFromFile(gappFile);

        // Create a Corpus to use. We recycle the same Corpus object for each
        // iteration. The string parameter to newCorpus() is simply the
        // GATE-internal name to use for the corpus. It has no particular
        // significance.
        Corpus corpus = Factory.newCorpus("BatchProcessApp Corpus");
        application.setCorpus(corpus);

        // process the files one by one, skipping everything that is not a .txt file
        for (int i = 0; i < files.length; i++) {
            if (!files[i].getName().endsWith(".txt"))
                continue;

            // load the document (using the specified encoding if one was given)
            File docFile = files[i];
            System.out.print("Processing document " + docFile + "...");
            // toURI().toURL() replaces the deprecated File.toURL()
            Document doc = Factory.newDocument(docFile.toURI().toURL(), encoding);

            // put the document in the corpus
            corpus.add(doc);

            // run the application
            application.execute();

            // remove the document from the corpus again
            corpus.clear();

            // we only extract annotations from the default (unnamed)
            // AnnotationSet in this example
            AnnotationSet defaultAnnots = doc.getAnnotations();
            System.out.println();

            // print only the annotations that carry a "relationType" feature
            for (Annotation ann : defaultAnnots) {
                FeatureMap map = ann.getFeatures();
                if (map.get("relationType") != null)
                    System.out.println("## " + map.get("relationType")
                            + " #CompanyA=" + map.get("companyA")
                            + " #CompanyB=" + map.get("companyB"));
            }

            // release the document so that it does not stay in memory
            Factory.deleteResource(doc);
            System.out.println("done");
        } // for each file

        System.out.println("All done");
    }

    // returns the files directly inside the given folder (no recursion)
    private static File[] getFilesFromDir(String path) {
        File dir = new File(path);
        File[] files = dir.listFiles();
        return files;
    }
}
You can view/download the source from code.google.com
Gate.init() should be called only once! To run this code you need to set the path to your gapp file and the location of your articles (folders are not scanned recursively). Note that we could load all documents into one corpus, but instead the code loads only one document into the corpus at a time, which makes better use of system resources. Also, this code sample displays only annotations that have a "relationType" feature. You should make it display everything.
You can see that the code instantiates the GATE application (a modified ANNIE) in the form of a "CorpusController" named "application". It is something like Java/.NET remoting, but you point it to a gapp file, which has all the meta information needed to construct the object/the application.
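The sample only looks at the files directly inside the articles folder. If your articles are spread over sub-folders, the file listing could be replaced with something like the sketch below. It is plain Java with no GATE calls, it assumes Java 8 or later (newer than the Java I used for the sample), and the class name RecursiveTxtFinder is made up.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RecursiveTxtFinder {

    /** Walks the folder tree under 'root' and returns all regular files ending in .txt. */
    static List<File> findTxtFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".txt"))
                    .sorted()
                    .map(Path::toFile)
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the articles folder as the first argument, or scan the current folder
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (File f : findTxtFiles(root)) {
            System.out.println(f);
        }
    }
}
```

The returned list can then be looped over in the same way as the File[] array in the sample.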
Conclusion:
Data extraction with GATE can be done; it just takes reading through the documentation and posting questions to the GATE mailing list.
Maybe it is a good idea to create a Linux (or FreeBSD, OpenSolaris, Nexenta) VMware image (or Xen, VirtualBox) which has GATE, Eclipse and GATE's samples installed and working properly.
Disclaimer:
This Java sample is based on this sample.
I am a GATE newbie, so for now do not expect that I will be able to answer your questions.
Credits:
Special thanks go to: everyone on the GATE mailing list, Marin Nozhchev (Ontotext), Stanislav Zlatinov.
Todo:
Explain the Jape code.
Add my articles.
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Gazetteers - KIM/GATE part 7
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0
Posted by Anton Andreev in Techno-talk at 13:02
Last modified on 2009-11-17 13:47