The "Corpus Quality Assurance" tool and Annotation Transfer processing resource make it easy to compare the results of two pipelines.
You won't find this tool in the main GATE "Tools" menu; it appears at the bottom of the GATE interface when you click on a corpus. The tool has been available since GATE 5.1 (I think). To use the Corpus Quality Assurance tool we need two annotation sets to compare.
1. Create a datastore
Create a corpus. Then right-click it, use the "Save to datastore" option and point it to your datastore. Next populate the corpus and note that the documents do not appear as usual under the "Language resources" tab in your corpus, but in the datastore. Datastores are quite useful because they do not keep all the documents in memory. This way you can annotate a large number of documents before you encounter a problem in GATE or Java.
2. Create Annotation Set 1
Warning: Check which version of GATE you are using and which plug-ins are loaded. GATE can load plug-ins from other GATE versions and instances that you have installed.
You might, for example, want to test against two versions of GATE. But the second GATE will probably load the same list of plug-ins as the one you have just closed. Yes, the core of the GATE framework will be different, but because you load the same plug-ins, you will probably get the same results. So you might end up comparing identical results, or wondering why the results are not the ones you expect. Go to File->Manage CREOLE Plugins and select your resources explicitly, checking the paths to your plug-ins.
Add a new "Annotation Set Transfer" processing resource to the end of your pipeline:
inputASName: should be blank to use the default annotation set.
outputASName: will be in our case "Annotation Set 1".
For more information see the documentation.
Now run the pipeline. You should find the new annotation set below the default one. If needed, collapse the default one to see the new one below it.
This new annotation set "Annotation Set 1" has been saved in the datastore automatically.
3. Make changes
Replace plug-in versions or load another instance of GATE. Please check the plug-in versions by inspecting their paths in File->Manage CREOLE Plugins.
4. Create Annotation Set 2
Open the datastore created in step 1 (if it is not already available).
Open the corpus contained in that datastore (if it is not already available) by clicking the datastore, scrolling down until you find it, then right-clicking and selecting "Load". It should now be available in its usual place under "Language resources".
Check if "Annotation Set 1" is available.
Create "Annotation Set 2" by again adding an "Annotation Set Transfer" processing resource and setting the correct parameters. Set the "setsToKeep" parameter of the "Document Reset PR" to "Annotation Set 1" in order to preserve the results in "Annotation Set 1".
Warning: If you get a strange exception, it could be due to the Annotation Set Transfer resource. The version in GATE 5.2 trunk requires you to fill in the "annotationTypes" list, although the specification does not mark this parameter as required.
Run the pipeline!
Check if "Annotation Set 2" is available.
5. Inspect
Click on the corpus and then on "Corpus Quality Assurance" at the bottom. Wait until it initializes; there is a progress bar on the right. This initialization is needed so that the tool knows what your annotation types and features are and can fill the lists on the right.
Select "Annotation Set 1" and "Annotation Set 2" as A and B in the first list. Then select the annotation type that you would like to compare. I suggest that you select a single annotation type per comparison. Then select the features by which you think two annotations are identical. Select measures and then press the "Compare" button.
Warning: The feature "matches" contains the IDs of the annotations that refer to the same entity. A problem arises because these IDs differ between "Annotation Set 1" and "Annotation Set 2", since GATE keeps assigning ever larger IDs. If you check the document you will see that the annotations refer to the same entities and that the IDs differ only by a constant, but they are still different. We need to solve this, because a difference in "matches" can also be a real problem and not simply a difference in IDs. One way out is not to use "matches", but another feature that, for example, contains the start and end offsets of an annotation in the document. These should match when the same document is processed and the encoding is set correctly.
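The idea of comparing by position rather than by ID can be sketched in plain Java. This is only an illustration and does not use the GATE API: the class `SpanDiff`, the key format and the sample offsets are all made up for the example.

```java
import java.util.Set;
import java.util.TreeSet;

public class SpanDiff {
    // Builds a comparison key from the annotation type and its offsets,
    // deliberately ignoring the annotation ID (hypothetical key format).
    static String key(String type, long start, long end) {
        return type + ":" + start + "-" + end;
    }

    // Returns the keys that are present in 'left' but missing from 'right'.
    static Set<String> onlyIn(Set<String> left, Set<String> right) {
        Set<String> result = new TreeSet<>(left);
        result.removeAll(right);
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for two annotation sets
        Set<String> a = new TreeSet<>();
        a.add(key("Person", 0, 5));
        a.add(key("Organization", 20, 29));
        Set<String> b = new TreeSet<>();
        b.add(key("Person", 0, 5));
        b.add(key("Location", 40, 46));

        System.out.println("Only A: " + onlyIn(a, b));
        System.out.println("Only B: " + onlyIn(b, a));
    }
}
```

Two annotations with the same type and offsets produce the same key no matter which IDs GATE assigned to them, so only real differences remain.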
Click on "Document statistics". Then click on a document where "Only A", "Only B" or "Overlap" is greater than 0. This means we have a difference. To see the actual mismatch, click on the document and then on the "Annotation Diff" button, which is currently the second one at the top right of the GATE UI.
Keep in mind that there might be some bugs, such as this one.
Friday, March 12. 2010
GATE Corpus Quality Assurance Tool
Friday, February 12. 2010
Running GATE on OpenSolaris
This post simply shows that the GATE 5.1 NLP framework runs on OpenSolaris 2009.06 without any hassle. OpenSolaris uses bash as its default shell.
1. Download the GATE archive. Mine is "gate-5.1-build3431-BIN.zip".
2. Unzip the archive:
# unzip gate-5.1-build3431-BIN.zip
3. Install SUNWj6dev in order to have the Sun JDK 1.6 and not only the JRE.
# pkg install SUNWj6dev
You can learn more about the pkg - OpenSolaris Image Packaging System from here.
4. Issue the following commands in bash:
# export JAVA_HOME=/usr/jdk/instances/jdk1.6.0
# echo $JAVA_HOME
Make sure that your JDK is really in /usr/jdk/instances/jdk1.6.0
5. Go to the bin folder of GATE and simply type:
# ./gate.sh
Wednesday, February 10. 2010
Re-locating/porting a GATE application
The following article will give you basic knowledge on how to move a GATE application from one GATE instance to another.
Each GATE application is stored as a .gapp file.
For example, KIM's default pipeline is stored in the "IE.gapp" file. A gapp file is in fact an XML file.
1. Gapp file considerations
1.1 Best place to save a gapp file:
Usually this should be somewhere in the tree of a GATE/KIM deployment. When you save your GATE application (also called a pipeline), the resulting gapp file is generated with paths relative to your GATE/KIM installation.
1.2 File-systems paths:
Usually a path in a gapp file looks like this:
$relpath$../../../C:/Program%20Files/GATE-5.0/plugins/Ontology_Tools
It is not recommended to use absolute paths, but if you do then you need to use the following syntax:
file:///G:/Work/gate-binary/gate-5.1-stable-3431/plugins/Ontology_Tools
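To make the difference between the two path styles concrete, here is a minimal sketch of how such entries could be turned into full URIs. It assumes that a `$relpath$` entry is resolved against the location of the gapp file itself; the class `RelPathDemo` and the `/opt/gate/...` locations are hypothetical and not part of GATE.

```java
import java.net.URI;

public class RelPathDemo {
    static final String MARKER = "$relpath$";

    // Resolves a path entry from a gapp file. Entries starting with the
    // marker are treated as relative to the gapp file (an assumption of this
    // sketch); everything else is taken as an absolute URI.
    static URI resolve(URI gappFile, String entry) {
        if (entry.startsWith(MARKER)) {
            return gappFile.resolve(entry.substring(MARKER.length()));
        }
        return URI.create(entry);
    }

    public static void main(String[] args) {
        // Hypothetical location of a saved application
        URI gapp = URI.create("file:/opt/gate/apps/myapp.gapp");
        System.out.println(resolve(gapp, "$relpath$../plugins/Ontology_Tools"));
        System.out.println(resolve(gapp, "file:///G:/Work/plugins/Ontology_Tools"));
    }
}
```

The relative entry follows the gapp file wherever it is moved, while the absolute one always points at the same place. That is why relative paths are easier to port.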
1.3 Rename components:
The version of GATE on the destination machine might be different. Many plug-in names were changed from GATE 5.0 to 5.1, so you might find yourself porting an application from an older version of GATE to a newer one. For example, the plug-in "NP_Chunking" was renamed to "Tagger_NP_Chunking".
So you need to change this line from:
$relpath$../../../C:/Program%20Files/GATE-5.0/plugins/NP_Chunking
to:
$relpath$../../../C:/Program%20Files/GATE-5.0/plugins/Tagger_NP_Chunking
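If many paths need the same change, the rename can be scripted. The following is a rough sketch that works on the gapp content as plain text; the class `RenamePlugin` is hypothetical, and it assumes that plug-in directories always appear right after a "/plugins/" segment, as in the paths shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RenamePlugin {
    // Replaces a plug-in directory name, but only directly after "/plugins/"
    // and only when the name is not part of a longer identifier. This keeps
    // an already renamed "Tagger_NP_Chunking" from being renamed again.
    static String rename(String gappText, String oldName, String newName) {
        String regex = "(?<=/plugins/)" + Pattern.quote(oldName) + "(?![A-Za-z0-9_])";
        return gappText.replaceAll(regex, Matcher.quoteReplacement(newName));
    }

    public static void main(String[] args) {
        String line = "$relpath$../../../C:/Program%20Files/GATE-5.0/plugins/NP_Chunking";
        String renamed = rename(line, "NP_Chunking", "Tagger_NP_Chunking");
        System.out.println(renamed);
        // Running the rename a second time leaves the line unchanged
        System.out.println(rename(renamed, "NP_Chunking", "Tagger_NP_Chunking"));
    }
}
```

Keep a backup of the original gapp file before applying a scripted change like this.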
1.4 Remove components:
Sometimes you just need to remove components that you do not need. The following block describes a corpus stored in a datastore:
<corpus class="gate.util.persistence.CorpusPersistence">
<dsData class="gate.util.persistence.DSPersistence">
<className>gate.persist.LuceneDataStoreImpl</className>
<storageUrlString>file:/D:/Ontologies/DataStore1/</storageUrlString>
</dsData>
<persistenceID class="string">Corpus for CP-HU___1255364292687___8444</persistenceID>
<resourceType>gate.corpora.SerialCorpusImpl</resourceType>
<resourceName>Corpus for CP-HU</resourceName>
<initParams class="gate.util.persistence.MapPersistence">
<mapType>gate.util.SimpleFeatureMapImpl</mapType>
<localMap/>
</initParams>
</corpus>
This block can be safely removed altogether.
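Removing such a block by hand is fine for one file. For several files, a small sketch like the one below could do it with the XML parser that ships with Java. The class `StripCorpus` is hypothetical, and the surrounding `<app>` and `<prList/>` elements in the sample are placeholders, not the real gapp structure.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class StripCorpus {
    // Returns the XML with every <corpus> element removed.
    static String stripCorpus(String gappXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(gappXml)));
            NodeList corpora = doc.getElementsByTagName("corpus");
            // The NodeList is live, so walk it backwards while removing nodes
            for (int i = corpora.getLength() - 1; i >= 0; i--) {
                Node corpus = corpora.item(i);
                corpus.getParentNode().removeChild(corpus);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process gapp XML", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder document: only the <corpus> element mirrors the block above
        String xml = "<app><corpus class=\"gate.util.persistence.CorpusPersistence\">"
                + "<resourceName>Corpus for CP-HU</resourceName></corpus><prList/></app>";
        System.out.println(stripCorpus(xml));
    }
}
```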
2. Other resources:
Other resources include for example your Jape files. If you are using a standard GATE gazetteer then your gapp file refers to a def file which you also need to relocate. If you are using an OntoGazetteer then again your gapp file will refer to an ontology file and a map file that you need to package too.
Thursday, December 17. 2009
Compiling GATE
This post explains some steps of how to download and set up the GATE source code. This is needed when you would like to improve something.
SVN checkout: https://gate.svn.sourceforge.net/svnroot/gate
Open Eclipse in a new workspace (recommended). Use File->Import->Existing Projects into workspace->select your gate source dir->Finish.
Use the "Java Element Filters" to hide all "Non-Java elements" to make your project more compact.
Update:
My problems were caused by an error that prevented me from downloading all GATE files during the svn checkout. A file name that is allowed on Linux is not allowed on Windows: the name was ".cow:no-iframe", and ":" is not allowed in Windows file names. This halts the entire svn checkout and made me try all sorts of tweaks and patches. The GATE source is over 500 MB and 13 000 files, so make sure you have everything before you start fixing things the way I did. If you run into similar problems, try a checkout on Linux to see whether the issue is OS-dependent. If you copy the source from Linux to Windows, check in Eclipse that Properties->Resource->Text File Encoding->other is set to "UTF8".
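A quick way to spot such names before copying a source tree to Windows is to check each file name against the characters Windows forbids. The sketch below is a hypothetical helper, not part of GATE or svn; the sample names other than ".cow:no-iframe" are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WindowsNameCheck {
    // Characters that Windows does not allow in file names
    static final String ILLEGAL = "<>:\"/\\|?*";

    // True if the name contains no forbidden or control characters.
    static boolean isWindowsSafe(String fileName) {
        for (char c : fileName.toCharArray()) {
            if (c < 32 || ILLEGAL.indexOf(c) >= 0) {
                return false;
            }
        }
        return true;
    }

    // Collects the names that would fail on Windows.
    static List<String> unsafeNames(List<String> names) {
        List<String> unsafe = new ArrayList<>();
        for (String name : names) {
            if (!isWindowsSafe(name)) {
                unsafe.add(name);
            }
        }
        return unsafe;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("build.xml", ".cow:no-iframe", "Main.java");
        System.out.println(unsafeNames(names));
    }
}
```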
I am interested in refactoring the code of the Orthomatcher. This is the processing resource that links annotations that refer to the same entity. This is needed when a person or an organization is mentioned in different forms in the same document.
Monday, November 2. 2009
Gazetteers - KIM/GATE part 7
Gazetteers are the processing resources in GATE that use a dictionary (or any other data source) to annotate text. This is the simplest and most intuitive way to use predefined knowledge to annotate documents.
The GATE user interface for gazetteers (which is also used in KIM) is not that easy for beginners (at least for me).
Demo 1
Demo 1 shows how to create and use a gazetteer from the user interface. A "def" file contains a list of "lst" files, and every "lst" file in turn contains items, one per row. It is a three-level hierarchy. The demo shows this organization by creating a new "def" file called "MyGazetter.def". Then we create two "lst" files. After creating the first one (people.lst), we add it to the "def" file by using the "insert menu" on the left.
Remarks:
1. Do not forget to save your "lst" and "def" files! I forgot to do exactly that the first time I pressed "Run Application", so I went back, made sure everything was saved and then reran the application.
2. When first creating the gazetteer, we cannot create a new "def" file by setting "listsURL" to a non-existent "def" file, as this triggers an error. We do that later by clicking the "New" button in the "Linear Definition" panel.
3. In the gazetteer options you can specify that, if several annotations overlap, only the longest one counts!
4. When we created the "animals.lst" file and started entering values, there was no indication anywhere in the interface of which file we were working on.
5. Also keep in mind that by replacing the ANNIE gazetteer with our own instance, we have to put the new one at the same position as the old one in the pipeline. It was third place. The gazetteer uses some results from previous steps in the pipeline and other GATE processing resources expect results from the gazetteer, so position really matters!
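The three-level hierarchy from Demo 1 (a "def" file lists "lst" files, and each "lst" file lists entries) can be modelled in a few lines of Java. This is a toy model, not the GATE gazetteer: it assumes each "def" line has the form "file.lst:majorType", and the list entries are invented for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GazetteerModel {
    // Parses def lines of the assumed form "file.lst:majorType"
    // into a map from list file name to major type.
    static Map<String, String> parseDef(List<String> defLines) {
        Map<String, String> lists = new LinkedHashMap<>();
        for (String line : defLines) {
            String[] parts = line.trim().split(":");
            if (parts.length >= 2) {
                lists.put(parts[0], parts[1]);
            }
        }
        return lists;
    }

    // Returns the major type of the first list containing the text, or null.
    static String lookup(Map<String, String> def,
                         Map<String, List<String>> lstFiles, String text) {
        for (Map.Entry<String, String> entry : def.entrySet()) {
            List<String> items = lstFiles.get(entry.getKey());
            if (items != null && items.contains(text)) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Level 1: the def file; level 2: the lst files; level 3: their entries
        Map<String, String> def = parseDef(Arrays.asList("people.lst:person", "animals.lst:animal"));
        Map<String, List<String>> lstFiles = new HashMap<>();
        lstFiles.put("people.lst", Arrays.asList("Anton", "Maria"));
        lstFiles.put("animals.lst", Arrays.asList("cat", "dog"));

        System.out.println(lookup(def, lstFiles, "cat"));
        System.out.println(lookup(def, lstFiles, "Anton"));
    }
}
```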
Demo 2
In this demo:
- we see the use of Morphological Analyzer to get the root of a word
- and the use of Flexible Gazetteer to annotate all forms of a word
When we run the standard ANNIE gazetteer we only match "city" and not "cities" in the lookup annotations. We need to add the CREOLE plug-in directory "Tools" to enable both the Morphological Analyzer and the Flexible Gazetteer. Next we add the Morphological Analyzer, which gives every token an additional feature, "Token.root", containing the root of the word; in our case this is "city". The Flexible Gazetteer (FG) does not work the usual way. It does not process the document text directly, but works on a selected feature of a selected annotation. To create an FG we need to select a gazetteer (this is a required parameter), so we choose the ANNIE gazetteer and have it process the "Token.root" feature. With the FG, the standard ANNIE gazetteer sees "city" instead of "cities". And because the ANNIE gazetteer recognizes "city" as a location by default, "cities" is now annotated as a location too.
Demo 2 shows the result with only the ANNIE gazetteer, and the result of the Morphological Analyzer and the Flexible Gazetteer working together.
You may want to see the result produced only by the FG. To do that you need to give a name to the "outputAnnotationSetName" option in the FG, which is available when you click on the ANNIE application and then click on the FG processing resource on the right.
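The principle behind Demo 2 fits in a tiny Java sketch: look up the root of a token instead of the token itself. The two maps below are toy stand-ins for the Morphological Analyzer and the ANNIE gazetteer, and the class `RootLookupDemo` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class RootLookupDemo {
    // Toy stand-in for the Morphological Analyzer: word form -> root
    static final Map<String, String> ROOTS = new HashMap<>();
    // Toy stand-in for the gazetteer: entry -> major type
    static final Map<String, String> GAZETTEER = new HashMap<>();

    static {
        ROOTS.put("cities", "city");
        GAZETTEER.put("city", "location");
    }

    // Returns the root of a token, or the token itself if no root is known.
    static String root(String token) {
        return ROOTS.getOrDefault(token, token);
    }

    // Plain gazetteer behaviour: match the token exactly as written.
    static String plainLookup(String token) {
        return GAZETTEER.get(token);
    }

    // Flexible behaviour: match the root of the token instead.
    static String flexibleLookup(String token) {
        return GAZETTEER.get(root(token));
    }

    public static void main(String[] args) {
        System.out.println(plainLookup("cities"));    // no match, prints null
        System.out.println(flexibleLookup("cities")); // prints location
    }
}
```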
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Gazetteers - KIM/GATE part 7
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0
Thursday, May 28. 2009
Using a Gate application - KIM part 4
Summary:
This short article shows you how to integrate a GATE module (application) in Ontotext KIM and consume it from your own Java product.
1. Configuration:
KIM provides an API through RMI on default port 1099. This page provides everything you need to configure RMI and KIM.
Eclipse->Build Path->Add external libraries:
kim-api.jar
sesame-1.2.7-ONTO.jar
2. It is recommended that you use GATE provided by the KIM distribution. Use "startKIMGate.bat" in \kim-platform-2.4-SNAPSHOT\bin.
You need to create a "Conditional Corpus Pipeline" application in GATE so that KIM can use it successfully. ANNIE is not that type of GATE application, so you will get a type mismatch if you use ANNIE or a modified version of it. The trick is to create a new "Conditional Corpus Pipeline" application and add all of ANNIE's processing resources, plus your own, to it. Then you need to make sure these resources are in the same order as they were in the ANNIE application! This problem has been fixed for version 3.0 and above, so you can now use ANNIE or a modified version of it from KIM.
3. Save your application to \kim-platform-2.4-SNAPSHOT\context\default\resources\mycondapp.gapp. To do that: right-click on a GATE application and select "Save application state".
4. Edit the file \kim-platform-2.4-SNAPSHOT\config\nerc.properties and modify the line:
com.ontotext.kim.KIMConstants.IE_APP=IE.gapp
to
com.ontotext.kim.KIMConstants.IE_APP=IE.gapp,mycondapp.gapp
Applications are separated by commas.
5. Executing our GATE application from KIM:
import com.ontotext.*;
import com.ontotext.kim.client.GetService;
import com.ontotext.kim.client.KIMService;
import com.ontotext.kim.client.semanticannotation.SemanticAnnotationAPI;
public class KIM {
public static final String RMI_HOST = "localhost";//not used
public static final int RMI_PORT = 1099; //not used
public static void main(String[] args) {
try
{
KIMService serviceKim = GetService.from();
System.out.println("KIM Platform : " + serviceKim.getPlatformVersion());
System.out.println("KIM Server : " + serviceKim.getServerVersion());
System.out.println("KB Version : " + serviceKim.getKBVersion());
// obtain the SemanticAnnotationAPI component for our application
SemanticAnnotationAPI apiSemAnn1 = serviceKim.getSemanticAnnotationAPI("mycondapp.gapp");
String content =
"Blair and Bush ? are they doing the right thing for Iraq, America," +
" Europe, the Earth... for civilization... " +
"or just guided by their blinded eyes are in favor of the big coporations:" +
"enter here new unrecognized corporations with a clue suffix:" +
"MicroZoftRR Inc.";
apiSemAnn1.execute(content);
}
catch(Exception ex)
{
System.out.println(ex.getMessage());
}
System.out.println("Done!");
}
}
You can download this working sample from here.
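To double-check the edit from step 4, the comma-separated value can be read back with standard Java. This sketch only illustrates the format of the property; it is not how KIM itself reads the file, and the class `NercApps` is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class NercApps {
    static final String KEY = "com.ontotext.kim.KIMConstants.IE_APP";

    // Returns the application names listed in the IE_APP property.
    static List<String> applications(String propertiesText) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> apps = new ArrayList<>();
        for (String name : properties.getProperty(KEY, "").split(",")) {
            if (!name.trim().isEmpty()) {
                apps.add(name.trim());
            }
        }
        return apps;
    }

    public static void main(String[] args) {
        String text = "com.ontotext.kim.KIMConstants.IE_APP=IE.gapp,mycondapp.gapp";
        System.out.println(applications(text));
    }
}
```

The name passed to getSemanticAnnotationAPI in the sample above ("mycondapp.gapp") should be one of the names in this list.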
Software versions: KIM 2.4, GATE 4.0 (integrated with KIM), Eclipse 3.2
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0
Posted by Anton Andreev in Techno-talk at 21:46. Last modified on 2009-11-17 13:47.
Monday, May 18. 2009
GATE tutorial - KIM part 3
Summary:
This is a beginner's GATE tutorial.
GATE is a tool for Natural Language Processing (NLP). GATE helps you extract data from text articles, which you can turn into knowledge a computer can use. It provides a development IDE that helps you create and test an application. Once you are done, you can have your application executed from Java the same way you did from the IDE. GATE applications can be incorporated in Ontotext KIM.
First you should read the user guide. I am using GATE-5.0-beta1 build 3048 and Eclipse 3.4.2 (used in the Java sample) on Windows XP SP3.
Let's say we want to find the relation in which company A acquires company B.
Gather some example articles. Create a corpus. "Corpus" is just a funny name for a group of articles. My articles are here.
We are going to focus on English articles. GATE gives you the ability to create your own text processing applications. There may already be GATE applications that are good enough for general-purpose text processing. "Application" here is a GATE term; we are not talking about applications in general. We could create an application from scratch, which is not that hard, but it is always better, and above all easier, to improve upon something that exists.
We choose a GATE application called "ANNIE". In fact, ANNIE is not just any application; it is an integral part of GATE itself. You should try to process some articles with ANNIE, look at the Annotation Sets and the Annotation List and get used to them. Keep in mind that ANNIE is primarily oriented towards English.
Next we need to add functionality to ANNIE. We could add a lot of different things, but we add a "Jape Transducer" that points to a file describing what should be detected in our articles. That file is a "Jape file". Don't worry about what Jape is right now. Next click on ANNIE. You will see some processing resources (on the right). The Jape transducer we've just created is such a processing resource, and we need to add it to the list on the right. These processing resources work at different levels, and each can depend on the output of the others. That's the meaning of "pipe" in GATE. So it is best to put our new processing resource last (at the bottom) in the list.
Yes, next is Jape. Jape is a language similar to regular expressions. We are going to use acquire.jape. It has two rules. I put it in \GATE-5.0-beta1\plugins\ANNIE\resources\NE\grammar\acquire.jape.
I have also modified ANNIE Gazetteer in \GATE-5.0-beta1\plugins\ANNIE\resources\gazetteer\company.lst by adding two lines:
MySQL
MySql
to make sure MySql is recognized as a company.
The whole idea is to make your enhanced ANNIE work by supplying a correct Jape grammar and testing it. Then you save your application to a file by right-clicking on the GATE application and selecting "Save application state".
1. You should save your application with the "gapp" extension (it is no problem if you do not).
2. It is better to remove the corpus from your application before saving, because that corpus would become one more dependency of your application.
Gapp files are simply XML files that describe where everything used by your GATE application is located. This means you can edit them. Why would you? You won't need to modify them as long as you keep using the gapp file from the location where you saved it and you do not move your GATE installation. If you move myapp.gapp from c:\gatetest to d:\work\gatetest, things will probably go wrong. Modifying the paths is easy.
Next we create a normal Java console application. We add all the jars in gate/lib. Make sure you add ALL the jars! I had problems because I had left some out.
And then what?
Thank you for asking. Then we use this Java code:
import gate.Annotation;
import gate.AnnotationSet;
import gate.Corpus;
import gate.CorpusController;
import gate.Document;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.util.persistence.PersistenceManager;
import java.io.File;

public class BatchProcessApp {

    // document encoding; null means the platform default is used
    private static String encoding = null;

    public static void main(String[] args) throws Exception {
        // initialise GATE - this must be done before calling any GATE APIs
        Gate.init();
        File[] files = getFilesFromDir("F:/Temp/articles");
        // File gappFile = new File("g:/ModifiedAnnie.gapp");
        File gappFile = new File("g:/annie_acquire_nocorpus.gapp");
        // load the saved application
        CorpusController application = (CorpusController) PersistenceManager
                .loadObjectFromFile(gappFile);
        // Create a Corpus to use. We recycle the same Corpus object for each
        // iteration. The string parameter to newCorpus() is simply the
        // GATE-internal name to use for the corpus. It has no particular
        // significance.
        Corpus corpus = Factory.newCorpus("BatchProcessApp Corpus");
        application.setCorpus(corpus);
        // process the files one by one
        for (int i = 0; i < files.length; i++) {
            // skip anything that is not a .txt file
            if (!files[i].getName().endsWith(".txt"))
                continue;
            // load the document (using the specified encoding if one was given)
            File docFile = files[i];
            System.out.print("Processing document " + docFile + "...");
            // File.toURL() is deprecated, so go through toURI() first
            Document doc = Factory.newDocument(docFile.toURI().toURL(), encoding);
            // put the document in the corpus
            corpus.add(doc);
            // run the application
            application.execute();
            // remove the document from the corpus again
            corpus.clear();
            // we only extract annotations from the default (unnamed)
            // AnnotationSet in this example
            AnnotationSet defaultAnnots = doc.getAnnotations();
            System.out.println();
            for (Annotation ann : defaultAnnots) {
                FeatureMap map = ann.getFeatures();
                // print only annotations that carry a "relationType" feature
                if (map.get("relationType") != null)
                    System.out.println("## " + map.get("relationType")
                            + " #CompanyA=" + map.get("companyA")
                            + " #CompanyB=" + map.get("companyB"));
            }
            // release the document so it does not stay in memory
            Factory.deleteResource(doc);
            System.out.println("done");
        } // for each file
        System.out.println("All done");
    }

    private static File[] getFilesFromDir(String path) {
        File dir = new File(path);
        File[] files = dir.listFiles();
        // listFiles() returns null if the path is not a readable directory
        return files != null ? files : new File[0];
    }
}
You can view/download the source from code.google.com
Gate.init() should be called only once! To run this code you need to set the path to your gapp file and the location of your articles (folders are not scanned recursively). Note that we could load all documents into one corpus, but the code instead keeps only one document in the corpus at a time, which uses system resources better. Also, this code sample displays only annotations that have a "relationType" feature. You should change it to display everything.
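The sample only looks at files directly inside the articles folder. If your articles are spread over sub-folders, a helper along these lines could collect them first. This is just a sketch under my own assumptions: the class RecursiveFileLister and the method listTextFiles are made-up names, it needs Java 8 or newer (for Files.walk), and it uses only the standard library, so nothing GATE-specific is involved.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: finds all .txt files below a folder, including sub-folders.
final class RecursiveFileLister {

    static List<File> listTextFiles(String root) throws IOException {
        // Files.walk visits the root folder and everything below it;
        // the stream must be closed, hence the try-with-resources.
        try (Stream<Path> paths = Files.walk(Paths.get(root))) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".txt"))
                    .sorted() // stable processing order between runs
                    .map(Path::toFile)
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        String root = args.length > 0 ? args[0] : ".";
        for (File f : listTextFiles(root)) {
            System.out.println(f);
        }
    }
}
```

The returned list could then be looped over in place of the File[] array in the batch code, one document at a time as before.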
You can see that the code instantiates the GATE application (a modified ANNIE) as a "CorpusController" named "application". It is something like Java/.NET remoting, but you point it to a gapp file, which has all the meta information needed to construct the object (the application).
Conclusion:
Data extraction with GATE can be done; it just takes reading through the documentation and posting questions to the GATE mailing list.
Maybe it is a good idea to create a Linux (or FreeBSD, OpenSolaris, Nexenta) VMware image (or Xen, VirtualBox) which has GATE, Eclipse and GATE's samples installed and working properly.
Disclaimer:
This Java sample is based on this sample.
I am a GATE newbie, so for now do not expect me to be able to answer your questions.
Credits:
Special thanks go to: everyone from the GATE mailing list, Marin Nozhchev (Ontotext), Stanislav Zlatinov.
Todo:
Explain the Jape code.
Add my articles.
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Gazetteers - KIM/GATE part 7
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0
Posted by Anton Andreev in Techno-talk at 13:02
Last modified on 2009-11-17 13:47