This post explains how to download and set up the GATE source code, which you need if you want to improve something in GATE itself.
SVN checkout: https://gate.svn.sourceforge.net/svnroot/gate
Open Eclipse in a new workspace (recommended). Use File->Import->Existing Projects into workspace->select your gate source dir->Finish.
Use the "Java Element Filters" to hide all "Non-Java elements" to make your project more compact.
Update:
My problems were caused by an error that prevented me from downloading all GATE files during the svn checkout. A filename that is allowed on Linux is not allowed on Windows: the file is called ".cow:no-iframe", and ":" is not a valid filename character on Windows. This halts the entire svn checkout and made me try all sorts of tweaks and patches. The GATE source is over 500 MB and 13 000 files, so make sure you have everything before you start fixing things the way I did. If you run into similar problems, try a checkout on Linux to see whether the issue is OS dependent. If you copy the source from Linux to Windows, check in Eclipse that Properties->Resource->Text File Encoding->Other is set to "UTF-8".
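To find such files before they break a checkout or a copy, a small check like the one below can help. It is only a sketch: the class and method names are my own, and the character list is the usual set that Windows rejects in file names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class WindowsNameCheck {
    // Characters that Windows does not allow in a file name.
    private static final String ILLEGAL = "<>:\"/\\|?*";

    /** Returns true if a single file name (not a path) contains a character Windows rejects. */
    static boolean isIllegalOnWindows(String fileName) {
        for (char c : fileName.toCharArray()) {
            // control characters (below 32) are rejected as well
            if (c < 32 || ILLEGAL.indexOf(c) >= 0) {
                return true;
            }
        }
        return false;
    }

    /** Keeps only the names that would be a problem on Windows. */
    static List<String> findProblemNames(List<String> fileNames) {
        return fileNames.stream()
                .filter(WindowsNameCheck::isIllegalOnWindows)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("build.xml", ".cow:no-iframe", "Main.java");
        System.out.println(findProblemNames(names)); // prints [.cow:no-iframe]
    }
}
```

Running it over the file names of a Linux checkout shows which entries will fail on Windows.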
I am interested in refactoring the code of the OrthoMatcher. This is the processing resource that links annotations that refer to the same entity. It is needed when a person or an organization is mentioned in different forms in the same document.
Entries tagged as eclipse
Thursday, December 17. 2009
Compiling GATE
Sunday, May 31. 2009
Strict Rules vs Machine Learning - KIM part 6
Summary:
There are generally two ways to recognize entities in text articles when using Ontotext KIM. Example entities are people, organizations and locations.
Both methods have their strengths and weaknesses. Things that cannot be detected by humans cannot be detected by computers either.
Using strict rules (better known as Knowledge Engineering)
These rules are written in a language similar to regular expressions. In this case it is Jape.
The more you customize the rules to detect what you need, the better the results you get.
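To give a feeling for what a strict rule looks like, here is a sketch in plain Java regular expressions. This is not Jape (Jape rules work on annotations, not on raw characters), and the pattern and the list of suffixes are my own assumptions: one or more capitalized words followed by a company suffix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompanyRule {
    // One or more capitalized words followed by a company suffix and a dot.
    private static final Pattern COMPANY =
            Pattern.compile("\\b((?:[A-Z][A-Za-z]*\\s+)+(?:Inc|Ltd|Corp)\\.)");

    /** Returns every piece of text that the rule considers a company name. */
    static List<String> findCompanies(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = COMPANY.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Shares of MicroZoftRR Inc. rose after the deal with Acme Ltd. was announced.";
        System.out.println(findCompanies(text)); // prints [MicroZoftRR Inc., Acme Ltd.]
    }
}
```

The rule also swallows a capitalized word that merely happens to stand in front of the name, for example at the start of a sentence. Fixing cases like that is exactly the customization work that strict rules require.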
Advantages:
- if you already have some rules available (e.g. for dates, money ...), it might be quicker to write the new rules you need and get the job done
- a smaller sample corpus might be required in some cases than when using machine learning
- in general, effectiveness is bound to the amount of effort you put into producing better rules
Weaknesses:
- in practice the rules might become quite complicated and hard to maintain. Imagine a 20 KB file that describes only one entity. You end up not reading the existing rules and modifying one of them, but rather appending the specific case that was missing to the end of the file, which increases both the length of the file and the total complexity of the rules. This is especially true when different people modify the rules.
Machine learning
In order to use machine learning you need a framework that implements several machine learning algorithms. You, as an expert, define the features that the framework takes into consideration when it processes the example data:
- consider the length of the word
- consider the capitalization of the word
- consider the capitalization of the previous word
- consider prefixes and suffixes
The idea is not to set the exact rules, but rather to make the framework build them itself from the specific parts of the text you told it to pay attention to. Then you need to supply the machine learning framework with enough example articles.
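The features listed above can be sketched as a small function that turns one word into a set of true/false values. This is only an illustration: the feature names, the length threshold and the chosen prefix and suffix are my own assumptions, not what KIM or any framework uses internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TokenFeatures {
    static boolean isCapitalized(String word) {
        return !word.isEmpty() && Character.isUpperCase(word.charAt(0));
    }

    /** Builds true/false features for the token at position i. */
    static Map<String, Boolean> extract(String[] tokens, int i) {
        String word = tokens[i];
        Map<String, Boolean> features = new LinkedHashMap<>();
        features.put("length>5", word.length() > 5);                   // length of the word
        features.put("isCapitalized", isCapitalized(word));            // capitalization of the word
        features.put("prevIsCapitalized",
                i > 0 && isCapitalized(tokens[i - 1]));                // capitalization of the previous word
        features.put("prefix=Mc", word.startsWith("Mc"));              // a hypothetical prefix feature
        features.put("suffix=son", word.endsWith("son"));              // a hypothetical suffix feature
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"Mr.", "Johnson", "resigned"};
        System.out.println(extract(tokens, 1));
    }
}
```

The learning algorithm then decides from the example articles which combinations of these values indicate, say, a person name. You only choose what it is allowed to look at.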
Advantages:
- it can give better results than strict rules
Keep in mind the diminishing returns: getting from 80% to 85% effectiveness often takes as much effort as reaching 80% in the first place.
Weaknesses:
- needs parameter and algorithm testing (that's actually not such a problem, it just needs some work hours)
- needs more example articles than strict rules, by a factor of 10 (my assumption)
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. Features in the case of MALLET are either true or false. MALLET provides several algorithms and each of them has its own configuration parameters. MALLET is implemented as a plug-in for KIM/GATE.
There is also a second machine learning framework called OpenNLP that is also implemented as a GATE plug-in in Ontotext KIM, and might soon be released as part of the standard KIM/GATE release.
Conclusions/final thoughts:
Both rule-based and machine learning are supported by the custom GATE pipeline for semantic annotation developed by Ontotext for the KIM platform.
In the short term it is better to use strict regular-expression-like rules (such as Jape), because they give you results almost immediately, but in the long term (from both a complexity and an effectiveness point of view) it is definitely better to use a machine learning system like MALLET. Of course a combination of the two should work best, where rules are applied first (some of them negative) and then machine learning is applied.
Another machine learning project is: edlin.sourceforge.net
This is a short introduction; contact Ontotext for more detailed information.
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Gazetteers - KIM/GATE part 7
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0
Posted by Anton Andreev in Techno-talk at 10:32 | Comments (0) | Trackbacks (0)
Last modified on 2009-11-26 14:34
Thursday, May 28. 2009
Using a Gate application - KIM part 4
Summary:
This short article shows you how to integrate a GATE module (application) into Ontotext KIM and consume it from your own Java product.
1. Configuration:
KIM provides an API through RMI on default port 1099. This page provides everything you need to configure RMI and KIM.
Eclipse->Build Path->Add external libraries:
kim-api.jar
sesame-1.2.7-ONTO.jar
2. It is recommended that you use GATE provided by the KIM distribution. Use "startKIMGate.bat" in \kim-platform-2.4-SNAPSHOT\bin.
You need to create a "Conditional Corpus Pipeline" application in GATE so that KIM can use it. ANNIE is not that type of GATE application, so you will get a type mismatch if you use ANNIE or a modified version of it. The trick is to create a new "Conditional Corpus Pipeline" application and add all of ANNIE's processing resources, plus your own, to it. Then make sure these resources are in the same order as they were in the ANNIE application! This problem has been fixed in version 3.0 and above, so you can now use ANNIE or a modified version of it from KIM.
3. Save your application to \kim-platform-2.4-SNAPSHOT\context\default\resources\mycondapp.gapp. To do that: right-click on a GATE application and select "Save application state".
4. Edit the file \kim-platform-2.4-SNAPSHOT\config\nerc.properties and modify the line:
com.ontotext.kim.KIMConstants.IE_APP=IE.gapp
to
com.ontotext.kim.KIMConstants.IE_APP=IE.gapp,mycondapp.gapp
Multiple applications are separated by commas.
5. Executing our GATE application from KIM:
import com.ontotext.kim.client.GetService;
import com.ontotext.kim.client.KIMService;
import com.ontotext.kim.client.semanticannotation.SemanticAnnotationAPI;

public class KIM {
    public static final String RMI_HOST = "localhost"; // not used
    public static final int RMI_PORT = 1099;           // not used

    public static void main(String[] args) {
        try {
            // connect to the KIM server with the default settings
            KIMService serviceKim = GetService.from();
            System.out.println("KIM Platform : " + serviceKim.getPlatformVersion());
            System.out.println("KIM Server : " + serviceKim.getServerVersion());
            System.out.println("KB Version : " + serviceKim.getKBVersion());

            // obtain the SemanticAnnotationAPI component for our GATE application
            SemanticAnnotationAPI apiSemAnn1 = serviceKim.getSemanticAnnotationAPI("mycondapp.gapp");

            String content =
                "Blair and Bush - are they doing the right thing for Iraq, America," +
                " Europe, the Earth... for civilization... " +
                "or just guided by their blinded eyes are in favor of the big corporations:" +
                "enter here new unrecognized corporations with a clue suffix:" +
                "MicroZoftRR Inc.";

            // run our GATE application on the text (the result is ignored in this sample)
            apiSemAnn1.execute(content);
        } catch (Exception ex) {
            System.out.println(ex.getMessage());
        }
        System.out.println("Done!");
    }
}
You can download this working sample from here.
Software versions: KIM 2.4, GATE 4.0 (integrated with KIM), Eclipse 3.2
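Going back to step 4: the value of IE_APP is just a comma-separated list of gapp file names. The sketch below shows how such a line can be read with plain java.util.Properties. It is my own illustration of the format (class and method names are made up) and says nothing about how KIM parses nerc.properties internally.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class NercConfig {
    static final String KEY = "com.ontotext.kim.KIMConstants.IE_APP";

    /** Parses properties text, e.g. the content of nerc.properties. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Splits the IE_APP value on commas and drops empty entries. */
    static List<String> applications(Properties props) {
        List<String> apps = new ArrayList<>();
        for (String name : props.getProperty(KEY, "").split(",")) {
            if (!name.trim().isEmpty()) {
                apps.add(name.trim());
            }
        }
        return apps;
    }

    public static void main(String[] args) {
        Properties props = parse(KEY + "=IE.gapp,mycondapp.gapp\n");
        System.out.println(applications(props)); // prints [IE.gapp, mycondapp.gapp]
    }
}
```

The name you pass to getSemanticAnnotationAPI in step 5 has to be one of the entries in this list.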
Posted by Anton Andreev in Techno-talk at 21:46 | Comments (2) | Trackbacks (0)
Last modified on 2009-11-17 13:47
Monday, May 18. 2009
GATE tutorial - KIM part 3
Summary:
This is a beginner's GATE tutorial.
GATE is a tool for Natural Language Processing (NLP). GATE helps you extract data from text articles, which you can turn into machine-readable knowledge. It provides a development IDE that helps you create and test an application. Once you are done, you can run your application from Java the same way you did from the IDE. GATE applications can be incorporated into Ontotext KIM.
First you should read the user guide. I am using GATE-5.0-beta1 build 3048 and Eclipse 3.4.2 (used in the Java sample) on Windows XP SP3.
Let's say we want to find the relation where company A acquires company B.
Gather some example articles and create a corpus. "Corpus" is just a funny name for a group of articles. My articles are here.
We are going to focus on English articles. GATE gives you the ability to create your own text processing applications. There may already be GATE applications that are good enough for general-purpose text processing. "Application" is a GATE term; we are not talking about applications in general. We could create an application from scratch, which is not that hard, but it is always better, and most of all easier, to improve upon something that exists.
We choose a GATE application called "ANNIE". In fact ANNIE is not just some application, it is an integral part of GATE itself. You should try to process some articles with ANNIE, look at the Annotation Sets and the Annotation List and get used to them. Keep in mind that ANNIE is primarily oriented towards English.
Next we need to add functionality to ANNIE. We could add a lot of different things, but we add a "Jape Transducer" which points to a file where we describe what should be detected in our articles. That file is a "Jape file". Don't worry about what Jape is right now. Next click on ANNIE. You will see some processing resources (on the right). The Jape transducer we have just created is such a processing resource, and we need to add it to the list on the right. These processing resources work at different levels and each can depend on the output of the others. That is the meaning of "pipeline" in GATE. So it is best to leave our new processing resource last (at the bottom) in the list.
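Why the order matters can be shown with a toy pipeline. This is not GATE code: the Stage interface and both stages are made up for the illustration. The second stage only works on what the first one has produced, just like our Jape transducer depends on the annotations created by the ANNIE resources above it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MiniPipeline {
    /** A processing step that reads existing annotations and may add new ones. */
    interface Stage {
        void process(String text, Map<String, List<String>> annotations);
    }

    // Splits the text on whitespace and stores the result as "Token".
    static final Stage TOKENISER = (text, ann) ->
            ann.put("Token", Arrays.asList(text.trim().split("\\s+")));

    // Depends on "Token": keeps only tokens that start with an upper-case letter.
    static final Stage CAPITALISED = (text, ann) -> {
        List<String> found = new ArrayList<>();
        for (String token : ann.getOrDefault("Token", Collections.<String>emptyList())) {
            if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                found.add(token);
            }
        }
        ann.put("Capitalised", found);
    };

    /** Runs the stages strictly in list order over one shared annotation map. */
    static Map<String, List<String>> run(String text, List<Stage> stages) {
        Map<String, List<String>> annotations = new LinkedHashMap<>();
        for (Stage stage : stages) {
            stage.process(text, annotations);
        }
        return annotations;
    }

    public static void main(String[] args) {
        String text = "Oracle acquires MySQL";
        // right order: prints [Oracle, MySQL]
        System.out.println(run(text, Arrays.asList(TOKENISER, CAPITALISED)).get("Capitalised"));
        // wrong order: the second stage runs before any tokens exist, prints []
        System.out.println(run(text, Arrays.asList(CAPITALISED, TOKENISER)).get("Capitalised"));
    }
}
```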
Yes, next is Jape. Jape is a language similar to regular expressions. We are going to use acquire.jape, which has two rules. I put it in \GATE-5.0-beta1\plugins\ANNIE\resources\NE\grammar\acquire.jape.
I have also modified the ANNIE Gazetteer in \GATE-5.0-beta1\plugins\ANNIE\resources\gazetteer\company.lst by adding two lines:
MySQL
MySql
to make sure MySql is recognized as a company.
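I added both spellings because I assume the lookup in the list is exact and case-sensitive. The sketch below illustrates that assumption with a plain set of strings. It is not the GATE gazetteer code, and it ignores any extra fields a real .lst line may carry.

```java
import java.util.HashSet;
import java.util.Set;

public class CompanyList {
    /** Parses the content of a list file: one entry per line, blank lines ignored. */
    static Set<String> parse(String content) {
        Set<String> entries = new HashSet<>();
        for (String line : content.split("\n")) {
            String entry = line.trim();
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    /** Exact, case-sensitive lookup. */
    static boolean isCompany(Set<String> entries, String candidate) {
        return entries.contains(candidate);
    }

    public static void main(String[] args) {
        Set<String> companies = parse("MySQL\nMySql\n");
        System.out.println(isCompany(companies, "MySQL")); // true
        System.out.println(isCompany(companies, "MySql")); // true
        System.out.println(isCompany(companies, "mysql")); // false, this spelling is not in the list
    }
}
```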
The whole idea is to make your enhanced ANNIE work by supplying a correct Jape grammar and testing it. Then you save your application to a file by right-clicking on the GATE application and selecting "Save application state".
1. You should save your application with the "gapp" extension (it is no problem if you do not).
2. It is better to remove the corpus from your application before saving, because otherwise the corpus becomes one more dependency of your application.
Gapp files are simply XML files that describe where everything you use in your GATE application is located. This means you can edit them. Why would you? You do not need to modify them as long as you keep using the gapp file from the location where you saved it and you do not move your GATE installation. If you move myapp.gapp from c:\gatetest to d:\work\gatetest, things will probably go wrong. Fixing the paths by hand is easy.
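If there are many paths to fix, a simple search and replace over the file content does the job. The sketch below assumes that the old location appears literally in the gapp file; the XML fragment in it is a made-up example and not copied from a real gapp file, so look at your own file first to see how the paths are written.

```java
public class GappRelocate {
    /** Replaces every literal occurrence of the old location with the new one. */
    static String relocate(String gappXml, String oldPrefix, String newPrefix) {
        return gappXml.replace(oldPrefix, newPrefix);
    }

    public static void main(String[] args) {
        // hypothetical fragment, only to show the replacement
        String fragment = "<urlString>file:/c:/gatetest/acquire.jape</urlString>";
        System.out.println(relocate(fragment, "file:/c:/gatetest", "file:/d:/work/gatetest"));
    }
}
```

Keep a backup of the original gapp file before you overwrite it with the changed content.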
Next we create a normal Java console application. We add all the jars in gate/lib. Check that you really added ALL the jars! I had problems because I had left some out.
And then what?
Thank you for asking. Then we use this Java code:
import gate.Annotation;
import gate.AnnotationSet;
import gate.Corpus;
import gate.CorpusController;
import gate.Document;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.util.persistence.PersistenceManager;
import java.io.File;

public class BatchProcessApp {
    // encoding of the input files; null means "let GATE use its default"
    private static String encoding = null;

    public static void main(String[] args) throws Exception {
        // initialise GATE - this must be done before calling any GATE APIs
        Gate.init();

        File[] files = getFilesFromDir("F:/Temp/articles");
        // File gappFile = new File("g:/ModifiedAnnie.gapp");
        File gappFile = new File("g:/annie_acquire_nocorpus.gapp");

        // load the saved application
        CorpusController application = (CorpusController) PersistenceManager
                .loadObjectFromFile(gappFile);

        // Create a Corpus to use. We recycle the same Corpus object for each
        // iteration. The string parameter to newCorpus() is simply the
        // GATE-internal name to use for the corpus. It has no particular
        // significance.
        Corpus corpus = Factory.newCorpus("BatchProcessApp Corpus");
        application.setCorpus(corpus);

        // process the files one by one
        for (int i = 0; i < files.length; i++) {
            if (!files[i].getName().endsWith(".txt"))
                continue;

            // load the document (using the specified encoding if one was given)
            File docFile = files[i]; // new File(args[i]);
            System.out.print("Processing document " + docFile + "...");
            // File.toURL() is deprecated, so go through toURI() first
            Document doc = (encoding == null)
                    ? Factory.newDocument(docFile.toURI().toURL())
                    : Factory.newDocument(docFile.toURI().toURL(), encoding);

            // put the document in the corpus
            corpus.add(doc);
            // run the application
            application.execute();
            // remove the document from the corpus again
            corpus.clear();

            // in this example we only extract annotations from the
            // default (unnamed) AnnotationSet
            AnnotationSet defaultAnnots = doc.getAnnotations();
            System.out.println();
            for (Annotation ann : defaultAnnots) {
                FeatureMap map = ann.getFeatures();
                // print only the annotations created by our acquire rules
                if (map.get("relationType") != null)
                    System.out.println("## " + map.get("relationType")
                            + " #CompanyA=" + map.get("companyA")
                            + " #CompanyB=" + map.get("companyB"));
            }

            // release the document so memory use stays flat across files
            Factory.deleteResource(doc);
            System.out.println("done");
        } // for each file
        System.out.println("All done");
    }

    private static File[] getFilesFromDir(String path) {
        File[] files = new File(path).listFiles();
        // listFiles() returns null if the path is not a readable directory
        return files != null ? files : new File[0];
    }
}
You can view/download the source from code.google.com
Gate.init() should be called only once! To run this code you need to set the path to your gapp file and the location of your articles (folders are not scanned recursively). Note that we could load all documents into one corpus, but the code keeps only one document in the corpus at a time, which helps to use system resources better. Also, this code sample displays only annotations that have a "relationType" feature. You should make it display everything.
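If your articles are spread over sub-folders, the file listing can be made recursive with the standard java.nio.file API. This is a sketch that is independent of GATE; the class and method names are my own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ArticleFinder {
    /** Returns all regular files ending in ".txt" below root, including sub-folders. */
    static List<Path> findTextFiles(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".txt"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // the folder to scan: first argument, or the current folder
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : findTextFiles(root)) {
            System.out.println(file);
        }
    }
}
```

Each returned Path can be turned into the File that the sample expects with toFile().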
You can see that the code instantiates the GATE application (a modified ANNIE) in the form of a "CorpusController" named "application". It is something like Java/.NET remoting, but you point it to a gapp file, which holds all the meta information needed to construct the object (the application).
Conclusion:
Data extraction with GATE can be done; it just takes reading through the documentation and posting questions to the GATE mailing list.
Maybe it is a good idea to create a Linux (or FreeBSD, OpenSolaris, Nexenta) VMware image (or Xen, VirtualBox) that has GATE, Eclipse and GATE's samples installed and working properly.
Disclaimer:
This Java sample is based on this sample.
I am a GATE newbie, so for now do not expect me to be able to answer your questions.
Credits:
Special thanks go to: everyone from the GATE mailing list, Marin Nozhchev (Ontotext), Stanislav Zlatinov.
Todo:
Explain the Jape code.
Add my articles.
Posted by Anton Andreev in Techno-talk at 13:02 | Comment (1) | Trackbacks (0)
Last modified on 2009-11-17 13:47