This is a beginner's GATE tutorial.
GATE is a tool for Natural Language Processing (NLP). It helps you extract data from text articles, which you can then turn into machine-readable knowledge. It provides a development IDE that helps you create and test an application. Once you are done, you can run your application from Java the same way you did from the IDE. GATE applications can be incorporated in Ontotext KIM.
First you should read the user guide. I am using GATE-5.0-beta1 build 3048 and Eclipse 3.4.2 (used in the Java sample) on Windows XP SP3.
Let's say we want to find the relation where company A acquires company B.
Gather some example articles. Create a corpus. "Corpus" is just a funny name for a group of articles. My articles are here.
We are going to focus on English articles. GATE gives you the ability to create your own text processing applications. Maybe there are already GATE applications that are good and can be used for general-purpose text processing. ("Application" here is a GATE term; we are not talking about applications in general.) The point is that we could create an application from scratch, and it is not that hard, but it is always better, and most of all easier, to improve upon something that already exists.
We choose a GATE application called "ANNIE". In fact ANNIE is not just some application; it is an integral part of GATE itself. You should try to process some articles with ANNIE, look at the Annotation Sets and the Annotation List, and get used to them. Keep in mind that ANNIE is primarily oriented toward English.
Next we need to add functionality to ANNIE. We could add a lot of different things, but we add a "Jape Transducer" that points to a file where we describe what should be detected in our articles. That file is a "Jape file". Don't worry about what Jape is right now. Next, click on ANNIE. You will see some processing resources (on the right). The Jape transducer we've just created is such a processing resource, and we need to add it to the list on the right. You need to know that these processing resources work at different levels and each can depend on the output of the others. That's the meaning of "pipe" in GATE. So it is best to leave our new processing resource last (at the bottom) in the list.
Yes, next is Jape. Jape is a language similar to regular expressions. We are going to use acquire.jape. It has two rules. I put it in \GATE-5.0-beta1\plugins\ANNIE\resources\NE\grammar\acquire.jape.
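To make the "similar to regular expressions" idea concrete, here is a plain java.util.regex analogy. This is not Jape and it is not the content of acquire.jape; the class name, the verb list and the output format are all made up for illustration. A real Jape rule matches over annotations (for example the Organization annotations that ANNIE produces) rather than over raw characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AcquireRegexSketch {

    // Two capitalised words separated by an acquisition verb.
    // The verb list is a hypothetical choice, not taken from acquire.jape.
    private static final Pattern ACQUIRE = Pattern.compile(
            "\\b([A-Z][A-Za-z]*)\\s+(?:acquires|acquired|buys|bought)\\s+([A-Z][A-Za-z]*)\\b");

    // Returns a short description of the first match, or null if nothing matches.
    public static String findAcquisition(String sentence) {
        Matcher m = ACQUIRE.matcher(sentence);
        if (!m.find()) {
            return null;
        }
        return "CompanyA=" + m.group(1) + " CompanyB=" + m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(findAcquisition("Sun acquires MySQL for one billion dollars."));
    }
}
```

A character-level pattern like this breaks as soon as a company name has two words or a lowercase letter in front, which is exactly why it pays to let ANNIE find the companies first and write the rule on top of its annotations.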
I have also modified the ANNIE Gazetteer list in \GATE-5.0-beta1\plugins\ANNIE\resources\gazetteer\company.lst by adding two lines:
MySQL
MySql
to make sure MySql is recognized as a company.
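If you have more than a couple of names to add, you may prefer to do it from code. The following is only a sketch of the idea: it assumes the list holds one entry per line, as in the two lines above, and it works on in-memory lists instead of touching company.lst. The class name, the method name and the sample entries are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class GazetteerEntries {

    // Returns the existing entries followed by any new entries that are not
    // already present. Matching is exact and case-sensitive, so "MySQL" and
    // "MySql" count as two different entries.
    public static List<String> withAddedEntries(List<String> existing, List<String> toAdd) {
        Set<String> seen = new LinkedHashSet<>(existing);
        List<String> result = new ArrayList<>(existing);
        for (String entry : toAdd) {
            String trimmed = entry.trim();
            // Skip blank lines and anything we have already seen.
            if (!trimmed.isEmpty() && seen.add(trimmed)) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical current content of the list, one entry per line.
        List<String> current = Arrays.asList("Oracle", "Sun Microsystems", "MySQL");
        for (String line : withAddedEntries(current, Arrays.asList("MySQL", "MySql"))) {
            System.out.println(line);
        }
    }
}
```

Writing the result back to the file is left out on purpose. Keep a backup of the original company.lst before you overwrite it.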
The whole idea is to get your enhanced ANNIE working by supplying a correct Jape grammar and testing it. Then you save your application to a file. You do that by right-clicking on a GATE application and selecting "Save application state".
1. You should save your application with the "gapp" extension (no problem if you do not).
2. It is better to remove the corpus from your application before saving, because that corpus will become one more dependency of your application.
Gapp files are simply XML files that describe where everything you use in your GATE application is located. This means you can edit them. Why would you? You won't need to as long as you keep using the gapp file from the location where you saved it and you do not change the location of your GATE installation. If you move myapp.gapp from c:\gatetest to d:\work\gatetest, things will probably go wrong. Fixing the paths is easy.
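Fixing the paths can be done in any text editor, but here is a small sketch of the same search-and-replace in Java. The sample line in main is illustrative only; I am assuming the paths show up in the file as URL-like strings, so open your own gapp file and check what the real entries look like before you pick the prefixes. The class and method names are made up.

```java
public class GappPaths {

    // Replaces every occurrence of oldPrefix with newPrefix in the gapp text.
    // This is plain string substitution, not XML-aware editing.
    public static String relocate(String gappText, String oldPrefix, String newPrefix) {
        return gappText.replace(oldPrefix, newPrefix);
    }

    public static void main(String[] args) {
        // Illustrative line only; a real gapp file may encode paths differently.
        String line = "<urlString>file:/c:/gatetest/acquire.jape</urlString>";
        System.out.println(relocate(line, "file:/c:/gatetest/", "file:/d:/work/gatetest/"));
    }
}
```

Include the trailing slash in both prefixes so that a folder like c:/gatetest2 is not rewritten by accident.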
Next we create a normal Java console application. We add all the jars in gate/lib. We check to make sure we added ALL the jars! I had problems because I had skipped some.
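Since a forgotten jar cost me some time, here is a sketch of a quick check you could run from inside the project. It compares the jar file names in a lib folder with the entries of the current class path. The class name, the method name and the default "gate/lib" path are assumptions; it compares file names only and does not understand wildcard class path entries.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class JarCheck {

    // Returns the jar names from the lib folder whose file name does not
    // appear among the class path entries.
    public static List<String> missingJars(List<String> libJarNames, String classPath, String separator) {
        Set<String> onClassPath = new HashSet<>();
        for (String entry : classPath.split(Pattern.quote(separator))) {
            onClassPath.add(new File(entry).getName());
        }
        List<String> missing = new ArrayList<>();
        for (String jar : libJarNames) {
            if (!onClassPath.contains(jar)) {
                missing.add(jar);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String libDir = args.length > 0 ? args[0] : "gate/lib";
        String[] names = new File(libDir).list((dir, name) -> name.endsWith(".jar"));
        if (names == null) {
            System.out.println("Not a directory: " + libDir);
            return;
        }
        List<String> missing = missingJars(Arrays.asList(names),
                System.getProperty("java.class.path"), File.pathSeparator);
        System.out.println("Jars not on the class path: " + missing);
    }
}
```

Run it with the path to your gate/lib folder as the only argument. An empty list means every jar in that folder is on the class path.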
And then what?
Thank you for asking. Then we use this Java code:
import gate.Annotation;
import gate.AnnotationSet;
import gate.Corpus;
import gate.CorpusController;
import gate.Document;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.util.persistence.PersistenceManager;
import java.io.File;
public class BatchProcessApp {

    public static void main(String[] args) throws Exception {
        // initialise GATE - this must be done before calling any GATE APIs
        Gate.init();

        File[] files = getFilesFromDir("F:/Temp/articles");
        // File gappFile = new File("g:/ModifiedAnnie.gapp");
        File gappFile = new File("g:/annie_acquire_nocorpus.gapp");

        // load the saved application
        CorpusController application = (CorpusController) PersistenceManager
                .loadObjectFromFile(gappFile);

        // Create a Corpus to use. We recycle the same Corpus object for each
        // iteration. The string parameter to newCorpus() is simply the
        // GATE-internal name to use for the corpus. It has no particular
        // significance.
        Corpus corpus = Factory.newCorpus("BatchProcessApp Corpus");
        application.setCorpus(corpus);

        // process the files one by one, skipping anything that is not .txt
        for (int i = 0; i < files.length; i++) {
            if (!files[i].getName().endsWith(".txt"))
                continue;

            // load the document (using the specified encoding if one was given)
            File docFile = files[i];
            System.out.print("Processing document " + docFile + "...");
            // toURI().toURL() replaces the deprecated File.toURL(), which
            // does not escape special characters in the path
            Document doc = Factory.newDocument(docFile.toURI().toURL(), encoding);

            // put the document in the corpus
            corpus.add(doc);

            // run the application
            application.execute();

            // remove the document from the corpus again
            corpus.clear();

            // in this example we only extract annotations from the
            // default (unnamed) AnnotationSet
            AnnotationSet defaultAnnots = doc.getAnnotations();
            System.out.println();
            for (Annotation ann : defaultAnnots) {
                FeatureMap map = ann.getFeatures();
                // print only the annotations created by our Jape rules
                if (map.get("relationType") != null)
                    System.out.println("## " + map.get("relationType")
                            + " #CompanyA=" + map.get("companyA")
                            + " #CompanyB=" + map.get("companyB"));
            }

            // release the document so it does not stay in memory
            Factory.deleteResource(doc);
            System.out.println("done");
        } // for each file

        System.out.println("All done");
    }

    // null means the default encoding is used
    private static String encoding = null;

    private static File[] getFilesFromDir(String path) {
        File dir = new File(path);
        File[] files = dir.listFiles();
        // listFiles() returns null when the path is not a readable directory
        return files != null ? files : new File[0];
    }
}
You can view/download the source from code.google.com
Gate.init() should be called only once! To run this code you need to set the path to your gapp file and the location of your articles (folders are not scanned recursively). Note that we could load all documents into one corpus, but instead the code keeps only one document in the corpus at a time, which makes better use of system resources. Also, this code sample displays only annotations that have a "relationType" feature. You should make it display everything.
You can see that the code instantiates the GATE application (a modified ANNIE) as a "CorpusController" and names it "application".
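If your articles are spread over subfolders, the file listing is the only part that has to change. Below is a sketch of a recursive replacement for the folder scan, using only the standard java.nio.file API. The class and method names are made up, and unlike the sample above it also accepts upper-case extensions such as ".TXT".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ArticleFinder {

    // True for names ending in .txt, ignoring case.
    public static boolean isTextFile(String fileName) {
        return fileName.toLowerCase().endsWith(".txt");
    }

    // Returns all .txt files under root, including subfolders, sorted by path.
    // Returns an empty list when root is not a directory.
    public static List<Path> findTextFiles(Path root) {
        if (!Files.isDirectory(root)) {
            return Collections.emptyList();
        }
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> isTextFile(p.getFileName().toString()))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : "F:/Temp/articles");
        for (Path p : findTextFiles(root)) {
            System.out.println(p);
        }
    }
}
```

Each returned Path can be turned into the URL that Factory.newDocument expects with p.toUri().toURL().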
Conclusion:
Data extraction with GATE can be done; it just takes reading through the documentation and posting questions to the GATE mailing list.
Maybe it is a good idea to create a Linux (or FreeBSD, OpenSolaris, Nexenta) VMware image (or Xen, VirtualBox) that has GATE, Eclipse and GATE's samples installed and working properly.
Disclaimer:
This Java sample is based on this sample.
I am a GATE newbie, so for now do not expect me to be able to answer your questions.
Credits:
Special thanks go to: everyone from the GATE mailing list, Marin Nozhchev (Ontotext), Stanislav Zlatinov.
Todo:
Explain the Jape code.
Add my articles.
Quick links:
The Semantic Annotation Workflow - KIM part 10
KIM Multi-threaded Clustered Client Application - KIM part 9
Gazetteers - KIM/GATE part 7
Strict Rules vs Machine Learning - KIM part 6
Tips and Tricks - KIM part 5
Using a Gate application - KIM part 4
Gate tutorial - KIM part 3
Using KIM from .NET - KIM part 2
Getting Started - KIM part 1
Installation - KIM part 0






