The "Corpus Quality Assurance" tool and the Annotation Set Transfer processing resource make it easy to compare the results of two pipelines.
You won't find this tool in the main GATE "Tools" menu; it appears at the bottom of the GATE interface when you click on a corpus. The tool has been available since GATE 5.1 (I think). To use the Corpus Quality Assurance tool we need two annotation sets so that we can compare them.
1. Create a datastore
Create a corpus. Then right-click it, choose the "Save to datastore" option and point it to your datastore. Next, populate the corpus and notice that the documents do not appear under your corpus in the "Language resources" tab as they usually do, but in the datastore. Datastores are quite useful because they do not keep all the documents in memory. This way you can annotate a large number of documents before you run into a problem in GATE or Java.
2. Create Annotation Set 1
Warning: Check which version of GATE you are using and which plug-ins are loaded. GATE can load plug-ins from other GATE versions and instances that you have installed.
You might, for example, want to test against two versions of GATE. The second GATE will probably load the same list of plug-ins as the one you have just closed. Yes, the core of the GATE framework will be different, but because you load the same plug-ins, you will probably get the same results. So you might end up comparing identical results, or wondering why the results are not the ones you expect. Go to File->Manage CREOLE Plugins and select your resources explicitly, checking the paths to your plug-ins.
Add a new "Annotation Set Transfer" processing resource to the end of your pipeline:
inputASName: leave blank to use the default annotation set.
outputASName: in our case, "Annotation Set 1".
For more information see the documentation.
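To make the two parameters concrete, here is a toy Python model of what the transfer step does. It is not GATE's API: the document is a plain dictionary from set name to a list of annotations, the empty string stands for the default set, and `transfer` is a hypothetical helper. The sketch copies annotations and leaves the source set untouched, which is a simplifying assumption.

```python
# Toy model, not the GATE API: a document maps annotation-set names to lists
# of annotations; "" stands for the default (unnamed) annotation set.

def transfer(document, input_as_name="", output_as_name="Annotation Set 1"):
    """Copy every annotation from the input set into the output set."""
    source = document.get(input_as_name, [])
    target = document.setdefault(output_as_name, [])
    # Copy each annotation so later changes to one set do not leak into the other.
    target.extend(dict(annotation) for annotation in source)
    return document


doc = {"": [{"type": "Person", "start": 0, "end": 5},
            {"type": "Location", "start": 15, "end": 21}]}
transfer(doc, input_as_name="", output_as_name="Annotation Set 1")
print(sorted(doc))  # the default set plus the new named set
```

After the call the document holds both the default set and "Annotation Set 1", which mirrors what you should see in the annotation sets view once the pipeline has run.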
Now run the pipeline. You should find the new annotation set below the default one. If needed, collapse the default one to see the new one below it.
This new annotation set "Annotation Set 1" has been saved in the datastore automatically.
3. Make changes
Replace plug-in versions or load another instance of GATE. Please check the plug-in versions by inspecting their paths in File->Manage CREOLE Plugins.
4. Create Annotation Set 2
Open the datastore created in step 1 (if it is not already available).
Open the corpus contained in that datastore (if it is not already available) by clicking the datastore, scrolling down until you find it, then right-clicking it and selecting "Load". It should now be available in its usual place under "Language resources".
Check if "Annotation Set 1" is available.
Create "Annotation Set 2" by again adding an "Annotation Set Transfer" processing resource and setting the correct parameters. Set the "setsToKeep" parameter of the "Document Reset PR" to "Annotation Set 1" in order to preserve the results from the first run.
Warning: If you get a strange exception, it could be due to the Annotation Set Transfer resource. The version in GATE 5.2 trunk requires you to fill in the "annotationTypes" list, although the specification does not mark this parameter as required.
Run the pipeline!
Check if "Annotation Set 2" is available.
5. Inspect
Click on the corpus and then on "Corpus Quality Assurance" at the bottom. Wait until it initializes; there is a progress bar on the right. This initialization is needed so that the tool knows what your annotation types and features are and can fill the lists on the right.
Select "Annotation Set 1" and "Annotation Set 2" as A and B in the first list. Then select the annotation type that you would like to compare. I suggest that you select a single annotation type per comparison. Then select the features that must agree for two annotations to count as identical. Select the measures and then press the "Compare" button.
Warning: The feature "matches" contains the IDs of the annotations that refer to the same entity. The problem is that these IDs are different in "Annotation Set 1" and "Annotation Set 2", because GATE assigns bigger and bigger IDs to new annotations. If you check the document you will see that both ID lists refer to the same entities, and the IDs even differ by a constant, but they are still different. We need to solve this, because a difference in "matches" can also be a real problem and not simply a difference in IDs. One way to go is not to use "matches" but another feature, for example one that contains the start and end point of an annotation in the document. These should be equal in both sets when the same document is processed and the encoding is set correctly.
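The offset-based workaround can be sketched in Python. This is an illustration over plain dictionaries, not GATE code: the annotation layout and the helper `matches_as_offsets` are assumptions. It rewrites each "matches" list from annotation IDs to (start, end) pairs, so that two runs whose IDs differ only by a constant compare as equal, while a genuine difference in which mentions are linked would still show up.

```python
# Illustration only: annotations are plain dicts keyed by their ID.

def matches_as_offsets(annotations):
    """Map each annotation's span to the spans listed in its "matches"."""
    spans = {ann_id: (a["start"], a["end"]) for ann_id, a in annotations.items()}
    result = {}
    for ann_id, a in annotations.items():
        linked = a.get("matches", [])
        result[spans[ann_id]] = sorted(spans[other] for other in linked)
    return result


# Same two mentions of one entity; the IDs in run B are shifted by 100.
run_a = {
    1: {"start": 0, "end": 10, "matches": [1, 2]},
    2: {"start": 40, "end": 45, "matches": [1, 2]},
}
run_b = {
    101: {"start": 0, "end": 10, "matches": [101, 102]},
    102: {"start": 40, "end": 45, "matches": [101, 102]},
}

raw_equal = ([a["matches"] for a in run_a.values()]
             == [b["matches"] for b in run_b.values()])
print(raw_equal)                                               # raw ID lists
print(matches_as_offsets(run_a) == matches_as_offsets(run_b))  # span-based view
```

The raw ID lists disagree even though nothing meaningful changed, whereas the span-based view treats the two runs as identical.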
Click on "Document statistics". Then click on a document where "Only A", "Only B" or "Overlap" is greater than 0. This means we have a difference. To see the actual mismatch, click on the document and then on the "Annotation Diff" button, which is currently the second one at the top right of the GATE UI.
Keep in mind that there might be some bugs, such as this one.
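As a rough picture of what the "Only A", "Only B" and "Overlap" counts mean, here is a simplified Python sketch. It is not the algorithm GATE uses: `compare_sets` is a hypothetical helper, annotations are reduced to (start, end, type) tuples, and the pairing of partially overlapping annotations is a naive greedy one.

```python
# Simplified illustration, not GATE's comparison algorithm.
# An annotation is a (start, end, type) tuple.

def compare_sets(set_a, set_b):
    """Count exact matches, partial overlaps and unpaired annotations."""
    remaining_a = [a for a in set_a if a not in set_b]
    remaining_b = [b for b in set_b if b not in set_a]
    match = len(set_a) - len(remaining_a)

    overlap = 0
    for a in list(remaining_a):
        for b in remaining_b:
            same_type = a[2] == b[2]
            spans_cross = a[0] < b[1] and b[0] < a[1]
            if same_type and spans_cross:
                overlap += 1
                remaining_a.remove(a)
                remaining_b.remove(b)
                break  # each annotation is paired at most once

    return {"match": match, "only_a": len(remaining_a),
            "only_b": len(remaining_b), "overlap": overlap}


set_a = [(0, 5, "Person"), (10, 20, "Location"), (30, 35, "Person")]
set_b = [(0, 5, "Person"), (12, 20, "Location"), (50, 55, "Organization")]

stats = compare_sets(set_a, set_b)
has_difference = any(stats[k] > 0 for k in ("only_a", "only_b", "overlap"))
print(stats, has_difference)
```

In this example one annotation is identical in both sets, one pair overlaps only partially, and each set has one annotation with no counterpart, so the document would be worth opening in the diff view.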