AGROVOC is a controlled vocabulary covering all areas of interest of the Food and Agriculture Organization (FAO) of the United Nations, including food, nutrition, agriculture, fisheries, forestry, and the environment. It is published by FAO and edited by a community of experts.¹ At the time of writing, AGROVOC consists of over 36,000 concepts and is available in thirty-three languages. This broad scope makes it useful for sharing and preserving agricultural knowledge internationally.
We use AGROVOC as the de facto vocabulary for the Dublin Core subject field (dc.subject) in our CGSpace repository. Over the past ten years we have collected over 19,000 unique terms in the dc.subject field, though simple inspection of those values shows that the quality of the terms is mediocre. I wrote a Python script to programmatically validate these terms against the AGROVOC REST API.
AGROVOC Web Services
AGROVOC offers legacy SOAP web services, a SPARQL endpoint, and a REST API endpoint. In my brief evaluation SOAP seemed overly complex — what the hell is a WSDL file? — and the cognitive load of SPARQL is really only worth it if you need linked open data (RDF) support. The REST API is much easier to understand and the tools and documentation required to automate its interrogation are vastly more accessible.
The result of a few hours of work is agrovoc-lookup.py. Written for Python 3.6+ with minimal third-party dependencies, the script reads a plaintext input file line by line and asks AGROVOC whether there is an exact match for each term in a given language. Matched and unmatched terms are saved to separate output files for subsequent processing. You might want to try validating the unmatched terms against a different language, for example.
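To make that loop concrete, here is a minimal sketch of the idea. It is not the code from agrovoc-lookup.py. The endpoint URL, the query and lang parameters, the results and prefLabel keys, and the function names are all assumptions based on a Skosmos-style REST search, and the case-insensitive comparison is my own simplification of "exact match".

```python
import json
import urllib.parse
import urllib.request

# Assumption: a Skosmos-style REST search endpoint. This URL is illustrative
# and may differ from the one agrovoc-lookup.py actually uses.
SEARCH_URL = "https://agrovoc.fao.org/browse/rest/v1/search"


def search_agrovoc(term, language):
    """Return the list of search results for a term (performs an HTTP request)."""
    query = urllib.parse.urlencode({"query": term, "lang": language})
    with urllib.request.urlopen(f"{SEARCH_URL}?{query}", timeout=30) as response:
        return json.load(response).get("results", [])


def split_terms(lines, language, search=search_agrovoc):
    """Sort terms into (matched, rejected) lists.

    Assumption: a term counts as an exact match when one of the returned
    results has a prefLabel equal to the term, ignoring case.
    """
    matched, rejected = [], []
    for line in lines:
        term = line.strip()
        if not term:
            continue  # skip blank lines in the input file
        results = search(term, language)
        labels = {str(result.get("prefLabel", "")).casefold() for result in results}
        if term.casefold() in labels:
            matched.append(term)
        else:
            rejected.append(term)
    return matched, rejected
```

Passing the search function in as a parameter keeps the sorting logic testable without touching the network. In real use you would open the input file, pass the file object as lines, and write each returned list to its own output file.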
Given input file agrovoc-terms.txt with the following mix of English and Spanish terms…
CORN
WOMEN'S PARTICIPATION
COMMUNITY-BASED FOREST MANAGEMENT
INTERACCIÓN GENOTIPO AMBIENTE
COCOA (PLANT)
I run the script in debug mode against the English language AGROVOC, specifying output files for the matched and unmatched terms accordingly:
$ ./agrovoc-lookup.py -i agrovoc-terms.txt -l en \
    -om matches.en.txt \
    -or rejects.en.txt -d
Looking up the subject: CORN (en)
No exact match for 'CORN' in AGROVOC en
Looking up the subject: WOMEN'S PARTICIPATION (en)
Exact match for "WOMEN'S PARTICIPATION" in AGROVOC en
Looking up the subject: COMMUNITY-BASED FOREST MANAGEMENT (en)
No exact match for 'COMMUNITY-BASED FOREST MANAGEMENT' in AGROVOC en
Looking up the subject: INTERACCIÓN GENOTIPO AMBIENTE (en)
No exact match for 'INTERACCIÓN GENOTIPO AMBIENTE' in AGROVOC en
Looking up the subject: COCOA (PLANT) (en)
Exact match for 'COCOA (PLANT)' in AGROVOC en
The output files contain the matched and unmatched subject terms, respectively:
$ wc -l matches.en.txt rejects.en.txt
  2 matches.en.txt
  3 rejects.en.txt
  5 total
Because I know that our data has many Spanish and French subject terms, I would then run the rejected English terms against AGROVOC for each additional language, feeding the rejects from one into the input for the next. In the end this would produce one file of terms that didn’t match in any of the languages we are interested in, which I could then send to my editors for manual validation and correction.
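That reject-chaining workflow can be sketched in a few lines. This is only an illustration: cascade and is_match are hypothetical names, and is_match stands in for whatever actually performs the lookup, whether that is a call to the REST API or a separate run of the script per language.

```python
def cascade(terms, languages, is_match):
    """Try each language in turn, carrying only the rejects forward.

    is_match(term, language) is a hypothetical callable that returns True
    when the term has an exact match in that language.
    Returns (matches_by_language, terms_unmatched_in_every_language).
    """
    matches = {}
    remaining = list(terms)
    for language in languages:
        matches[language] = [term for term in remaining if is_match(term, language)]
        matched = set(matches[language])
        # Only the rejects from this language feed into the next one.
        remaining = [term for term in remaining if term not in matched]
    return matches, remaining
```

Whatever is left in the second return value after the last language is the list that would go to the editors for manual review.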
Run the script with --help to see all options and their descriptions:
$ ./agrovoc-lookup.py --help
Future Areas of Improvement
As of this writing agrovoc-lookup.py has reached version 0.2.1. This reflects basic validation functionality with a few minor improvements that I made during my testing. For example, the script uses requests-cache to save time and (server) resources on subsequent runs with the same data, and I’ve improved the handling of subject terms formatted in at least English, Spanish, and French.
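To show why caching helps on repeated runs, here is a standard-library stand-in for the idea. It is not how requests-cache works internally, since that library caches whole HTTP responses transparently. The TermCache class, its JSON file format, and the fetch callable are all hypothetical and exist only for this sketch.

```python
import json
import os


class TermCache:
    """Tiny on-disk cache of lookup results keyed by language and term."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            # Reuse results saved by an earlier run.
            with open(path, encoding="utf-8") as f:
                self.data = json.load(f)

    def lookup(self, term, language, fetch):
        """Return the cached result, calling fetch(term, language) only on a miss."""
        key = f"{language}:{term}"
        if key not in self.data:
            self.data[key] = fetch(term, language)
        return self.data[key]

    def save(self):
        """Persist the cache so the next run can skip known terms."""
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.data, f, ensure_ascii=False)
```

A second run over the same input file then makes no requests at all for terms it has already seen, which is the saving described above. Unlike requests-cache, this sketch never expires entries.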
Possible areas of improvement include:
- More output formats, such as CSV with separate columns for matches and rejects
- Suggestions for unmatched terms, possibly in one column of the CSV output
- Ability to read from stdin / write to stdout for use in shell scripts
- Ability to match in any language, i.e. without specifying one
- Handling terms in languages with non-Latin scripts, such as Russian, Georgian, Arabic, and Chinese
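As a rough idea of what the CSV output could look like, here is a small sketch. None of this exists in the script yet. The write_csv function and the column names are hypothetical, and writing to an arbitrary file-like object is one way to get stdout support at the same time.

```python
import csv
from itertools import zip_longest


def write_csv(matches, rejects, out):
    """Write matches and rejects side by side as two CSV columns.

    out is any writable text file object, for example sys.stdout.
    The shorter column is padded with empty cells.
    """
    writer = csv.writer(out)
    writer.writerow(["matches", "rejects"])
    for match, reject in zip_longest(matches, rejects, fillvalue=""):
        writer.writerow([match, reject])
```

Calling write_csv(matches, rejects, sys.stdout) would let the output be piped straight into other shell tools.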
Comments, questions, and suggestions welcome. I hope this information is useful to someone. Pull requests are welcome on our DSpace repository on GitHub (where the script lives).
¹ AGROVOC (April 6, 2019). Retrieved from http://aims.fao.org/vest-registry/vocabularies/agrovoc.