Validating Subject Terms Against the AGROVOC REST API

AGROVOC is a controlled vocabulary covering all areas of interest of the Food and Agriculture Organization (FAO) of the United Nations, including food, nutrition, agriculture, fisheries, forestry, and the environment. It is published by FAO and edited by a community of experts¹. At the time of writing, AGROVOC consists of over 36,000 concepts and is available in thirty-three languages. This broad scope makes it useful for sharing and preserving agricultural knowledge internationally.

We use AGROVOC as the de facto vocabulary for the Dublin Core subject field (dc.subject) in our CGSpace repository. Over the past ten years we have collected over 19,000 unique terms in the dc.subject field, though simple inspection of those values shows that the quality of the terms is mediocre. I wrote a Python script to programmatically validate these terms against the AGROVOC REST API.

AGROVOC Web Services

AGROVOC offers legacy SOAP web services, a SPARQL endpoint, and a REST API endpoint. In my brief evaluation SOAP seemed overly complex (what the hell is a WSDL file?), and the cognitive load of SPARQL is really only worth it if you need linked open data (RDF) support. The REST API is much easier to understand, and the tools and documentation needed to query it from a script are far more accessible.

agrovoc-lookup.py

The result of a few hours of work is agrovoc-lookup.py. Written for Python 3.6+ with minimal third-party dependencies, the script reads a plaintext input file line by line and asks AGROVOC if there is an exact match for each term in a given language. Matched and unmatched terms will be saved to separate output files for subsequent processing — you might want to try validating the unmatched terms against a different language, for example.
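
The exact-match check described above can be sketched roughly as follows. This is not the code from agrovoc-lookup.py: the base URL is a placeholder, the function names are made up for illustration, and the response shape (a "results" list whose items carry a "prefLabel") is an assumption about a Skosmos-style search endpoint that you should verify against the API documentation. It also uses only the standard library to stay self-contained.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: placeholder base URL for a Skosmos-style search endpoint.
# Substitute the real AGROVOC REST search URL before running lookup().
SEARCH_URL = "https://agrovoc.example.org/rest/v1/search"


def build_search_url(term, language, base_url=SEARCH_URL):
    """Return the search URL for one term, with the query percent-encoded."""
    return base_url + "?" + urlencode({"query": term, "lang": language})


def has_exact_match(payload, term):
    """Return True if any result's preferred label equals the term, ignoring case.

    Assumption: payload looks like {"results": [{"prefLabel": "..."}, ...]}.
    """
    wanted = term.strip().casefold()
    return any(
        result.get("prefLabel", "").casefold() == wanted
        for result in payload.get("results", [])
    )


def lookup(term, language):
    """Query the endpoint (network access required) and report an exact match."""
    with urlopen(build_search_url(term, language), timeout=10) as response:
        return has_exact_match(json.load(response), term)
```

Keeping the URL building and the response check separate from the network call makes both easy to test with canned data, and urlencode takes care of terms containing spaces, apostrophes, parentheses, or accented characters.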

Given input file agrovoc-terms.txt with the following mix of English and Spanish terms…

CORN
WOMEN'S PARTICIPATION
COMMUNITY-BASED FOREST MANAGEMENT
INTERACCIÓN GENOTIPO AMBIENTE
COCOA (PLANT)

I run the script in debug mode against the English language AGROVOC, specifying output files for the matched and unmatched terms accordingly:

$ ./agrovoc-lookup.py -i agrovoc-terms.txt -l en \
-om matches.en.txt \
-or rejects.en.txt -d
Looking up the subject: CORN (en)
No exact match for 'CORN' in AGROVOC en
Looking up the subject: WOMEN'S PARTICIPATION (en)
Exact match for "WOMEN'S PARTICIPATION" in AGROVOC en
Looking up the subject: COMMUNITY-BASED FOREST MANAGEMENT (en)
No exact match for 'COMMUNITY-BASED FOREST MANAGEMENT' in AGROVOC en
Looking up the subject: INTERACCIÓN GENOTIPO AMBIENTE (en)
No exact match for 'INTERACCIÓN GENOTIPO AMBIENTE' in AGROVOC en
Looking up the subject: COCOA (PLANT) (en)
Exact match for 'COCOA (PLANT)' in AGROVOC en

The output files contain the matched and unmatched subject terms, respectively:

$ wc -l matches.en.txt rejects.en.txt
  2 matches.en.txt
  3 rejects.en.txt
  5 total

Because I know that our data has many Spanish and French subject terms, I would then run the rejected English terms against AGROVOC for each additional language, feeding the rejects from one into the input for the next. In the end this would produce one file of terms that didn’t match in any of the languages we are interested in, which I could then send to my editors for manual validation and correction.
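
The reject-chaining workflow could also be expressed in a few lines of Python instead of repeated shell invocations. The sketch below is only an illustration: cascade is a hypothetical helper, and lookup stands for any function that takes a term and a language code and returns True on an exact match.

```python
def cascade(terms, languages, lookup):
    """Try each language in turn, passing one language's rejects to the next.

    Returns ({language: matched terms}, terms that matched in no language).
    """
    matched = {}
    remaining = list(terms)
    for language in languages:
        hits = [term for term in remaining if lookup(term, language)]
        matched[language] = hits
        remaining = [term for term in remaining if term not in hits]
    return matched, remaining


if __name__ == "__main__":
    # Hard-coded stand-in for the real API, NOT actual AGROVOC results.
    pretend_known = {("TERM ONE", "en"), ("TERMINO DOS", "es")}
    matched, rejects = cascade(
        ["TERM ONE", "TERMINO DOS", "TERME TROIS"],
        ["en", "es", "fr"],
        lambda term, language: (term, language) in pretend_known,
    )
    print(matched)
    print(rejects)  # the terms to send for manual validation
```

Each term is only looked up in a later language if it was rejected in every earlier one, which keeps the number of API requests down.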

Run the script with --help to see all options and their descriptions:

$ ./agrovoc-lookup.py --help

Future Areas of Improvement

As of this writing agrovoc-lookup.py has reached version 0.2.1. This reflects basic validation functionality with a few minor improvements that I made during my testing. For example, the script uses requests-cache to save time and (server) resources on subsequent runs with the same data, and I’ve improved the handling of subject terms in at least English, Spanish, and French.
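
To illustrate why caching helps on repeated runs, here is a minimal standard-library sketch of the idea. It is not the requests-cache API and not what the script does internally: requests-cache works transparently at the HTTP layer, whereas this hypothetical TermCache class simply remembers the yes/no answer per language and term in a JSON file.

```python
import json
from pathlib import Path


class TermCache:
    """Remember lookup results on disk so repeated runs skip the network."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self.data = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.data = {}

    def get_or_fetch(self, term, language, fetch):
        """Return the cached result, calling fetch(term, language) only on a miss."""
        key = f"{language}:{term}"
        if key not in self.data:
            self.data[key] = fetch(term, language)
            # Persist after every miss so an interrupted run loses nothing.
            self.path.write_text(
                json.dumps(self.data, ensure_ascii=False), encoding="utf-8"
            )
        return self.data[key]
```

A real cache would also want some form of expiry, since AGROVOC itself gains new concepts over time and a term rejected today may match later.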

Possible areas of improvement include:

  • More output formats, such as a CSV with separate columns for matches and rejects
  • Suggestions for unmatched terms, possibly in one column of the CSV output
  • Ability to read from stdin / write to stdout for use in shell scripts
  • Ability to match in any language, i.e. without specifying -l
  • Handling terms in languages written in non-Latin scripts, like Russian, Georgian, Arabic, Chinese, etc.
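
As a rough idea of what the CSV output from the first item might look like, here is a small sketch. Nothing like this exists in the script yet, so write_csv and the column names are purely hypothetical.

```python
import csv
import sys
from itertools import zip_longest


def write_csv(matches, rejects, stream):
    """Write matches and rejects side by side, padding the shorter column."""
    writer = csv.writer(stream)
    writer.writerow(["matches", "rejects"])
    writer.writerows(zip_longest(matches, rejects, fillvalue=""))


if __name__ == "__main__":
    # Writing to stdout also covers the shell-pipeline use case.
    write_csv(["TERM ONE"], ["TERM TWO", "TERM THREE"], sys.stdout)
```

The two columns are independent lists rather than paired values, so a long format with one term per row and a status column might turn out to be easier to process downstream.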

Comments, questions, and suggestions are welcome, as are pull requests on our DSpace repository on GitHub (where the script lives). I hope this information is useful to someone.

Footnotes

¹ AGROVOC (April 6, 2019). Retrieved from http://aims.fao.org/vest-registry/vocabularies/agrovoc.