Distributed with the taxonomy_xml module is a collection of starter vocabularies intended both to illustrate the various formats and to provide a few useful topic sets.
The content of each of the demo vocabularies was the
responsibility of the original publishers at the time it was
imported. All imports were done in a semi-automated manner
with no editorial input. I am not responsible for errors of
fact or spelling.
Structural problems, character encoding problems and the
occasional omission are probably my fault.
Caveat Lector
Credit is given here to the institutions that made this data
available. All data redistributed here has been carefully
selected as being free of copyright restrictions on transformative
re-use.
In some cases, tools or instructions will also be
provided for you to import your own versions of vocabulary
libraries, for reasons of scale, timeliness or
copyright. Where copyright applies, you should read and
understand the terms of use of the respective data sources.
Usually it's "free for personal use but not redistribution"
and the taxonomy_xml module can enable that use.
Although ownership of the Dewey Decimal system is
claimed by OCLC (Online
Computer Library Center), they don't actually provide any
list (or offer access to a list) as a machine-readable
download, so I was unable to use them as a source.
Instead I found a public library
website that had released the Dewey lists into the public
domain. (It has since gone away.)
As samples, the taxonomy_xml module contains both a
100-term and 1000-term* version of the Dewey classification
scheme, with the implied decimal hierarchy and the 'Dewey
Number' supplied as a synonym.
As the Dewey system is extremely simple, it is provided as
an example of the CSV format.
Geography & history (900)
+ History of ancient world (930)
+ + History of ancient world China (931)
+ + History of ancient world Egypt (932)
+ + History of ancient world Europe north & west of Italy (936)
+ + History of ancient world Greece (938)
* There's not really 1000 terms in use at that level. There are however many more subsections on a truly decimal breakdown in some areas (not included).
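As a rough illustration of what "implied decimal hierarchy" means for the three-digit numbers above, here is a minimal Python sketch. It is not the module's own code; the function names `dewey_parent` and `dewey_ancestors` are made up for this example, and it assumes that a term's parent is found by zeroing its last non-zero digit (931 sits under 930, which sits under 900).

```python
from typing import List, Optional


def dewey_parent(code: int) -> Optional[int]:
    """Return the implied parent of a three-digit Dewey number.

    Assumption for this sketch: hundreds (900) are top-level classes,
    tens (930) sit under their hundred, and units (931) sit under their ten.
    """
    if not 0 <= code <= 999:
        raise ValueError("expected a three-digit Dewey number (0-999)")
    if code % 100 == 0:
        return None                 # e.g. 900 has no parent
    if code % 10 == 0:
        return code - code % 100    # e.g. 930 -> 900
    return code - code % 10         # e.g. 931 -> 930


def dewey_ancestors(code: int) -> List[int]:
    """Return the chain of implied ancestors, top-level class first."""
    chain = []
    parent = dewey_parent(code)
    while parent is not None:
        chain.append(parent)
        parent = dewey_parent(parent)
    return chain[::-1]
```

With this, `dewey_ancestors(931)` walks up through 930 to 900, which matches the nesting shown in the sample above.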
From the International Press Telecommunications Council we have a 'TopicSet' of 1365 controlled vocabulary words and phrases (subjectCodes) useful for classifying news stories and tagging media releases.
Subject areas include branches like:
unrest, conflicts and war
+ act of terror
+ armed conflict
+ civil unrest
+ + political dissent
+ + rebellions
+ + religious conflict
+ + revolutions
The taxonomy is hierarchical, and contains a full-text description of each term and a UID number provided by the IPTC. It does not contain synonyms or related terms (although it probably should).
This data was imported by way of an XSL transformation from an XML file topicset.iptc-subjectcode.xml taken from the site in 2007. The IPTC also maintains several other useful vocabularies on their (hard to bookmark) Resource page. Visit them for more.
The E-government Initiative from the New Zealand government has produced the NZGLS thesauri - including a list of 2364 keyword-type ratified terms to be used when classifying government services or interest areas. It is only lightly hierarchical, and exists mainly as a synonym collapser and list of 'preferred' consistent terminology.
It contains many 'related terms' as well as several weaker synonyms for many terms.
Aircraft (Related Terms: Pilots, Aviation) (Synonyms: Light aircraft, Airships, Aeroplanes)
+ Helicopters
+ Microlite Aircraft
Airlines (Related Terms: Aviation)
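To make the "synonym collapser" idea concrete, here is a small Python sketch using the sample terms above. It is only an illustration under my own assumptions: the `Term` class and the `build_synonym_index` and `preferred` helpers are hypothetical names, not part of taxonomy_xml or of the NZGLS data itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """One preferred term with its weaker synonyms and related terms."""
    name: str
    synonyms: List[str] = field(default_factory=list)
    related: List[str] = field(default_factory=list)


def build_synonym_index(terms: List[Term]) -> Dict[str, str]:
    """Map every preferred name and synonym (lower-cased) to its preferred name."""
    index: Dict[str, str] = {}
    # Preferred names go in first so a synonym can never shadow one.
    for term in terms:
        index[term.name.lower()] = term.name
    for term in terms:
        for synonym in term.synonyms:
            index.setdefault(synonym.lower(), term.name)
    return index


def preferred(index: Dict[str, str], word: str) -> Optional[str]:
    """Collapse a keyword to its preferred term, or None if it is unknown."""
    return index.get(word.strip().lower())


terms = [
    Term("Aircraft",
         synonyms=["Light aircraft", "Airships", "Aeroplanes"],
         related=["Pilots", "Aviation"]),
    Term("Airlines", related=["Aviation"]),
]
index = build_synonym_index(terms)
```

Here `preferred(index, "aeroplanes")` collapses to "Aircraft". Related terms are kept on the term but deliberately left out of the index, since they point at neighbouring concepts rather than alternative names.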
This data is currently being retrieved directly from the e.govt.nz website as a demonstration of the simplest kind of web service the taxonomy_xml module supports. The original file is provided as a CSV which is retrieved directly from the URL when the taxonomy_xml admin selects [Web Service][SONZ] as an import source.
This dataset is in fact the first test case, and the reason I started developing syntax readers for Drupal taxonomies.
This is a copy of a subset of the Google Merchant recommended product category labels. The full thing is documented and downloadable from the Google Merchant Centre Help Pages.
The distributed version contains only the top two levels (200 terms). The full thing - which you can download, convert to CSV and import yourself - can go to 5 levels deep and contain close to 4000 terms.
This is an alternate CSV format, taking each term on a new line with its ancestors repeated in each previous column.
Media,
Media, Books
Media, Books, Fiction
Media, Books, Non-fiction
Media, DVDs & Videos
Media, Magazines & Newspapers
Media, Music
Media, Sheet Music
...etc. It's very limited (and wordy), but also about as obvious as possible.
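For anyone converting their own file into this layout, the following Python sketch shows one way rows like the sample above could be turned back into parent/child links. It is an illustration only, not how taxonomy_xml reads the file; `parse_path_rows` and `children` are hypothetical helper names, and it assumes plain comma-separated cells with the term in the last non-empty column.

```python
import csv
import io
from typing import Dict, List, Optional, Tuple

Path = Tuple[str, ...]

SAMPLE = """\
Media,
Media, Books
Media, Books, Fiction
Media, Books, Non-fiction
Media, DVDs & Videos
Media, Magazines & Newspapers
Media, Music
Media, Sheet Music
"""


def parse_path_rows(text: str) -> Dict[Path, Optional[Path]]:
    """Map each term (identified by its full ancestor path) to its parent path."""
    parents: Dict[Path, Optional[Path]] = {}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        # Drop empty cells, such as the one left by a trailing comma.
        path = tuple(cell.strip() for cell in row if cell.strip())
        # Register every prefix, so an ancestor without a row of its own still exists.
        for depth in range(1, len(path) + 1):
            node = path[:depth]
            parents.setdefault(node, node[:-1] or None)
    return parents


def children(parents: Dict[Path, Optional[Path]], path: Path) -> List[str]:
    """List the names of the direct children of a term, in file order."""
    return [node[-1] for node, parent in parents.items() if parent == path]


parents = parse_path_rows(SAMPLE)
```

Keying each term by its whole path rather than by name alone keeps two same-named terms in different branches apart. Note that a term whose own name contains a comma (such as "Food, Beverages & Tobacco") would have to be quoted in the CSV for this approach to work.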
This format was used by Google Base for its merchant product taxonomy, and represents the terms it wants to see in product descriptions. It could serve as a start for organizing an ecommerce store.
Top-level headings are:
Animals
Arts & Entertainment
Baby & Toddler
Business & Industrial
Cameras & Optics
Clothing & Accessories
Electronics
Food, Beverages & Tobacco
Furniture
Hardware
Health & Beauty
Home & Garden
Luggage
Mature
Media
Office Supplies
Software
Sporting Goods
Toys & Games
Vehicles & Parts