The following describes the currently supported taxonomy formats found in the wild.
Note that not all formats are required to provide both an
import and export function. CSV and TCS are currently one-way
imports.
Note also that the Drupal-XML format is supported as legacy
only, and its use for export is discouraged. RDF is the recommended format for
maximum compatibility with other systems into the future.
DEPRECATED. The basic format for taxonomy files is a custom-made XML schema that reflects the internal data objects of Drupal vocabulary terms fairly directly. It is suitable only for exchanging taxonomies between similar versions of Drupal sites, and is not recommended for exporting to other systems. It is supported because a major function of this module is to assist migration from older sites, but it should not be treated as a recommended representation.
An XML schema taxonomy.xsd is available for validation. A snippet looks something like:
  <vocabulary>
    <vid>5</vid>
    <name>Editorial sections</name>
    <hierarchy>1</hierarchy>
    <nodes>blog,page,story</nodes>
    <term>
      <tid>83</tid>
      <vid>5</vid>
      <name>Analysis</name>
      <description>Examines the connections between known facts.</description>
      <parent>0</parent>
    </term>
    <term> ... </term>
  </vocabulary>
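To illustrate how little structure the legacy format carries, the sample can be read with any generic XML library. The following is a minimal Python sketch, not the module's own PHP importer. The helper name `read_vocabulary` is hypothetical, the elided second `<term>` from the sample is left out, and treating a parent of 0 as "top level" is an assumption.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the snippet in the text (elided term omitted).
SAMPLE = """
<vocabulary>
  <vid>5</vid>
  <name>Editorial sections</name>
  <hierarchy>1</hierarchy>
  <nodes>blog,page,story</nodes>
  <term>
    <tid>83</tid>
    <vid>5</vid>
    <name>Analysis</name>
    <description>Examines the connections between known facts.</description>
    <parent>0</parent>
  </term>
</vocabulary>
"""


def read_vocabulary(xml_text):
    """Hypothetical helper: return the vocabulary as plain dicts."""
    root = ET.fromstring(xml_text.strip())
    terms = []
    for term in root.findall("term"):
        terms.append({
            "tid": int(term.findtext("tid")),
            "name": term.findtext("name"),
            "description": term.findtext("description", default=""),
            # Assumption: a parent of 0 marks a top-level term.
            "parent": int(term.findtext("parent", default="0")),
        })
    return {
        "vid": int(root.findtext("vid")),
        "name": root.findtext("name"),
        "nodes": root.findtext("nodes", default="").split(","),
        "terms": terms,
    }


vocabulary = read_vocabulary(SAMPLE)
print(vocabulary["name"], [t["name"] for t in vocabulary["terms"]])
```

Because the element names mirror Drupal's internal fields, a reader like this is only useful when the target system shares those concepts, which is why the format is kept for migration only.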
For compatibility with the widest range of sources, CSV
import is possible.
See ISO 2788 for notes on expressing thesauri.
Flat-file taxonomies (or "thesauri", or "restricted
vocabularies") are often notated in files looking something
like:
  Cyclones, Use, Storms
  Disasters, Used for, Natural disasters
  Storms, Used for, Cyclones
  Storms, Broader Terms, Weather
  Storms, Related Terms, Disasters
  Tidal waves, Use, Tsunami
  Tsunami, Used for, Tidal waves
  Tsunami, Broader Terms, Disasters
  Tsunami, Related Terms, Oceans
  Weather, Narrower Terms, Rain
  Weather, Narrower Terms, Storms
  Weather, Narrower Terms, Wind
  Weather, Related Terms, Meteorology

This (incomplete) set of data would produce a taxonomy model looking like:
  -- Disasters (syn: Natural Disasters; rel: Storms)
  ---- Tsunami (syn: Tidal Waves; rel: Oceans)
  -- Weather (rel: Meteorology)
  ---- Storms (rel: Disasters, syn: Cyclones)
  ---- Rain
  ---- Wind
The shape of these files is much the same across sources,
but the terminology used varies widely.
"Related Term" is sometimes written as 'Related', 'RT', or
'seeAlso', and the same applies to all the other concepts.
Imports from CSV attempt to use any of these synonyms, so
it doesn't actually matter which words you use! See
taxonomy_xml.module:taxonomy_xml_relationship_synonyms()
for the full list. There is no requirement about source
order (you can refer to terms before they have been
'declared') and there is no requirement for internal
consistency. You can declare one term a parent of another,
that one a child of the first, with a statement either way,
or both.
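The behaviour described above (synonymous predicates, no ordering requirement, relationships stated from either end) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the module's PHP code. The synonym table is deliberately tiny: only the spellings quoted in this document are included, and the real list lives in taxonomy_xml_relationship_synonyms(). The names `canonical` and `build_model` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Illustrative synonym table, limited to spellings that appear in this text.
PREDICATE_SYNONYMS = {
    "parent": {"broader terms", "broader term"},
    "child": {"narrower terms"},
    "related": {"related terms", "related term", "related", "rt", "seealso"},
    "synonym": {"used for"},
    "preferred": {"use"},
}

SAMPLE = """\
Cyclones, Use, Storms
Disasters, Used for, Natural disasters
Storms, Used for, Cyclones
Storms, Broader Terms, Weather
Storms, Related Terms, Disasters
Tidal waves, Use, Tsunami
Tsunami, Used for, Tidal waves
Tsunami, Broader Terms, Disasters
Tsunami, Related Terms, Oceans
Weather, Narrower Terms, Rain
Weather, Narrower Terms, Storms
Weather, Narrower Terms, Wind
Weather, Related Terms, Meteorology
"""


def canonical(predicate):
    """Map any known spelling of a relationship to one canonical name."""
    key = predicate.strip().lower()
    for name, spellings in PREDICATE_SYNONYMS.items():
        if key in spellings:
            return name
    raise ValueError(f"unknown relationship: {predicate!r}")


def build_model(csv_text):
    """Build {term: {parents, related, synonyms}} from subject,predicate,object rows."""
    terms = defaultdict(
        lambda: {"parents": set(), "related": set(), "synonyms": set()}
    )
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        subject, predicate, target = (cell.strip() for cell in row)
        kind = canonical(predicate)
        if kind == "parent":
            terms[subject]["parents"].add(target)
            terms[target]  # make sure the parent exists as a term
        elif kind == "child":
            # The same fact stated from the other end.
            terms[target]["parents"].add(subject)
            terms[subject]
        elif kind == "related":
            # Assumption: "related" is symmetric, and an otherwise
            # undeclared target still becomes a term.
            terms[subject]["related"].add(target)
            terms[target]["related"].add(subject)
        elif kind == "synonym":
            terms[subject]["synonyms"].add(target)
        elif kind == "preferred":
            # "X, Use, Y" means X is a non-preferred synonym of Y.
            terms[target]["synonyms"].add(subject)
    return dict(terms)


model = build_model(SAMPLE)
for name in sorted(model):
    print(name, sorted(model[name]["parents"]))
```

Using sets means that stating a relationship twice, once from each end, is harmless: "Storms, Broader Terms, Weather" and "Weather, Narrower Terms, Storms" collapse into a single parent link.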
A quick way to prototype a taxonomy is to create it in a text file with one term per line, listing only the parent (or "Broader Term") of each, which is enough to define an extensive hierarchy. If you are importing from other sources, it should be possible to massage the data into a spreadsheet that can save a CSV looking something like the example above.
CSV format is only supported for import. No export is yet available. A much simpler (less powerful) module project was taxonomy_csv.module ... only mentioned for historical/comparison reasons.
This is an alternate Comma-Separated-Value format, taking each term on a new line with its ancestors repeated in each previous column.
  Media,
  Media, Books
  Media, Books, Fiction
  Media, Books, Non-fiction
  Media, DVDs & Videos
  Media, Magazines & Newspapers
  Media, Music
  Media, Sheet Music
...etc. It's very limited (and wordy), but also about as obvious as possible.
This format was used by Google Base for its merchant product taxonomy, and has been suggested as a primitive format before now. It's not encouraged, but it is one of the lowest-common-denominator ways of getting a hierarchy into the system. It is not supported for export.
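Reading this ancestor-path layout takes very little logic: the last non-empty cell on a line is the term, and the cell before it is its parent. The following is a minimal Python sketch, assuming that term names are unique across the file (a real importer would have to key on the full path). The function name `parse_paths` is hypothetical.

```python
import csv
import io

# Sample rows from the text, one term per line with its ancestors first.
SAMPLE = """\
Media,
Media, Books
Media, Books, Fiction
Media, Books, Non-fiction
Media, DVDs & Videos
Media, Magazines & Newspapers
Media, Music
Media, Sheet Music
"""


def parse_paths(csv_text):
    """Return {term: parent}; top-level terms map to None."""
    parents = {}
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        # Drop empty cells, such as the one left by a trailing comma.
        path = [cell.strip() for cell in row if cell.strip()]
        if not path:
            continue
        term = path[-1]
        # Assumption: names are unique, so the name alone identifies a term.
        parents[term] = path[-2] if len(path) > 1 else None
    return parents


hierarchy = parse_paths(SAMPLE)
for term, parent in hierarchy.items():
    print(f"{term} <- {parent}")
```

The format can express nothing beyond parent and child, which is the limitation noted above: there is no place for synonyms, related terms, or descriptions.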
For interchange with the newer information methodologies on the web, RDF is the preferred syntax. Although it's very verbose, and much harder for humans to read, it has many advantages when it comes to data interchange over the web.
The dialect of RDF used in this module (even within this strict schema there are markup variations possible) is intended to resemble the (non-normative) examples found in the W3C recommendations, specifically the sample Wine Ontology [RDF].
In practice, this means the following attributes are used to define a taxonomy term.
Several other dialects are supported for input (e.g. skos:Concept, skos:broader, skos:narrower) but are not serialized for output.
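As an illustration of what the SKOS input dialect looks like, the sketch below pulls a hierarchy out of a small RDF/XML document using skos:Concept, skos:broader and skos:narrower. It is a Python sketch under several assumptions: the example.org URIs and labels are invented, the function name `read_skos` is hypothetical, and only one RDF/XML shape (typed concept elements with rdf:about) is handled. It is not how the module parses RDF, and a real RDF parser accepts many more serializations.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical input: the URIs and labels below are invented for illustration.
SAMPLE = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="http://example.org/terms/weather">
    <skos:prefLabel>Weather</skos:prefLabel>
    <skos:narrower rdf:resource="http://example.org/terms/storms"/>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/terms/storms">
    <skos:prefLabel>Storms</skos:prefLabel>
    <skos:broader rdf:resource="http://example.org/terms/weather"/>
  </skos:Concept>
</rdf:RDF>
"""


def read_skos(rdf_xml):
    """Return (labels, parents), both keyed by concept URI."""
    root = ET.fromstring(rdf_xml.strip())
    labels, parents = {}, {}
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        uri = concept.get(f"{{{RDF}}}about")
        labels[uri] = concept.findtext(f"{{{SKOS}}}prefLabel")
        parents.setdefault(uri, set())
        # skos:broader points from a child to its parent.
        for broader in concept.findall(f"{{{SKOS}}}broader"):
            parents[uri].add(broader.get(f"{{{RDF}}}resource"))
        # skos:narrower states the same link from the parent's side.
        for narrower in concept.findall(f"{{{SKOS}}}narrower"):
            child = narrower.get(f"{{{RDF}}}resource")
            parents.setdefault(child, set()).add(uri)
    return labels, parents


labels, parents = read_skos(SAMPLE)
for uri in sorted(parents):
    print(labels.get(uri), sorted(parents[uri]))
```

As with the CSV import, a link stated from both ends resolves to a single parent relationship.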
Remember, RDF requires an extra library.
RDF can express a similar concept in slightly different ways. Other target input sources include:
For RDF input parsing, we use a GPL library, ARC, from appmosphere (arc.semsol.org). This is PHP4 compatible. RDF processing is seldom efficient, either in memory or time, so there may be difficulties with large imports.
For RDF output creation, we use the PHP5 XML/DOM functions.
This makes RDF output incompatible with PHP4, which had
very flaky XML functions. If you are trying to use these
Web 2.0 features, you really must upgrade from the (now
officially unsupported) PHP4, as supporting this legacy
version would drag development down.
So the situation is, older unpatched servers can take
advantage of the distributed RDF vocabularies, but cannot
easily distribute their own.