I recently had the luck to participate in the BabelNet Workshop that was organized by the European Commission, the Publication Office and the European Parliament, in Luxembourg on the 2nd and 3rd of March.
The workshop provided valuable insights and a technical guided tour of BabelNet, the largest online multilingual encyclopedic dictionary and lexicalized semantic network, and of Babelfy, the multilingual state-of-the-art disambiguation and entity linking system based on BabelNet.
The intense two-day workshop program enabled participants from EU institutions, industry and academia to gain hands-on experience with the different ways to access and leverage BabelNet and Babelfy, to disambiguate text written in mixed languages using the so-called language-agnostic setting as well as to explore ways of linking other resources to BabelNet.
BabelNet’s huge potential for research and industry applications was discussed in a truly interactive atmosphere. Three case studies demonstrated how to make EU resources, namely IATE, EUROVOC and Euramis, more effective by linking them to BabelNet. Speakers from industry and academia gave interesting ‘Lightning talks’, explaining how they used BabelNet for improving text alignment, text classification, sentiment analysis and other areas.
A little background
BabelNet is a result of the 5-year MultiJEDI ERC Starting Grant (2011-2016), headed by Prof. Roberto Navigli of Sapienza University of Rome. The project has received funding from the European Union's specific program ‘Ideas’ under the 7th Framework Program. Its two main objectives were to create large-scale lexical resources for dozens of languages and enable multilingual text understanding.
BabelNet has been created by the automatic seamless integration of resources such as WordNet, Wikipedia, Wikidata, Wiktionary, OmegaWiki and GeoNames, and the use of statistical machine translation (SMT) to acquire a large amount of multilingual concept lexicalizations. It currently covers 14 million concepts and named entities (NEs) lexicalized in 271 languages. The multilingual lexicalizations (i.e. words) are grouped into sets of synonyms called Babel synsets.
BabelNet is fully integrated with Babelfy, a unified graph-based approach to multilingual entity linking and word sense disambiguation, as well as with Wikipedia Bitaxonomy, a state-of-the-art taxonomy of Wikipedia pages aligned to a taxonomy of Wikipedia categories. The BabelNet team is currently working on linking concepts to the 30 domains (topics) of the Wikipedia featured articles, plus a few additional domains such as ‘Fashion’.
Access to knowledge
BabelNet and Babelfy can be accessed and queried online either from a browser or programmatically:
- Through BabelNet’s web interface, users can search for a term and get multilingual results in any available language: the Babel synsets (sets of synonyms), the synset ID, Babel senses, synset domain, definitions, pictures and the part of speech. By clicking on a specific result, the user gets related concepts, pronunciation, source information, usage examples, definitions/glosses, hypernyms and so on. Users can also explore the network through a fancy dynamic graph.
- Programmatically, users can access BabelNet and Babelfy through an HTTP API and a Java API for each, plus SPARQL (for BabelNet only). Usage is metered through a credit system called Babelcoins. Anyone can register and experiment with it. The free basic service offers 1K Babelcoins (equal to 1K queries), and this amount can be increased upon request. This means that anyone (be it an organization or an individual) who wants to experiment and understand how the BabelNet service works can do so for free.
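To make the Babelcoin idea concrete, here is a minimal Python sketch of a client that builds a query URL and counts down a local budget of one Babelcoin per query. The base URL, the endpoint name and the parameter names are placeholders invented for illustration, not the documented BabelNet API, so check the official documentation for the real ones. No network request is made.

```python
from urllib.parse import urlencode

# Hypothetical values: base URL, endpoint and parameter names are
# placeholders for illustration, not the documented BabelNet API.
BASE_URL = "https://example.org/babelnet-api"


class BabelcoinBudget:
    """Client-side counter mirroring the one-query-per-Babelcoin rule."""

    def __init__(self, coins=1000):
        self.remaining = coins

    def spend(self, queries=1):
        # Refuse to go below zero so the caller notices an exhausted budget.
        if queries > self.remaining:
            raise RuntimeError("Babelcoin budget exhausted")
        self.remaining -= queries
        return self.remaining


def build_query_url(endpoint, key, **params):
    """Return the full request URL for a (hypothetical) endpoint."""
    query = urlencode({**params, "key": key})
    return f"{BASE_URL}/{endpoint}?{query}"


budget = BabelcoinBudget()  # free tier: 1K Babelcoins
url = build_query_url("getSynsetIds", key="YOUR_KEY", lemma="bank", lang="EN")
budget.spend()  # one query costs one Babelcoin
print(url)
print(budget.remaining)
```

Keeping the counter on the client side is only a convenience for experiments. The service itself remains the authority on how many Babelcoins are left.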
The latest version of BabelNet (3.6) is also a knowledge base, offering semantic relations from linked resources and information extraction techniques, domain labels for millions of synsets and phrases and collocations for most of them. For commercial purposes the way to go is to contact the BabelNet team in order to explore the best ways to leverage the resource and set up a mutually beneficial collaboration. BabelNet and Babelfy can also be licensed offline for research purposes only. Both BabelNet and Babelfy resources and their APIs are made available under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license.
BabelNet is also provided as a Linked Data Interface as part of the Linguistic Linked Open Data Cloud (LLOD). Prof. Asunción Gómez-Pérez of Technical University of Madrid (UPM) discussed Linked Data as a source of large background knowledge for NLP. More specifically, Gómez-Pérez discussed the need and benefit of connecting licensed language resources as well as of establishing a license for the link between resources. She said that closed resources can also be linked, in which case the owner negotiates with the interested party regarding accessibility issues (price, license, etc.).
There are different policies to govern conditional access to Linked Data. For example, access may be provided for a price, either for the whole set or per triple (i.e. a statement in ‘subject/predicate/object’ form). The metadata of the dataset is kept in the DataHub, the content is in the LLOD, while the actual datasets are kept by the owners.
Linked Data has three kinds of users: programmers, who can build applications that issue SPARQL queries and get RDF back; citizens and other end users, who access the data through a user interface; and machines, which exchange data with each other and achieve semantic interoperability through RDF.
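The triple form and the two pricing options mentioned above can be illustrated with a small Python sketch. It stores a toy dataset as subject/predicate/object tuples, matches a pattern against it in the way a very simple query would, and compares a per-triple price with a whole-set price. All identifiers and prices are invented, and the matching function is only a stand-in for what a real SPARQL endpoint does.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Toy dataset; every identifier here is made up for illustration.
DATASET = [
    Triple("ex:synset1", "ex:hasLemma", "bank"),
    Triple("ex:synset1", "ex:hasLanguage", "EN"),
    Triple("ex:synset2", "ex:hasLemma", "banca"),
    Triple("ex:synset2", "ex:hasLanguage", "IT"),
]


def match(dataset, subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in dataset
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    ]


def access_cost(n_triples, price_per_triple, whole_set_price):
    """Cheaper of per-triple and whole-set pricing (hypothetical prices, in cents)."""
    return min(n_triples * price_per_triple, whole_set_price)


lemmas = match(DATASET, predicate="ex:hasLemma")
cost = access_cost(len(lemmas), price_per_triple=5, whole_set_price=15)
print(len(lemmas), cost)
```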
Industrial application areas of BabelNet
In one of his presentations, Navigli discussed some industrial applications of BabelNet:
- Concept and Named Entity Extraction: For companies interested in extracting knowledge (concepts & NEs) from their document base in any language and making the results comparable across languages.
- Dictionary of the Future: For producers of language resources (LRs) – Today’s dictionaries are islands with no bridge between them. BabelNet offers a way to bridge these unconnected or offline dictionary islands and integrate them seamlessly into a wide-coverage multilingual semantic network, with labeled relations, pictures and multilingual synsets. All information will be integrated in a single entry.
- New publishing initiatives: For producers of educational and technical multimedia content – Annotate text with concepts, definitions and images and provide multilingual content.
- Multilingual Text Analysis: For all content producers (media, etc.) – Provide multilingual semantic analytics of text in any language for both concepts and NEs.
- Computer-Assisted Translation: For producers of CAT software – Assist in text alignment and cater for misalignment issues.
Navigli said that BabelNet has a lot of potential for machine translation (MT). The MT community is heavily focused on statistical methods and has so far shown no interest in how BabelNet could help improve machine translation quality. Navigli mentioned that, if such interest does not arise in the near future, he is tempted to demonstrate BabelNet’s potential in MT himself, although MT is not his field.
Content producers and data owners may wonder if linking their data to BabelNet would mean sharing them publicly. Navigli confirmed that they don’t have to, unless they want to. Both parties can benefit from the resource linking in various ways.
On the one hand, BabelNet can benefit by enhancing its knowledge base coverage, by using the data to improve disambiguation techniques, by introducing a BabelNet Plus version for enhanced services and so on. On the other hand, content producers and data owners can use their linked resources as they want. For example, if a translation memory (TM) resource such as the TAUS Data Cloud were linked to BabelNet, the synsets and concepts could be used for quality assessment of translations. Each TM segment would be linked to specific synsets and concepts, and the segment itself would be the context for disambiguation. An algorithm comparing the matching scores and the distance of synsets in the graph could judge whether the source and the target are a (good) translation of each other.
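The graph-distance half of that idea can be sketched in a few lines of Python. The sketch assumes that each TM segment has already been disambiguated into a list of synset IDs, and it scores a source/target pair by how close each source synset is to its nearest target synset in a toy graph. The graph, the IDs, the `max_distance` cut-off and the scoring formula are all assumptions made for illustration. Disambiguation (matching) scores are left out, and nothing here is presented as the algorithm discussed at the workshop.

```python
from collections import deque

# Toy undirected synset graph; all IDs and edges are invented.
GRAPH = {
    "s:bank_institution": {"s:money", "s:finance"},
    "s:money": {"s:bank_institution", "s:finance"},
    "s:finance": {"s:bank_institution", "s:money"},
    "s:river_bank": {"s:river"},
    "s:river": {"s:river_bank"},
}


def distance(graph, start, goal):
    """Breadth-first shortest path length, or None if goal is unreachable."""
    if start == goal:
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return depth + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return None


def segment_score(graph, source_synsets, target_synsets, max_distance=3):
    """Average closeness in [0, 1] of each source synset to its nearest target synset."""
    if not source_synsets or not target_synsets:
        return 0.0
    total = 0.0
    for src in source_synsets:
        found = (distance(graph, src, tgt) for tgt in target_synsets)
        reachable = [d for d in found if d is not None]
        # Unreachable or very distant synsets are capped at max_distance.
        best = min(reachable) if reachable else max_distance
        total += 1.0 - min(best, max_distance) / max_distance
    return total / len(source_synsets)


source = ["s:bank_institution", "s:money"]
good = segment_score(GRAPH, source, ["s:bank_institution", "s:money"])
bad = segment_score(GRAPH, source, ["s:river_bank", "s:river"])
print(good, bad)
```

A real quality check would also weight each synset by its disambiguation score and tune the threshold at which a pair counts as a good translation.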
Andrzej Zydron, CTO of XTM International, said that they are mainly using BabelNet to create large bilingual dictionaries on the fly in order to address alignment problems such as the runaway problem, i.e. when one segment is misaligned and the rest of the document goes wrong as a result. They have licensed BabelNet for 50 languages. He mentioned that it takes about one hour to produce a large dictionary for any language pair.
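Because Babel synsets group the lexicalizations of one concept across languages, a bilingual dictionary can in principle be read off them directly. The Python sketch below does this for a toy set of synsets. The data layout and the function name are assumptions made for illustration, and this is not a description of how XTM builds its dictionaries.

```python
from collections import defaultdict

# Toy synsets: ID -> language code -> lexicalizations. Invented data.
SYNSETS = {
    "syn:1": {"EN": ["house", "home"], "IT": ["casa"], "DE": ["Haus"]},
    "syn:2": {"EN": ["dog"], "IT": ["cane"]},
    "syn:3": {"EN": ["bank"], "DE": ["Bank"]},
}


def bilingual_dictionary(synsets, source_lang, target_lang):
    """Map each source-language word to the sorted target words sharing a synset."""
    entries = defaultdict(set)
    for lexicalizations in synsets.values():
        targets = lexicalizations.get(target_lang, [])
        if not targets:
            continue  # this concept has no lexicalization in the target language
        for word in lexicalizations.get(source_lang, []):
            entries[word].update(targets)
    return {word: sorted(targets) for word, targets in entries.items()}


en_it = bilingual_dictionary(SYNSETS, "EN", "IT")
print(en_it)
```

Words that only occur in synsets without a target-language lexicalization, such as "bank" for English to Italian here, simply do not get an entry.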
BabelNet Annotation Group (BANG)
Navigli gave an introduction to BANG, a collaborative experiment that has no precedent due to the type of annotation (Wiktionary is very different). BabelNet users registered on the website can access blocks of synsets to be annotated in their native language. The data will be made openly available with a Creative Commons free license. Users have several incentives to perform such annotations: they help achieve a larger coverage of their own language, they can contribute pictures for difficult and abstract concepts, which can be a fun process, and so on.
BabelNet’s future steps
With the completion of the MultiJEDI project, Roberto Navigli is in the process of founding the Sapienza startup company Babelscape to take over and exploit the project’s outcomes: BabelNet and Babelfy. He welcomes interested parties to contact him with ideas on how they can collaborate and leverage BabelNet.
We sure look forward to the release of BabelNet live, which will be updated every day or every week. This will be a crucial improvement of the next versions, as any update will be (almost) immediately available to BabelNet users, especially with regard to the Named Entities for disambiguation purposes.
For more information about the event and access to all presentations you can visit the Luxembourg BabelNet Workshop site. An interesting article by Andrew Joscelyne on BabelNet, How the World Can Help Disambiguate Words, was published in TAUS Review #3 in April 2015.