Introduction and partners:
“A SOLID start for communities”: This blog post introduces a collaboration between the WIMMICS research team and the company Startin’blox on searching and querying data in a Solid ecosystem. With Solid, data can be distributed over multiple storages (PODs) across multiple servers. In such a scenario, finding a piece of information becomes complex and may lead to excessive response times or resource consumption. Our work is to study how these limiting factors can be reduced to improve performance.
WIMMICS is a joint research team between Inria, the Université Côte d’Azur and the CNRS (I3S). Its researchers are interested in the representation and processing of knowledge graphs, particularly on the Web.
The collaboration also involves the Mycelium and the Data Food Consortium projects. Both are working to make short supply chains more efficient. These projects want to empower short supply chain actors digitally by giving them the ability to build a whole suite of tools that they control. Solid is a great choice because it gives users back control over their data and applications through interoperability based on open standards.
Detailed project:
The objective of this collaboration is the design and evaluation of methods for search, indexing and discovery of services and datasets within the Solid ecosystem. The Solid project, for SOcial LInked Data, launched in 2015 by Tim Berners-Lee and incubated at the W3C, proposes the specification of a new web application architecture that completely decouples data storage from business applications. A massive deployment of applications respecting the Solid standards would thus make it possible to re-establish decentralisation on the web and let users keep control of their data, in storages called PODs. At present, the project consists of about ten specifications at varying levels of maturity, and a very active community is working on several implementations. However, some areas are not yet covered, such as the querying of this distributed data.
Our aim is to design and evaluate methods for searching and querying distributed data in a Solid ecosystem. The ability to perform advanced searches on large volumes of data with acceptable performance is one of the foundations of information flow and of the construction of social applications. We are investigating how to build, on top of the Solid architecture, capabilities for service discovery, path-finding and access to distributed datasets, by standardising the search and filtering capabilities of PODs. We are considering SPARQL traversal or decentralised query approaches to design a pilot architecture that also meets the performance challenges, for example via cache or index systems. This would allow us to support the adoption of the Solid ecosystem at web scale and to demonstrate the deployment of real applications based on Solid.
Existing tools and first development
Solid relies on two complementary protocols that both address interoperability for web applications: a client-to-server standard and a client-to-client standard. The first defines a set of universal rules and acts as a frame for the second, which focuses on domain-specific concerns. Search and discovery of data might be part of either protocol, or of both.
When we think about searching and discovering data, the first idea that arises is indexing. An index is like the contacts application on our phone. Using this app, it is very easy to find the number of the person we want to call given their name. Without this kind of index, we would have to browse every number in our phone until we find the right one. This is a very long and tiring process for a human. A machine would be faster, but it would still take too long for large amounts of data, such as a huge public directory. Indexing is a solution to make searching and discovering much faster.
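To make the phone analogy concrete, here is a minimal sketch in Python. The contact names, the numbers and the function names are all made up for illustration. It only contrasts browsing every entry with looking a name up in an index built once beforehand.

```python
# Hypothetical contact list: (name, phone number) pairs, in no particular order.
contacts = [
    ("Alice", "+33 1 00 00 00 01"),
    ("Bob", "+33 1 00 00 00 02"),
    ("Carol", "+33 1 00 00 00 03"),
]


def find_by_scan(entries, name):
    """Without an index: browse every entry until the name matches."""
    for entry_name, number in entries:
        if entry_name == name:
            return number
    return None


def build_index(entries):
    """Build the index once: a mapping from name to phone number."""
    index = {}
    for entry_name, number in entries:
        # Keep the first number seen for a name, like the scan above does.
        index.setdefault(entry_name, number)
    return index


def find_by_index(index, name):
    """With an index: a single dictionary lookup, however many entries exist."""
    return index.get(name)


index = build_index(contacts)
print(find_by_scan(contacts, "Carol"))
print(find_by_index(index, "Carol"))
```

Both calls return the same number. The difference is cost: the scan grows with the size of the directory, while the lookup does not, at the price of building and maintaining the index.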
Currently, the only way of doing indexing in Solid is through the client-to-client standards. There are no indexing rules defined in the scope of the Solid client-to-server standard. One of the reasons might be that the way of indexing things can be very specific and complex. Think, for example, of an application that wants to index real estate ads using complex criteria and calculations. Indexing in the client-to-server area would require a universal solution able to express the wide variety of indexing possibilities, along with a mechanism that tells applications how to use the different indexes that have been created.
The TypeIndex project is the only active proposal for indexing data, whatever its nature. At this time, it only allows data to be indexed by type. With TypeIndex you can tell, for instance, where some of your contacts or some of your movies can be found on your storage. The proposal lets us provide a public type index that everybody can use and a private type index whose access is restricted to the owner or to granted agents.
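The idea of a type index can be sketched as follows. This is a simplified in-memory model written for this post, not the TypeIndex data format itself: the `TypeRegistration` class, the `discover` function and the `alice.example` URLs are hypothetical, and the boolean `authorized` flag stands in for real access control. Each registration simply says "data of this type can be found at this location".

```python
from dataclasses import dataclass

# Type IRIs used as examples (schema.org classes).
PERSON = "http://schema.org/Person"
MOVIE = "http://schema.org/Movie"


@dataclass(frozen=True)
class TypeRegistration:
    """Hypothetical, simplified registration: a type and where to find it."""
    for_class: str  # IRI of the type being indexed
    location: str   # IRI of a resource or container holding data of that type


# Hypothetical public index: readable by everybody.
public_index = [
    TypeRegistration(MOVIE, "https://alice.example/public/movies/"),
]

# Hypothetical private index: readable by the owner or granted agents only.
private_index = [
    TypeRegistration(PERSON, "https://alice.example/private/contacts/"),
    TypeRegistration(MOVIE, "https://alice.example/private/watchlist/"),
]


def locations_for(type_index, class_iri):
    """Return every registered location for the given type."""
    return [r.location for r in type_index if r.for_class == class_iri]


def discover(class_iri, authorized=False):
    """Look in the public index, and in the private one if access is granted."""
    found = locations_for(public_index, class_iri)
    if authorized:
        found = found + locations_for(private_index, class_iri)
    return found


print(discover(MOVIE))
print(discover(MOVIE, authorized=True))
print(discover(PERSON))
```

An anonymous visitor only learns about the public movie container, while the owner also sees the private watchlist and contacts. Note what the sketch cannot do: it answers "where are the movies?" but not "which movies were released after 2010?", which is the limitation of indexing by type alone.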
Such are the starting points we intend to evaluate and extend in this collaboration with the goal of supporting traversal query processing of Solid ecosystems.
Authors: Maxime Lecoq-Gaillard, Pierre-Antoine Champin, Benoît Alessandroni, Fabien Gandon.