Bias in the Web

Ricardo Baeza-Yates

Auditorio Ramon Picarte, Beauchef 851, 3er Piso edificio Norte, DCC, Universidad de Chile.

12:00 24/07/2017

The Web is the most powerful communication medium and the largest public data repository that humankind has created. Its content ranges from great reference sources such as Wikipedia to ugly fake news. Indeed, social (digital) media is just an amplifying mirror of ourselves. Hence, the main challenge of search engines and other websites that rely on web data is to assess the quality of such data. However, as all people has their own biases, web content as well as our web interactions are tainted with many biases. Data bias includes redundancy and spam, while interaction bias includes activity and presentation bias. In addition, sometimes algorithms add bias, particularly in the context of search and recommendation systems. As bias generates bias, we stress the importance of debiasing data as well as using the context and other techniques such as explore & exploit, to break the filter bubble. The main goal of this talk is to make people aware of the different biases that affect all of us on the Web. Awareness is the first step to be able to fight and reduce the vicious cycle of bias.

The complexity of reverse engineering problems for conjunctive queries

Reverse engineering problems for conjunctive queries (CQs), such as query by example (QBE) or definability, take a set of user examples and convert them into an explanatory CQ. Despite their importance, the complexity of these problems is prohibitively high (coNEXPTIME-complete). We isolate their two main sources of complexity and propose relaxations of them that reduce the complexity while having meaningful theoretical interpretations. The first relaxation is based on the idea of using existential pebble games for approximating homomorphism tests. We show that this characterizes QBE/definability for CQs up to treewidth k while reducing the complexity to EXPTIME. As a side result, we obtain that the complexity of the QBE/definability problems for CQs of treewidth k is EXPTIME-complete for each k≥1. The second relaxation is based on the idea of "desynchronizing" direct products, which characterizes QBE/definability for unions of CQs and reduces the complexity to coNP. The combination of these two relaxations yields tractability for QBE and characterizes it in terms of unions of CQs of treewidth at most k. We also study the complexity of these problems for conjunctive regular path queries over graph databases, showing them to be no more difficult than for CQs.

Efficient Computation of Certain Answers: Breaking the CQ Barrier

Computing certain answers is the standard way of answering queries over incomplete data; it is also used in many applications such as data integration, data exchange, consistent query answering, ontology-based data access, etc. Unfortunately certain answers are often computationally expensive, and in most applications their complexity is intolerable if one goes beyond the class of conjunctive queries (CQs), or a slight extension thereof. However, high computational complexity does not yet mean one cannot approximate certain answers efficiently. In this talk we survey several recent results on finding such efficient and correct approximations, going significantly beyond CQs. We do so in a setting of databases with missing values, and first-order (relational calculus/algebra) queries. Even the class of queries where the standard database evaluation produces correct answers is larger than previously thought. When it comes to approximations, we present two schemes with good theoretical complexity. One of them also performs very well in practice, and restores correctness of SQL query evaluation on databases with nulls.

Efficient Query Processing Under Updates

Modern computing tasks such as real-time analytics require constant refresh of query results under high update rates. Incremental View Maintenance (IVM) approaches this problem by materializing results in order to avoid recomputations. IVM naturally induces a trade-off between the space needed to maintain the materialized results and the time used to process updates. In this talk we will discuss a new approach for evaluating queries under updates. Instead of the materialization of results, we aim to maintain a data structure featuring (1) efficient maintenance under updates, (2) constant-delay enumeration of the output, (3) constant-time lookups in the output, while (4) using only linear space in the size of the database. We will describe DYN, a dynamic version of the Yannakakis algorithm, and show that it yields such data structures for the class of free-connex acyclic marginalize-join queries. We show that this is optimal in the sense that such a data structure cannot exist for marginalize-join queries that are not free-connex acyclic. In addition, we identify a sub-class of queries for which DYN features constant-time update per tuple. We introduce a cost model for DYN under different join trees, and describe how to evaluate under updates the more general class of acyclic aggregate join queries. Finally, using the industry-standard benchmarks TPC-H and TPC-DS, we experimentally compare DYN and a higher-order IVM (HIVM) engine.

Cienciometría: Índices y mapas

La cienciometría es la disciplina que estudía la producción científica, con la finalidad de analizarla y cuantificarla. A partir de bases de datos de publicaciones, caracteriza a los investigadores, cuantifica sus resultados en términos de impacto, identifica patrones de colaboración y estudia la propagación de nuevo conocimiento. En esta charla se discute el uso de índices de impacto como Journal Citation Report y H-index, en cuanto a sus alcances y limitaciones. Se presenta además un método automático de construcción de mapas de la ciencia, denominado Research Space, el cual permite caracterizar la producción científica a distintos niveles de agregación (individuos, instituciones, países). Se muestran propiedades predictivas del Research Space y se discuten sus alcances y limitaciones.

The Multiset semantics of SPARQL

We will present and algebraic and logic semantics for the core of SPARQL, a fragment that correspond to Multiset Datalog with safe negation and Multiset Relational Algebra. We will show that one can leverage for multisets, the classic correspondence between logic, algebra and query languages existing for set semantics.

Statistical Query Algorithms for Stochastic Convex Optimization

Statistical query (SQ) algorithms are the class of learning algorithms that can be implemented using approximate expectations of any given function of the input distribution, as opposed to direct access to i.i.d. samples. This computational model has a number of applications, ranging from noise-tolerant learning to differential privacy, and it has been used to obtain unconditional lower bounds on conjectured hard problems over distributions. In this talk I will give an introduction to the theory of SQ algorithms, and we will see some recent developments in the case of stochastic convex optimization. Our main contribution is establishing nearly optimal SQ algorithms for mean vector estimation (this includes stochastic linear programming), which serves as a basis to obtain SQ versions of various gradient-type and polynomial time algorithms for stochastic convex optimization. Time permitting, I will show some consequences of our results for learning of halfspaces, differential privacy, and proving unconditional lower on the power of convex relaxations for random constraint satisfaction problems. This talk is based on joint work with Vitaly Feldman and Santosh Vempala, to appear in SODA 2017.

The interplay between word equations with regular constraints and graph querying with path comparisons

It is natural in some contexts to extend word equations with regular or rational constraints. Such constraints restrict the space of solutions, by imposing some pairs of variables to be witnessed by a pair of words in a given regular or rational language. As of late, word equations with such constraints have played a prominent role in understanding the complexity of evaluation for graph query languages with path comparisons. These languages find applications, e.g., in provenance, route-finding, social networks, and the semantic web. In this tutorial, we present several results generated in the interplay of word equations with rational constraints and graph querying. While word equations with rational constraints are, in general, undecidable, the focus on "practical" rational relations -- e.g., prefix, subsequence, or suffix -- that graph querying imposes force us to revisit such results and perform a more fine grained analysis of the complexity of such equations. On the other hand, we apply known techniques from word equations with constraints to obtain bounds for graph query evaluation that would have been difficult to obtain otherwise.

Querying Wikidata: Comparing SPARQL, Relational and Graph Databases

Wikidata is a new knowledge-base overseen by the Wikimedia foundation and collaboratively edited by a community of thousands of users. The goal of Wikidata is to provide a common interoperable source of factual information for Wikimedia projects, foremost of which is Wikipedia. In this talk, we present the result of our experiments that compare experimentally the efficiency of various database engines for the purposes of querying the Wikidata knowledge-base, which can be conceptualized as a directed edge-labelled graph where edges can be annotated with meta-information called qualifiers. We take two popular SPARQL databases (Virtuoso, Blazegraph), a popular relational database (PostgreSQL), and a popular graph database (Neo4J) for comparison and discuss various options as to how Wikidata can b e represented in the models of each engine. We design a set of experiments to test the relative query performance of these representations in the context of their respective engines. We first execute a large set of atomic lookups to establish a baseline performance for each test setting, and subsequently perform experiments on instances of more complex graph patterns based on real-world examples. We conclude with a summary of the strengths and limitations of the engines observed. The talk is bases on a paper made with Aidan Hogan, Christian Riveros, Carlos Rojas and Enzo Zerega, that will be presented in ISWC 2016.

Word transducers: from two-way to one-way

Regular word transductions extend the robust family of regular word languages, preserving many of its characterisations and algorithmic properties. Finite state transducers are a standard model for representing word transductions, and can be seen as automata extended with outputs. However, differently from automata, two-way transducers are strictly more expressive than one-way transducers. It has been recently shown how to decide if a two-way functional transducer has an equivalent one-way transducer, and the complexity of the algorithm is non-elementary. We will present an alternative and simpler characterisation that is decidable in EXPSPACE. In the positive case, the characterisation can be used to construct an equivalent one-way transducer of (worst-case optimal) doubly exponential size. We will finally discuss a generalisation of the result that characterises k-pass sweeping definability of transductions, and relate this to minimisation problems for registers of streaming transducers.

Tree Automata for Reasoning in Databases and Artificial Intelligence

In database management, one of the principal task is to optimize the queries to evaluate them efficiently. It is in particular the case for recursive queries for which their evaluation can lead to crawl all the database. In particular, one of the main question is to minimize the queries in order to avoid to evaluate useless parts of the query. The core theoretical question around this line of work is the problem of inclusion of a query in another. Interestedly, this question is related to an important question in IA which is to answer a query when the data is incomplete but rules are given to derive new information. This problem is called certain query answering. In both context, if both problem are undecidable in general, there are fragments based on guardedness that are decidable due to the fact there exists witness of the problems that have a bounded tree width and that their encoding in trees is regular. Furthermore, the queries can be translated in MSO. In both contexts, Courcelle’s Theorems imply the decidability of both problems. I will present to the different results on the translation of logic class of formula for our problems into tree automata to obtain tight bounds to the problems of inclusion of recursive queries or certain query answering.

BigData como Unificador de Información en el Estado

Es una práctica conocida perseguir la unificación de la información en una misma fuente, lo cual en un plano teórico parece muy simple en la realidad es un desafío enorme. La información en la realidad no es perfecta, los sistemas transaccionales no consiguen un 100% de efectividad en el control de variables, impactando directamente en la calidad de la información recopilada. Manipular millones de registros y bases con Gigas de Información es una problemática que hasta hace unos años se miraba con distancia, hoy es un problema real en el Estado y está irrumpiendo con fuerza en Chile. El Ministerio de Desarrollo Social es quien administra el Registro Social de Hogares, el instrumento de focalización vigente al día de hoy (antiguamente Ficha de Protección Social) y, al mes de Julio de 2016, éste registro contiene información de las características socioeconómicas de más de 4,5 millones de hogares y sobre 12 millones de personas a nivel nacional (70% de la población nacional). Esta base de datos, construida y actualizada permanentemente gracias al esfuerzo conjunto entre el Ministerio de Desarrollo Social y las 345 municipalidades del país, es relevante en el diseño y rediseño de prestaciones sociales que proveen distintas instituciones del Estado En esta charla se pretende mostrar cómo el Ministerio de Desarrollo Social utilizó su experiencia previa en la manipulación de bases estadísticas para la implementación exitosa de una plataforma BigData, los puntos críticos del proyecto, el equipo de trabajo y el impacto de la solución a nivel nacional y social.

A framework for annotating CSV-like data

In this presentation, we propose a simple and expressive framework for adding metadata to CSV documents and their noisy variants. The framework is based on annotating parts of the document that can be later used for selecting data. These expressions are then combined using a set of rules in order to annotate the data. We study the computational complexity of implementing our framework and present an efficient evaluation algorithm that runs in time proportional to its output and linear in its input. As a proof of concept, we test an implementation of our framework against a large number of real world datasets and show that it can be efficiently used in practice.

Big Data? Big Promise, Big Problems

The data that we record daily about ourselves through our cell phones, credit card purchases, emails, social media postings, etc., helps us connect with each other and improve our life quality. But we do not own nor control most of this data. I will discuss the good and the bad in this powerful information technology, including a number of cases – some inspiring, some terrifying – and what we can do as citizens to obtain the promise of big data while mitigating some of the worst problems. Among other things, I will trace problems with the release of "anonymized" data that is later "deanonymized".

Copyless cost register automata

Cost register automata (CRA) and its subclass, copyless CRA, were recently proposed by Alur et al. as a new model for computing functions over words. We study the expressiveness and closure properties of copyless CRA. In particular we compare this model with the well-know model of weighted automata. Weighted automata can be naturally characterized by a subfragment of weighted logic, a quantitative extension of MSO introduced by Droste and Gastin. Our results show that a similar characterization of copyless CRA seems to be unlikely. We introduce a new quantitative logic, which turns out to be equally expressive to a natural subclass of copyless CRA. Based on joint work with Cristian Riveros.

Acquiring and Exploiting Lexical Knowledge for Twitter Sentiment Analysis

This talk addresses the label sparsity problem for Twitter polarity classification by automatically building two type of resources that can be exploited when labelled data is scarce: opinion lexicons, which are lists of words labelled by sentiment, and synthetically labelled tweets. We build Twitter-specific opinion lexicons by training words-level classifiers using representations that exploit different sources of information such as (a) the morphological information conveyed by part-of-speech (POS) tags, (b) associations between words and the sentiment expressed in the tweets that contain them, and (c) distributional representations calculated from unlabelled tweets. Experimental results show that the generated lexicons produce significant improvements over existing manually annotated lexicons for tweet-level polarity classification. In the second part, we develop distant supervision methods for generating synthetic training data for Twitter polarity classification by exploiting unlabelled tweets and prior lexical knowledge. Positive and negative training instances are generated by averaging unlabelled tweets annotated according to a given polarity lexicon. We study different mechanisms for selecting the candidate tweets to be averaged. Our experimental results show that the training data generated by the proposed models produce classifiers that perform significantly better than classifiers trained from tweets annotated with emoticons, a popular distant supervision approach for Twitter sentiment analysis.

Extending Weakly-Sticky Datalog+-:Query-Answering Tractability and Optimizations

Weakly-sticky (WS ) Datalog is an expressive member of the family of Datalog programs that is based on the syntactic notions of stickiness and weak-acyclicity. Query answering over the WS programs has been investigated, but there is still much work to do on the design and implementation of practical query answering (QA) algorithms and their optimizations. Here, we study sticky and WS programs from the point of view of the behavior of the chase procedure, extending the stickiness property of the chase to that of generalized stickiness of the chase (gsch-property). With this property we specify the semantic class of GSCh programs, which includes sticky and WS programs, and other syntactic subclasses that we identify. In particular, we introduce joint-weakly-sticky (JWS) programs, that include WS programs. We also propose a bottom-up QA algorithm for a range of subclasses of GSCh. The algorithm runs in polynomial time (in data) for JWS programs. Unlike the WS class, JWS is closed under a general magic-sets rewriting procedure for the optimization of programs with existential rules. We apply the magic-sets rewriting in combination with the proposed QA algorithm for the optimization of QA over JWS programs.

Reconsidering REST: Do we need a new ontology for the Web?

In his PhD Fielding defined properties he asserted were what made the WWW work. The recent revision of the IETF HTTP standard retains much of his terminology and worldview. In this talk I argue that although a brilliant insight into the 1995 Web, it is now so out-of-date as to be seriously misleading: it actually makes the new standard harder to understand instead of easier. I'll present empirical evidence, based on two sets of caching proxy logs totalling 160M entries, to back up this claim.

Rebooting Peer Review Systems via Boot Strapping Databases?

The ``Peer Review'' system has been used in its current form for roughly one century. It has recently started to show its limits, from systemic failure such as the lack of repeatibility of scientific publication in various fields (Economics, Psychology, etc.), to isolated personal failures such as scientific frauds, and more generally to dangerous trends in the evaluation of academic careers, departments and universities, forming long term incentives potentially harmful to Human ingenuity and ability to progress scientifically. We postulate that such failures are a symptom of the growth of Academia, and that the pressure on the member of academia is stretching the current implementation of the peer review system beyond its limits. Accordingly, we list modern technologies, currently unused, which could be used for a new implementation of the peer-review process, we discuss the potential cost and advantages of each separately, and which ones and how they can be combined to form a soft transition to a new healthier peer review system. Among those techniques, we focus in particular on the concept of "Boot Strapping Databases", where all agents are taking turn at being author, referee and reader, and where the quality of one's good work as a referee at each tier is evaluated, and used as a condition to be allowed to participate as an author and as a reader. Such systems have been tested to correct student essays in "Calibrated Peer Review", and to incentivate the collaborative development of a database of teaching material at the University of Waterloo. We will open the discussion about how to introduce to manage the peer-reviewing of academic research via services such as the publishers Arxiv and the Cryptology ePrint Archive (

DASL: A Scala-based DSL for Graph Analytics on GPUs

For data intensive analytic challenges, memory bandwidth, not processor speed, is the primary performance limitation. Graphics processing units (GPUs) provide superior bandwidth to main memory and can deliver significant speedups over CPUs. However, it is not trivial to develop GPU accelerated graph algorithms. In contrast, to scale applications onto multicore, parallel architectures it requires significant expertise, including intimate knowledge of the CPU and GPU memory systems, and detailed knowledge of a GPU programming framework such as OpenCL or CUDA. To enable analytic experts to implement complex graph applications that efficiently run on GPUs we have developed a domain-specific language called DASL and a corresponding execution system that executes DASL programs by automatic translation into program code that is optimized for GPUs. In the talk I will introduce DASL based on a few simple examples, provide a technical overview on Blazegraph DASL, and discuss some aspects of the approach in more depth, including the translation of DASL statements and the integration of user defined functions.

Social media, information and behavior

Most of the available evidence suggests that there is a positive correlation between social media use and engagement in collective action, such as social movements and political protests. However, we do not know if this correlation results from a causal effect or, instead, merely reflects that participants happen to use social media more often. So long as the potential mechanisms by which online social networks may lead people to engage in collective action remain obscure, any causal attribution remains suspect. Furthermore, we know little about the communication flows that transpire in social media, including the conditions under which using Twitter, Facebook and similar platforms may influence citizens’ political behavior. Using a variety of quantitative and qualitative methods—including human and automated content analysis, cross-sectional and panel surveys, social network analysis, “small” and “big” data—I will present some of the work I’ve done mapping the functions, mechanisms and conditions under which social media relates to collective action in Chile.

Linking Open-world Knowledge Bases using Nonmonotonic Rules

Integrating knowledge from various sources is a recurring problem in Artificial Intelligence, often addressed by multi-context systems (MCSs). Existing MCSs however have limited support for the open-world semantics of knowledge bases (KBs) expressed in knowledge representation languages based on first-order logic. To address this problem we present knowledge base networks (KBNs), which consist of open-world KBs linked by non-monotonic bridge rules under a stable model semantics. Basic entailment in KBNs is decidable whenever it is in the individual KBs. In particular, for networks of KBs in well-known Description Logics (DLs), reasoning is reducible to reasoning in dl-programs of Eiter et al. As a by product, we obtain an embedding of Motik and Rosati’s hybrid MKNF KBs, which amount to a special case of KBNs, to dl-programs. We also show that reasoning in networks of ontologies in lightweight DLs is not harder than in answer set programming.

Designing a Flash Translation Layer

Although flash memory has risen as a popular secondary storage medium, its performance is difficult to predict. The reason is that internally, flash memory is subject to a complex set of constraints. Storage units must be erased before they are updated, erases have a bigger granularity than writes, and each storage unit has a limited lifetime in terms of erases. A flash translation layer (FTL) hides these constraints and exposes a simple block interface to the application. The problem is that the FTL's behind-the-scene work impacts performance. In this presentation, we show how to redesign the FTL so as to eliminate an important performance bottlenecks and thereby make the performance of flash devices more predictable.

Distributed Machine Learning in Yahoo Sponsored Search

Sponsored search consists in retrieving the "best'' advertisements that match a given search query. The term "best'' in this context has to take into account several dimensions, such as relevance, revenue and post-click quality. These dimensions, sadly, are usually poorly correlated. Therefore, retrieving an ad in response to a query is a problem that is much more difficult than retrieving a web search result where relevance is a dominant dimension. In this talk we overview how machine learning is used at Yahoo to improve our sponsored search. The term "distributed'' in this talk has a double meaning. First, we show how distributed representations help to solve the problem of retrieving ads in response to a query by coupling relevance filtering and retrieving ads taking in account the content and the context involved with click prediction. Second, we overview a distributed architecture that allows to compute these distributed representations much faster. This talk includes the work of many people at Yahoo Labs and Yahoo Platforms.

Data science for astronomy

In the last few years, there has been an increased interest towards computer applications for astronomical research. This interest is mainly triggered by the ongoing and future observational projects expected to deliver huge amounts of high-quality data, which needs to be promptly analyzed. Some illustrative examples are: the upcoming Large Synoptic Survey Telescope (LSST), the ongoing Vista Variables in the Via Láctea (VVV) ESO Public Survey, and the Atacama Large Millimeter/submillimeter Array (ALMA), among others. Besides the upcoming telescopes, today we have hundreds of astronomical catalogs of tens of millions of objects, available online. All this information makes the Astronomical analysis of stellar objects impossible without the help of computer systems. Last advances in data science, Machine Learning, and Statistics offer us very powerful solutions to explore Astronomical data, nevertheless, there are many challenges that we still need to overcome. In this talk I will present a summary of our main work in time series automatic classification, light-curve representation, automatic detection of specific types of stars, discovery of unknown classes of stellar objects and meta models for efficient and automatic integration of trained experts.

Learning Opinion Dynamics in Social Networks

Social media and social networking sites have become a global pinboard for exposition and discussion of news, topics, and ideas, where social media users increasingly form their opinion about a particular topic by learning information about it from her peers. In this context, whenever a user posts a message about a topic, we observe a noisy estimate of her current opinion about it but the influence the user may have on other users’ opinions is hidden. In this talks, we introduce SLANT, a probabilistic modeling framework of opinion dynamics, which allows the underlying opinion of a user to be modulated by those expressed by her neighbors over time. We then identify a set of conditions under which users’ opinions converge to a steady state, find a linear relation between the initial and steady state opinions, and develop an efficient estimation method to fit the model parameters from historical fine-grained opinion and information diffusion event data. Experiments on data gathered from Twitter, Reddit and Amazon show that our model provides a good fit to the data and more accurate predictions than alternatives.

Climate Research Infrastructure in Chile: Databases, modelling and the gap to Decision Making

The Center for Climate Research and Resilience (CR)2 focuses on Earth System Science, with an interdisciplinary approach that allows to better understand Climate and Ecological System, and societal resilience in Chile. In this talk, we will present the lifecycle of meteorological data that is used by (CR)2, the layers of aggregation applied to make it understandable by different end users, and its potential use in decision and policy making. In this context, getting access to sound local meteorological time series is critical for all research lines, and one of the main tasks still in progress, is to compile local observational databases as ancient as possible, and make them publicly available, in scientific format for different communities. This requires data recovery, compilation of existing data and some consistency checks. Other databases that are needed, are larger sets of gridded climatological data (which result from global and regional simulations both for past and future scenarios), satellite data, reanalisys products, as well as air or water quality related data. The sources are numerous, as are the formats and sizes that the data comes in. In parallel, the Center needs high performance computing environments, as well as large storage capacity, for the performance of climate simulations through earth system models. The main users of this data come from the scientific community itself, but generating statistics, analysis, and visualization products based on the this 'raw' data, allows to make climate information accessible and understandable to more end users. The ultimate step, is to incorporate climate scientific results in decision and policy making at different levels by means of Climate Services.

Schema Mappings and Data Examples: An Interplay between Syntax and Semantics

Schema mappings are high-level specifications that describe the relationship between two database schemas. Schema mappings have been explored in depth during the past decade and have turned out be the essential building blocks in data exchange and data integration. Since in real-life applications schema mappings can be quite complex, it is important to develop methods and tools for deriving and explaining schema mappings. A promising approach to this effect is to use ``good'' data examples that illustrate the schema mapping at hand. In this talk, we will present an overview of a body of work on characterizing and deriving schema mappings via a finite set of data examples, including characterizations of schema mappings via a finite set of universal examples and the learnability of schema mappings from data examples. Along the way, we will encounter tight connections between unique characterizability of schema mappings and homomorphism dualities in constraint satisfaction.

Forbidden vertices

In this talk, we introduce and study the forbidden-vertices problem. Given a bounded polyhedron P and a subset X of its vertices, we study the complexity of linear optimization over the subset of vertices of P that are not contained in X. This problem is closely related to finding the k-best basic solutions to a linear problem. We show that the complexity of the problem changes significantly depending on how both P and X are described, that is, on the encoding of the input data. For example, we analyze the case where the complete linear formulation of P is provided, as opposed to the case where P is given by an implicit description (to be defined in the talk.) When P has binary vertices only, we provide additional tractability results and linear formulations of polynomial size. Some applications and extensions to integral polytopes will be discussed. Joint work with Shabbir Ahmed, Santanu S. Dey, and Volker Kaibel.

A measure of centrality based on Random Walks

Random walks is a combinatorial notion that works great as a model of the flow of information in a graph. It has several applications in many domains, the most visible one being page rank. In this talk I will go through the basics of random walks, present a notion of centrality based on them, and show some of its properties.

Planning with Linear Temporal Logic Goals: Two Translation Approaches

Finite Linear Temporal Logic (fLTL) is a compelling language for representing goals in automated planning. The main approach for planning with these goals is to translate them into (regular) final-state goals, which are the standard supported by all highly engineered planning systems. In this talk I will present two of these translation approaches. The first one, which is worst-case exponential, builds an non-deterministic finite-state automaton for the fLTL formula, while the second builds an alternating automaton. The automata are then encoded in a standard planning language. Our empirical evaluation suggests that both of these approaches have strengths and weaknesses.

Dynamic Graph Queries

Graph databases in many applications---semantic web, transport or biological networks among others---are not only large, but also frequently modified. Evaluating graph queries in this dynamic context is a challenging task, as small changes in the database make old answers incomplete or incorrect. What’s more, the size of these applications makes reevaluation from scratch prohibitively expensive. It is conceivable, though, for answers to a query before and after small modifications to be closely related. One thus hopes to update the answer to a query in more a efficient way than recomputing it from scratch. Even more so if one stores auxiliary information that may ease the task, updating it accordingly. To what extent this is possible, and in which precise conditions, is the subject of dynamic computational complexity. In this talk I will go through the basics of the descriptive framework for dynamic complexity, where the goal is to keep the answer to a query on a input database updated by First Order queries (or core SQL) evaluated on the input database extended with auxiliary data. I'll exhibit the main features of dynamic programs on simple examples, with emphasis on queries for graphs, to then present preliminary results on dynamically maintaining traditional graph query languages (regular path queries and extensions thereof), pointing out interesting directions for further research (joint work with Nils Vortmeier and Thomas Zeume).

Fully Homomorphic Encryption: Blindfold Computations in the Cloud

Fully Homomorphic Encryption (FHE) schemes are probabilistic public key encryption schemes that allow anyone to compute functions of ciphertexts without any secret knowledge. For instance, given a FHE ciphertext C of a plaintext m, any user may learn another ciphertext C' which encrypts f(m), for a function f of his choice. This property has been referred to as the "Holy Grail of Cryptography" and it was first proved feasible using ideal lattices by C.Gentry in his 2009 Ph.D. thesis. We will first give an overview of his construction, and then jump in time to describe fancy properties of recent FHE schemes: identity or attribute based FHE schemes, Multikey FHE schemes, Key-homomorphic FHE schemes, and discuss some open problems and applications.

Visual-Semantic Graphs: Using Queries to Reduce the Semantic Gap in Web Image Retrieval

I will present the application of a graph representation to model similarity relationships that exist among images found on the Web. The resulting similarity-induced graph allows us to model in a unified way different types of content-based similarities, as well as semantic relationships. Content-based similarities include different image descriptors, and semantic similarities can include relevance user feedback from search engines. The goal of this representation is to provide an experimental framework for combining apparently unrelated metrics into a unique graph structure, which allows us to enhance the results of Web image retrieval. The approach was evaluated by re-ranking Web image search results.

Exploiting Semantic Analysis of Documents for the Domain User

Many document organization tasks, such as a student writing the related work chapter of a thesis, a professor surveying the state of the art in a proposal or planning a reading course, or a conference chair organizing sessions would be performed more efficiently through the use of document clustering. Fully unsupervised document clustering does not always yield clusters that are relevant to the user’s point of view. In this work, we pursue document clustering algorithms that allow the interactive engagement of the user in the clustering process. The main challenge is how to obtain useful clusters with minimum user effort. To address this challenge, we propose (1) a user-supervised double clustering algorithm, designed in three stages, and (2) a novel approach for mapping documents to entities and concepts. The user-supervised double clustering algorithm was demonstrated to be competitive to state-of-the-art clustering algorithms. It was further extended into an ensemble algorithm to incorporate Wikipedia concepts in the document representation. User supervision was introduced into these algorithms in the form of term supervision (term labelling) and document supervision. A visual interface was designed to make the algorithms accessible to real domain users. The work received the Best Student Paper award at ACM DocEng 2014. To address the problem of coming up with succinct and intuitive representations of documents in terms of entities and concepts, we have pursued two directions of research: (1) we designed a system that accomplishes entity recognition and disambiguation using the Wikipedia category structure in multiple languages. We are currently extending this system to concept recognition and disambiguation. Our system got the first prize in the ERD challenge at ACM SIGIR 2014; (2) we proposed a simple but very effective approach for computing semantic relatedness between words and documents based on the Google n-gram corpus, which is competitive to human performance on standard word pair data sets. The clustering work is joint with H. Nourashraf and D. Arnold, the ERD work with Marek Lipczak and Arash Koushkestani, and the Google n-gram based semantic relatedness with Aminul Islam and Vlado Keselj.

Understanding Real-World Events through Social Media

The enormous rise in the generation of content on the Web, has made it increasingly important to automatically organize and understand this information. With the unprecedented success of on-line social network platforms, such as Facebook, Twitter and Flickr, users have changed their roles from being only information consumers, to also becoming editors and publishers. In particular, Twitter as become a preferred source for breaking news to an important percentage of Web users. The streaming nature of this platform, facilitates real-time information dissemination, before its publication in traditional mass media. Moreover, not only objective facts regarding news are hared through social media, but also multiple opinions and points of view. Therefore, social media activity surrounding news events provides rich insight into human perception of world events. Nevertheless, the rapid flow of large data volumes in Twitter makes information volatile, producing the loss of potentially important historical data for users and for society in general as well. Following this motivation, in this presentation I will talk about the potential of user-generated content and social media data for modeling, summarizing and retrieving news events. I will present our current work in the direction of understanding events in social media as complex units of information, using sentiment, visualization and geo-temporal context.

Querying graph databases (a theoretician's perspective)

Several query languages for graph data have been considered in the past, ranging from relational approaches that supported extraction of attribute values, to regular path queries and their extensions that examine patterns connecting the nodes, but completely disregard the data they encounter. In this presentation we will introduce several languages that mix these two modes of querying thus allowing the user to explore how the actual data changes along paths and patterns connecting graph nodes. We will start with a brief discussion of the data model and show how different features can be coded in a simple theoretical model supporting values and labels. We then introduce a number of new query languages using ideas from automata theory and XML and examine some of their fundamental properties such as query evaluation and expressive power.


SPARQL is the official language for querying RDF datasets, and is recognized as one of the key technologies of the semantic web. This talks presents our study of SPARQL CONSTRUCT queries, that is, queries which output RDF graphs, and our advancement in a proposal for a general purpose recursion operator to be added to SPARQL. Contrary to other much more well studied SPARQL fragments, for CONSTRUCT queries we are able to show a clean connection to well-known logical and database formalisms, from first order logic and its positive fragment to data exchange settings. We use these connections to show other properties of CONSTRUCT queries, such as composability and complexity of evaluation. We use our framework to define a simple but powerful recursive operator to be added to SPARQL and to study its main properties. We finish this task by showing how to implement this operator as a plug-in on top of existing SPARQL engines and report its performance on real world datasets.

Characterizing tractable restrictions of query languages

One of the most fundamental problem associated to any query language is that of evaluation: Given a query Q, a database D, an a tuple t, decided whether t belongs to the answer of Q over D. As expected, this problem is computational intractable for most natural languages, for example, first-order queries and conjunctive queries. Thus a natural problem is to identify tractable restrictions of these languages. An even more challenging problem is to identify precisely all the tractable restrictions, that is, to characterize the tractable restrictions in terms of structural properties of queries. In this talk, we first review a known characterization of the tractable restrictions of conjunctive queries. Then we discuss how to apply these ideas to obtain new characterizations for well-designed SPARQL queries.

Epistemonikos, la base de Revisiones sistemáticas más grande del mundo

En medicina hoy es claro que hay que seguir un proceso llamado "medicina basada en la evidencia", que permite a los médicos y tomadores de decisión en salud decidir con la mejor información. Para ello, es clave contar un tipo de estudios llamados revisiones sistemáticas, que nos ayudan a responder preguntas como ¿sirve el medicamento X para la enfermedad Y?. Epistemonikos es la base de datos de revisiones sistemáticas en salud más grande del mundo. Contamos hoy con más de 55.000 revisiones sistemáticas, 350.000 artículos en total. Estos documentos están conectados, generando un gran grafo de información. En esta charla, se presentarán los principales desafíos que existen en la toma de decisión en salud, cuales de estos resuelve Epistemonikos y los desafíos técnicos que existen. Se dará acceso a los datos de Epistemonikos para los interesados.

Quantifying collective states from online social networking data

A significant fraction of humanity is now engaged in online social networking. Hundreds of millions of individuals publicly volunteer information on the most minute details of their experiences, thoughts, opinions, and feelings in real-time. This information is communicated largely in the form of very short texts, necessitating the development of text analysis algorithms that can determine psychological states from the content of sporadic and poorly formatted text updates. In this presentation I will provide an overview of the rapidly evolving domain of computational social science, which along with web science, is making significant contributions to our understanding of collective decision-making and social psychology. In our research we have developed tools to determine the collective states of online communities from large-scale social media data and have related these measurements to a variety of socio-economic indicators. We have shown how fluctuations of collective mood states match trends in the financial markets, used longitudinal data on the fluctuations of individual mood states to study mood homophily in social networks, and investigated how measures of online attention may yield new indicators of scholarly communication.

The Hearts and Minds of Data Science

Thanks in part to the recent popularity of the buzzword "big data," it is now generally understood that many important scientific breakthroughs are made by interdisciplinary collaborations of scientists working in geographically distributed locations, producing and analyzing vast and complex data sets. The extraordinary advances in our ability to acquire and generate data in physical, biological, and social sciences are transforming the fundamental nature of science discovery across domains. Much of the research in this area, which has become known as data science, has focused on automated methods of analyzing data such as machine learning and new database techniques. Less attention has been directed to the human aspects of data science, including how to build interactive tools that maximize scientific creativity and human insight, and how to train, support, motivate, and retain the individuals with the necessary skills to produce the next generation of scientific discoveries. In this talk, I will argue for the importance of a human-centered approach to data science as necessary for the success of 21st century scientific discovery. Further, I attest that we need to go beyond well-designed user interfaces for data science software tools to consider the entire ecosystem of software development and use: we need to study scientific collaborations interacting with technology as socio-technical systems, where both computer science and social science approaches are interwoven. I will discuss promising research in this area, describe the current status of the Moore/Sloan Data Science Environment at UW, and speculate upon future directions for data science.

Provenance Circuits for Trees and Treelike Instances

Query evaluation in monadic second-order logic is known to be tractable on trees and treelike instances, even though it is hard for arbitrary instances. This tractability result has been extended to several tasks related to query evaluation, such as counting query results and query evaluation on probabilistic trees. We see counting and probability evaluation as two examples of the more general problem of computing augmented query output, that is referred to as provenance. In this talk I will present a provenance framework for tree and treelike instances, by describing a linear-time construction of a circuit provenance representation for MSO queries. We show how this provenance can be connected to the usual definitions of semiring provenance on relational instances, even though we compute it in an unusual way, using tree automata; we do so via intrinsic definitions of provenance for general semirings, that are independent from the operational details of query evaluation. We show the applications of this provenance to capture existing counting and probabilistic results on trees and treelike instances, and give novel consequences for probability evaluation.

Pebble -- a data structure implementation that makes your big data small

Pebble is a project with the vision of building a key value store that allows the storage and retrieval of big data in main memory, instead of distributed systems on disk. Making simpler, faster and cheaper to access and process this data. During the presentation will be exposed the motivations behind this project and current applications. Will be described the algorithm behind the implementation of Pebble, few practical examples will be presented and future work will be discussed. Pebble is based on the work made by Paolo Boldi and Sebastiano Vigna with WebGraph. A reference to their work can be found on this paper: "Paolo Boldi and Sebastiano Vigna. The WebGraph framework I: Compression techniques. In Proc.of the Thirteenth International World Wide Web Conference (WWW 2004), pages 595−601, Manhattan, USA, 2004. ACM Press"

Traversal-Based Linked Data Queries

The emergence of Linked Data on the WWW has spawned research interest in an online execution of declarative queries over this data. A particularly interesting approach is traversal-based query execution which fetches data by traversing data links and, thus, is able to make use of up-to-date data from initially unknown data sources. The downside of this approach is the delay before the query engine completes a query execution. This talk presents our work on addressing this problem based on an approach to return as many elements of the result set as soon as possible. The basis of this approach is a traversal strategy that aims to fetch result-relevant data as early as possible. To this end, we introduce a diverse set of heuristics-based traversal strategies, study their impact on response times, and compare them to a baseline that resembles a breadth-first traversal.

Pushing the Boundaries of Tractable Ontology Reasoning

On the theoretical side, we introduce a class of Horn ontologies, namely RSA ontologies, for which standard reasoning tasks are tractable. This class is general enough to include the OWL 2 EL, QL, and RL profiles. Furthermore, we extend existing combined approaches to solve CQ answering over RSA ontologies, which we show to be feasible in NP. On the practical side, we show that many real-world ontologies that are not included in any OWL 2 profile are indeed RSA, and thus that polynomial time reasoning is possible for these. We also implement a rather efficient reasoner for RSA ontologies making use of existing Datalog engines. See the following for more information:

Decidability of Conjunctive Queries for Description Logics with Counting, Inverses and Nominals

Description Logics (DLs) are Knowledge Representation formalisms with a great significance for Semantic Web technologies. For instance, the ontology specification languages OWL DL and its successor OWL 2 DL, standardized by the World Wide Web Consortium (W3C), are based on DL languages. Conjunctive queries constitute the standard querying paradigm for data bases and have in recent years attracted increasing interest in the area of logic-based knowledge representation. While decidability and computational complexity are mainly clarified for standard DL reasoning tasks (such as knowledge base consistency or axiom entailment), decidability of conjunctive query answering in OWL DL and OWL 2 DL is still an open problem. Over a comparatively long time, a major obstacle towards the solution of this problem was the intricate interplay of three modeling features: nominal concepts, inverse roles and cardinality constraints. In my talk, I will present results that, for the first time, establish decidability of a DL containing all three constructs. For a slightly restricted class of conjunctive queries (i.e., queries with only simple roles), our result also shows decidability in the logics that underpin OWL DL and OWL 2 DL. This work was carried out in the course of a research stay at Oxford University and has been published in the Journal of Artificial Intelligence Research [1]. [1] Sebastian Rudolph, Birte Glimm: Nominals, Inverses, Counting, and Conjunctive Queries or: Why Infinity is your Friend! J. Artif. Intell. Res. (JAIR) 39: 429-481 (2010), available via

Dynamics in Linked Data Environments

The Linked Data best practices for data publishing encourage the use of RDF to describe URI-identified resources on the Web. As those resources reflect things in the real world, which is without a doubt dynamic, the dynamics of Linked Data should not be neglected. In this talk, I present a study on the evolution of Linked Data resources on the web, the Dynamic Linked Data Observatory, which has so far collected 2½ years of weekly Linked Data snapshots. Also, I will introduce read/write Linked Data that enables interaction with resources described using Linked Data -thus introducing dynamics- with the emphasis on how read/write Linked Data can be facilitated using REST, the architecture paradigm of the web. Last, I will show how we used the technologies of Linked Data and REST and the rule language Linked Data-Fu to build a reactive information system incorporating the data integration power of Semantic Technologies.

Monadic datalog on trees

Among logics with fixpoint capabilities, one of the most prominent is datalog, obtained by adding fixpoint operator to unions of conjunctive queries (positive existential first order formulae). Datalog originated as a declarative programming language, but later found many applications in databases as a query language. Classical static analysis problems of containment, equivalence and boundedness have been widely studied on arbitrary structures. All these problems are undecidable. Since these problems are subtasks in important data management tasks, like query minimization, data integration, or data exchange, the negative results for full datalog fuel interest in restrictions. For example when restricting to monadic datalog programs the mentioned problems become decidable. It is also known that the containment problem is decidable for monadic datalog programs on trees. Recently datalog has been studied on tree databases that carry labels from an infinite alphabet that can be tested for equality. I will present some of our new results focusing on the containment problem for monadic datalog. This problem is in general undecidable but I will show some decidable fragments.

CloudMdsQL: Querying Heterogeneous Cloud Data Stores with a Common Language

The blooming of different cloud data management infrastructures, specialized for different kinds of data and tasks, has led to a wide diversification of DBMS interfaces and the loss of a common programming paradigm. The CoherentPaaS project addresses this problem, by providing a common programming language and holistic coherence across different cloud data stores. In this talk, I will present the design of a Cloud Multi-datastore Query Language (CloudMdsQL), and its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational and NoSQL) within a single query that may contain embedded invocations to each data store’s native query interface. Thus, CloudMdsQL unifies a quite diverse set of data management technologies while preserving the expressivity of their local query languages. Our experimental validation, with three data stores (graph, document and relational) and representative queries, shows that CloudMdsQL satisfies the five important requirements for a cloud multidatabase query language.

Certain Answers to Well-Designed SPARQL Queries

The need to answer queries over many possible worlds arises in many different settings such as data integration/exchange or in ontology based query answering. The generally agreed approach to query answering in such settings is to compute the so-called certain answers, i.e., answers that one gets by answering the query over any possible world. In this talk, we study the problem of computing certain answers of well-designed SPARQL queries under OWL2-QL entailment. The main challenge comes from the non-monotonicity of the OPTIONAL operator. We thus first define an intuitive semantics of certain answers based on the subsumption relation. To actually compute the certain answers of well-designed SPARQL queries under OWL2-QL entailment, we extend the rewriting-based approach of Calvanese et al. to evaluating conjunctive queries w.r.t. DL-Lite.

Lexicographic orderings

Each countable linear ordering may be represented as the lexicographic ordering of a language over some alphabet. But which linear orderings can be represented by regular, context-free, or deterministic context-free languages? Can we decide whether the lexicographic ordering of a regular or (deterministic) context-free language is a well-ordering, or a scattered linear ordering, or a dense ordering? Do these linear orderings have decidable first-order or monadic second-order theories? Is it decidable whether the lexicographic orderings of two given regular or context-free languages are isomorphic? In the talk I will present answers to the above questions.

SAMOA: A Platform for Mining Big Data Streams

Social media and user generated content are causing an ever growing data deluge. The rate at which we produce data is growing steadily, thus creating larger and larger streams of continuously evolving data. Online news, micro-blogs, search queries are just a few examples of these continuous streams of user activities. The value of these streams relies in their freshness, and relatedness to ongoing events. However, current (de-facto standard) solutions for big data analysis are not designed to deal with evolving streams. In this talk we introduces SAMOA (Scalable Advanced Massive Online Analysis), a platform for mining big data streams. SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification and clustering, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. It is written in Java and is available at under the Apache Software License version 2.0.

Expressive queries over light-weight ontologies

Light-weight Description Logics, such as DLLite, found their applications in practical tasks like Ontology Based Data Access and OWL 2 entailment regimes. These applications require solving not only classical logical problems, like satisfiability and entailment, but also answering database-style queries over knowledge bases. The theory of conjunctive query answering in this setting is now relatively comprehencive. But the real-world applications require a possibility to pose more complex queries over ontologies. In this talk I describe the state of the art in query answering of extensions of conjunctive queries when posed over knowledge bases, such as aggregate queries, queries with different forms of negation and graph-databases style queries.

Techniques to solve computationally hard problems in automata theory

We consider nondeterministic finite automata (NFA) and nondeterministic Buchi automata (NBA), which describe regular languages and omega-regular languages, respectively. Central problems for these are computationally hard (PSPACE-complete), e.g., checking language inclusion, equivalence and universality, as well as finding automata of minimal size for a given language. In spite of the high worst-case complexity, recent algorithms and software-tools can solve many instances of nontrivial size. Here we give an overview over these techniques. They include antichain techniques exploiting logical subsumption, the construction of congruence bases, and automata minimization methods based on transition pruning and state-space quotienting w.r.t. generalized simulation preorders. In particular, multipebble simulations and the more practical lookahead simulations can in polynomial time compute very good under-approximations of the PSPACE-complete relations of (trace-)language inclusion. A collection of current papers and tools is available at

Practical aspects of the research carried out at DAMA-UPC

This talk will review the most significant highlights of the research carried out at DAMA-UPC in the recent months. We will take a look at how Sperksee, the high performance Graph Database has been implemented, and how different technologies have been created around it to grow its capabilities both in terms of new DB functionalities and in terms of new applications being developped. We will take an overview of the Architecture of Sparksee, the ideas behind SCD (the Scalable Community Detection, fastest and most accurate community detection algorithm up to date), Tweeticer (the new tool for understanding the impact of Tweets in Twitter), benchmarking of Graph technologies and other software that has been in production over the last year, leading to three FP7 European Commission funded projects that the group, in collaboration with its Spin out Sparsity Technologies (, are managing at this moment.

Stochastic rule-based modelling of biological systems using a parallel implementation of spatial Kappa

The Stochastic Simulation Algorithm (SSA) is a method to model the behaviour of a system whose constituent elements, modelled as particles, interact by random collisions. Originally introduced by D. Gillespie in 1977, SSA is currenlty used to model stochastic processes in biology such as chemical reactions, cellular signalling, gene regulation and metabolic pathways. During this talk, we will present PISKA, a parallel implementation of the SSA to execute spatially-explicit models coded in the Kappa formal language. Some biological models and their simulations using PISKA will be discussed. Among other important issues, time-travel and multiple clock synchronisation processes will be also presented.

Representaciones compactas de grafos web, redes sociales y RDF

Representar en forma compacta un grafo, permitiendo operaciones de navegación en el formato comprimido, es útil para ejecutar algoritmos de análisis de grafos sobre grandes redes en memoria principal. La representación compacta de familias de grafos requiere descubrir sus regularidades para poder explotarlas, es decir, de cierto modo comprender cómo son este tipo de grafos. En esta charla hablaré de los resultados que hemos obtenido en la compresión de grafos web, de redes sociales de distinto tipo, y de repositorios RDF.

R2RML: Standard RDB2RDF Mapping Language

Upgrading legacy relational database into RDF datasets plays an important role in the emergence of Web of Data. In order to do so, a mapping language that specifies the transformation rules is needed. In this talk, we will see R2RML, a W3C Recommendation for RDB2RDF mapping language, study its main concepts and some examples.

Contexts and Multidimensional Data Quality Assessment and Cleaning

Data quality and data cleaning are context dependent activities. Starting from this observation, in previous work a context model for the assessment of the quality of a database instance was proposed. In that framework, the context takes the form of a possibly virtual database or data integration system into which a database instance under quality assessment is mapped, for additional analysis and processing, enabling quality assessment. In this work we extend contexts with dimensions, and by doing so, we make possible a multidimensional assessment of data quality assessment. Multidimensional contexts are represented as ontologies written in Datalog±. We use this language for representing dimensional constraints, and dimensional rules, and also for doing query answering based on dimensional navigation, which becomes an important auxiliary activity in the assessment of data.

Semantic Annotators - how good are they, and how can we make them better?

This will be an informal talk discussing a line of research in Oxford on using semantic annotation services. The talk will overview what these services are, and discuss what we have discovered experimentally about their behavior. It will then go over some approaches we have developed for increasing the reliability of annotators, embedded in the ROSeAnn system for integrating and reconciling semantic annotations. This is joint work with Luying Chen, Giorgio Orsi, and Stefano Ortona, to be presented at VLDB’14.

Generating knowledge from online social networks

The rise of online social networks has massified the notion that Web users are active publishers of their own content. Users now constantly share pictures, videos and their objective and subjective perceptions of the world they live in. In particular, microblog platforms such as Twitter have made online social networks an extremely rapid environment for disseminating, almost in real-time, news and personal experiences. The democratization of content publication has brought several new challenges, among these, the analysis of large volumes of streaming data and the understanding of higher level abstractions such as sentiment, opinions and events. In this talk, I will discuss some examples of how social knowledge can be extracted from microblog content, such as, credibility prediction, event detection, and the analysis of cultural differences between countries.

Querying Graph Databases

Graph databases have gained renewed interest in the last years, due to their applications in areas such as the Semantic Web and Social Networks Analysis. We study the problem of querying graph databases, and, in particular, the expressiveness and complexity of evaluation for several general-purpose navigational query languages, such as the regular path queries and its extensions with conjunctions and inverses. We distinguish between two semantics for these languages. The first one, based on simple paths, easily leads to intractability in data complexity, while the second one, based on arbitrary paths, allows tractable evaluation for an expressive family of languages. We also study two recent extensions of these languages that have been motivated by modern applications of graph databases. The first one allows to treat paths as first-class citizens, while the second one permits to express queries that combine the topology of the graph with its underlying data.

Benchmarking RDF and Graph Databases

With inherent support for storing and analyzing highly interconnected data, graph and RDF databases appear as natural solutions for developing linked open data applications. However, current benchmarks for these database technologies do not fully attain the desirable characteristics in industrial-strength benchmarks (e.g. relevance, verifiability, etc.) and typically do not model scenarios characterized by complex queries over skewed and highly correlated data. The Linked Data Benchmark Council (LDBC) is an EU project that aims the development of industrial-strength benchmarks for graph and RDF databases. The Social Network Benchmark (SNB) is an LDBC benchmark intended to test various functionalities of systems used for graph data management. The objective of this talk is to present the status of the SNB and to describe the problems concerning its development.

The Anatomy of a Large-Scale Semantic Web Search Engine

In this talk, I will try to paraphrase the goals of the nucleus centre (as I perceived them from the introductory talk) based on my own complementary experience of research on semantic search. I will try to sketch the design of a system that can enable the type of semantic search that was put forward in the funding pitch. I will then try to give an impression of why such a system does not already exist: for each component of the system, I will cover the main challenges, what is the state of the art (including work I've already been involved in), and what are the research questions to tackle going forward. Though I doubt we can make Google bankrupt by 2017, I do hope that the talk can help to identify important research questions in the area ... questions that hopefully match up with our joint expertise.