Summary of Dr. Matthias Bogaert’s work:

Social media data are becoming increasingly central to firms’ efforts to understand buyers and develop effective marketing strategies. The reasons are manifold. First, social media buzz has proven to have a significant impact on key customer metrics, such as customer spending, cross-buying, and profitability. Second, the volume of social media data is unprecedented: Facebook alone has more than 2 billion users, a staggering 25% of the world population. Finally, social media data contain a great deal of information about users’ preferences and characteristics. Companies that want to advertise on social media can adopt two main strategies. The first is an organic strategy, in which companies try to stimulate word-of-mouth by paying for more organic reach and/or by optimizing their social media content. The second is a one-to-one strategy, which focuses on identifying the users who are most likely to buy a product and targeting them directly with personalized ads. To implement such a one-to-one strategy, it is important to know whether social media data have predictive value. The goal of this dissertation is thus to harness the predictive capacity of social media data at different levels of analysis.

Chapter 2 investigates, at the user level, whether Facebook friends data have added value in event attendance prediction. The findings show that Facebook friends data significantly improve event attendance models in a majority of the cases. Moreover, we find that the number of friends attending the event is one of the top indicators of event attendance. Chapter 3 focuses on the network level and investigates whether disaggregated variables can predict romantic partnership on Facebook. The results reveal that it is possible to predict somebody’s significant other with high predictive accuracy. We also show that disaggregated variables, such as comments and likes on photos and videos, are among the top predictors of romantic partnership. Chapter 4 is situated at the most aggregate level, namely product performance, and studies which social media platform (Facebook or Twitter) is the most predictive of movie sales. The results indicate that Facebook is significantly more indicative of movie sales than Twitter. They also show that user-generated content does not significantly increase the predictive power of models based on marketer-generated content and page popularity indicators of the Facebook and Twitter pages.
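
To make the kind of added-value comparison described for Chapter 2 concrete, the short Python sketch below compares a model with and without an extra feature block (for example, Facebook friends data) using cross-validated AUC. The column names, the Random Forest classifier and the AUC criterion are illustrative assumptions, not the dissertation’s actual code.

```python
# Illustrative sketch (not the dissertation's code): measuring the added value of an
# extra feature block, e.g. Facebook friends data, for a binary attendance model.
# Column names such as "n_friends_attending" are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def added_value(df: pd.DataFrame, base_cols, extra_cols, target="attended"):
    """Compare cross-validated AUC with and without the extra feature block."""
    y = df[target]
    model = RandomForestClassifier(n_estimators=500, random_state=1)
    auc_base = cross_val_score(model, df[base_cols], y, cv=5, scoring="roc_auc").mean()
    auc_full = cross_val_score(model, df[base_cols + extra_cols], y, cv=5, scoring="roc_auc").mean()
    return auc_base, auc_full

# Hypothetical usage:
# base = ["age", "gender", "past_events_attended"]
# extra = ["n_friends_attending", "share_of_friends_attending"]
# print(added_value(events_df, base, extra))
```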

Dr. Matthijs Meire

(Public Defense on June 19, 2018, co-advisor: Prof. Dr. Michel Ballings)

His work is titled A Marketing Perspective on Social Media Usefulness.

Summary of Dr. Matthijs Meire’s work:

Social media represent all internet-based applications in which customers can create and share content and interact with each other; the most well-known examples are Facebook, Twitter, Instagram, Snapchat and blogs. Nowadays, social media are also used by companies as part of their marketing mix, with the main advantages being the possibility to interactively engage and connect with their customers. Despite this growing interest in social media and the large investments involved, the return on these investments is still debated and academic research on the effects of social media marketing is lagging. Chapter 1 focuses on some of the gaps that still exist, explains how this dissertation aims to contribute to the literature and shows how social media can contribute to business value. More specifically, we answer the following three questions in this dissertation: (1) How can we provide more accurate estimations of sentiment in online word-of-mouth? (2) How do social media and customer sentiment impact customer value to the firm? (3) How can business-to-business (B2B) firms use social media optimally within the sales cycle? The three studies in this dissertation are related in that they each use Facebook information as the main source of information, and in that they extend the analytical toolset available for the management of customer relationships. For the first study (chapter 2), we start from specific user information (Facebook posts). For the second study (chapter 3), we combine individual user information with marketer-generated content from Facebook pages (company information). Finally, in study 3 (chapter 4), we exclusively focus on Facebook pages and company information to set up a business-to-business acquisition prediction system.

In chapter 2, we investigate how automatic sentiment detection on social media can be improved. Social media offer a lot of potential for marketers to retrieve information about customers. However, most of this information is unstructured, and its meaning has to be inferred in some way. We focus exclusively on textual content, more precisely Facebook posts, and aim to discover and predict the sentiment of these posts. We start from a broad baseline sentiment classification model, based on the extensive previous literature, and we suggest two types of extra variables to complement it. The first type comprises leading variables, by which we mean information that is available before the actual content is posted. Examples are sentiment in previous posts and general user information such as demographics. This information also allows us to look at deviations from ‘normal’ posting behavior to detect changes in sentiment. The second type comprises lagging variables, which contain information that becomes available only some time after a post has been published. The most noteworthy examples are the likes and comments a post has gathered after, for instance, 7 days. We distinguish these information types because leading variables could be used in real-time sentiment classification, while lagging variables will never be available in real time. We subsequently build three sentiment classification models, using Random Forest models with 5×2-fold cross-validation, in order to evaluate the added value of the leading and lagging variables. The results show that both leading and lagging variables create significant and relevant value over and above the baseline model.
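
As a rough illustration of this evaluation set-up, the Python sketch below compares nested feature sets with a Random Forest under 5×2-fold cross-validation. The feature-column groupings and the AUC criterion are assumptions made for illustration; this is not the dissertation’s code.

```python
# Minimal sketch of the 5x2-fold evaluation idea described above. "text_cols",
# "leading_cols" and "lagging_cols" are hypothetical column lists.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def evaluate(df: pd.DataFrame, feature_cols, target="sentiment_positive"):
    cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=7)  # 5x2-fold CV
    model = RandomForestClassifier(n_estimators=500, random_state=7)
    scores = cross_val_score(model, df[feature_cols], df[target], cv=cv, scoring="roc_auc")
    return scores.mean(), scores.std()

# Three nested models, as in the text:
# baseline            -> evaluate(posts_df, text_cols)
# baseline + leading  -> evaluate(posts_df, text_cols + leading_cols)
# most complete model -> evaluate(posts_df, text_cols + leading_cols + lagging_cols)
```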
It turns out that deviations from ‘normal’ posting behavior, as well as comments and likes, substantially increase our models’ performance. We also see that the traditional textual information and the leading and lagging information are complementary and all add to model performance in the most complete model. These results have high practical and academic value: valence is commonly used in marketing because of its demonstrated relationship with sales, which makes it important to measure valence correctly. Furthermore, consumer sentiment or satisfaction about a brand can be deduced from social media.

Customer touchpoints are all occasions on which customers can relate to a firm, and comprise both passive (e.g., seeing advertisements) and active (e.g., purchasing) moments. In chapter 3, we link the outcome of such customer touchpoints to online customer sentiment measured through Facebook comments. Moreover, we propose that (online) marketer-generated content, following the specific touchpoint, can moderate the impact of the touchpoint’s outcome on the subsequently displayed sentiment. Finally, we link individual customer sentiment to direct engagement (also known as customer lifetime value (CLV)), in combination with several control variables linked to customer-firm interaction data. For this research, we compiled a unique dataset featuring an unprecedented set of brand-related customer-level social media activity metrics, transaction variables at the customer level, variables capturing objective performance characteristics of the customer touchpoint, and other marketing communication variables. Using a two-stage model, in which we first model customer sentiment with a generalized linear mixed-effects model, followed by a Type II Tobit model for engagement, we show that marketer-generated content is able to influence customer sentiment following more negative service encounters, and that customer sentiment is related to direct engagement, even when traditional control variables are included. Finally, this research also shows that the most used Facebook metric, a page like, has no significant effect on direct engagement.

Most of the current research focuses on social media usage in Business-to-Consumer (B2C) environments, with a focus on the interactivity of conversations and the potential value of electronic word-of-mouth. In the final chapter, we investigate how Business-to-Business (B2B) organizations can use social media in their sales processes. Indeed, businesses create social media content, and this information can subsequently be used by other companies in their acquisition processes. We propose a customer acquisition prediction model that qualifies a company’s prospects as potential customers. The model compares social media (Facebook) information on the prospect with two other data sources, web page information and commercially purchased information, and we test the model with a large-scale experiment at Coca-Cola Refreshments, Inc. The results show that Facebook information is the most informative, but that it is complementary to the information from the other data sources. Moreover, this research shows how the modeling efforts can benefit from an iterative approach, and we demonstrate the financial benefits of our newly devised approach. To summarize, in this dissertation we were able to respond to some relevant and important questions related to marketing and its interaction with social media, thereby delivering both theoretical and practical contributions.
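
For readers unfamiliar with the chapter-3 second stage, the sketch below shows one common way to estimate a Type II Tobit (sample-selection) model, namely the Heckman two-step approximation. The variable names are hypothetical, the first-stage generalized linear mixed-effects model for sentiment is omitted, and this is not necessarily how the dissertation estimates the model.

```python
# Heckman two-step as an approximation of a Type II Tobit model: a probit first
# decides whether any direct engagement is observed, and an OLS stage with the
# inverse Mills ratio then models the engagement value of those who do engage.
import pandas as pd
import statsmodels.api as sm
from scipy.stats import norm

def heckman_two_step(df, selection_cols, outcome_cols,
                     selection="engaged", outcome="engagement_value"):
    # Stage 1: probit for whether any engagement is observed.
    Z = sm.add_constant(df[selection_cols])
    probit = sm.Probit(df[selection], Z).fit(disp=False)
    xb = Z @ probit.params                                           # linear predictor
    mills = pd.Series(norm.pdf(xb) / norm.cdf(xb), index=df.index)   # inverse Mills ratio
    # Stage 2: engagement equation on the selected sample, corrected for selection.
    sel = df[selection] == 1
    X = sm.add_constant(df.loc[sel, outcome_cols].assign(imr=mills[sel]))
    return probit, sm.OLS(df.loc[sel, outcome], X).fit()
```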

Dr. Dauwe Vercamer

(Public Defense on June 14, 2018)

His work is titled Integration of Customer Data in a Fleet Scheduling Optimization Problem.

Summary of Dr. Dauwe Vercamer’s work:

In the last decade, analytical methods for data analysis have become ubiquitous in the decision-making processes of big companies. Typically, this data analysis consists of three distinct phases. The first is descriptive analytics, in which historical data are assessed and underlying patterns are discovered. Predictive analytics focuses on the underlying reasons and rules for those patterns and tries to leverage those rules to predict future outcomes. The last phase is prescriptive analytics, in which the predictions are used to make the best possible decisions. While descriptive analytics is used most often, its added value is the smallest. Prescriptive analytics, on the other hand, is used much less but has the potential to deliver much higher value. This dissertation applies each of these components throughout the different chapters and aims to find out the additional value of integrating the different phases.

The first study focuses mostly on descriptive and predictive analytics. It is applied in the utilities industry and looks specifically at Automated Metering Infrastructure. These meters are increasingly being installed at customer locations and enable companies to understand the power consumption behavior of their customers. This data is often aggregated into Typical Daily Profiles (TDP) and used for tariff setting and demand-response modeling [25]. To do this properly, two components are necessary: (i) the TDP have to be clustered smartly so as to avoid targeting each customer individually and (ii) the clusters need meaningful descriptions so that new customers, or customers for whom no metering data are available, can be assigned to one of them. This study addresses the latter. This post-clustering phase [25] has received little attention in the literature. Specifically, commercial, governmental and open data are added to internal company data in order to predict these TDP clusters. This is done with machine learning methods such as Random Forest [19] and Stochastic Boosting [50]. The approach was tested on data of 6000 GDF-SUEZ SME customers and resulted in six different TDP. The results show that a specific combination of commercial data with internal data and public cartographic data has the highest accuracy in predicting these six groups.

The second study dives deeper into prescriptive methods. It focuses on decision making applied to large-scale Vehicle Routing Problems (VRP). The algorithm uses a smart clustering of customers based on Hierarchical Clustering [133] to reduce the search neighborhood. This clustering is used in two ways: firstly, to create regions of customers and, secondly, to create small groups of neighboring customers that are further treated in the algorithm as one single customer. Additionally, a penalty is applied to create more compact routes. One of the advantages of compact routes is that when some customers are not at home on a first visit, it is much easier to revisit them at the end of the day. The results indicate that our algorithm is competitive with other large-scale routing algorithms on benchmark instances. Furthermore, they show that our clustering methods are very effective in speeding up the algorithm while losing only minimal solution quality. Finally, our compactness penalty ensures that the revisit times for customers who were not at home are significantly reduced.
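
As an illustration of the clustering step in the second study, the sketch below groups neighbouring customers with hierarchical clustering so that each group can later be treated as a single customer in the routing algorithm. The coordinate layout, the linkage choice and the distance threshold are assumptions for illustration, not the dissertation’s implementation.

```python
# Hierarchical clustering on customer coordinates to form small groups of
# neighbours ("super-customers") that shrink the VRP search neighborhood.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def group_customers(coords: np.ndarray, distance_threshold: float = 2.0):
    """coords: (n, 2) array of customer x/y coordinates (e.g., in km)."""
    Z = linkage(coords, method="ward")
    labels = fcluster(Z, t=distance_threshold, criterion="distance")
    # Each group is later represented by a single point (its centroid) in the routing step.
    centroids = np.array([coords[labels == g].mean(axis=0) for g in np.unique(labels)])
    return labels, centroids

# Example: 1,000 random customer locations grouped into small neighbourhoods.
# labels, centroids = group_customers(np.random.rand(1000, 2) * 50)
```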
The downside of such compact routes is that they require a much higher number of vehicles to visit the same number of customers, driving overall costs up. Compactness is therefore mostly useful in situations where uncertainty about whether a customer will be home is high.

The last study integrates predictive and prescriptive analytics. It builds upon the previous one, but dives deeper into the profitability of the routes. To increase route profitability, it is necessary to integrate the prediction (or forecast) of a customer’s contribution margin with the prescriptive algorithm. Specifically, the study evaluates the value of auxiliary data for improving total profitability. Most papers on VRP with profits use time-series-based values in their prescriptive models. The results of this study confirm that adding additional information on top of that significantly increases the accuracy of the contribution predictions as well as total profits. It is also clear that one cannot just look at the accuracy of the predictive method to make decisions: the results show that some predictive methods with higher accuracy fared worse when looking at total profits. The study also indicates that the typical management heuristic of making sequential decisions (first predicting the contribution margin, eliminating some customers based on it, and then routing the remaining customers) performs worse than making integrated decisions. The reason is that additional customers decrease the relative cost as density increases; likewise, a less profitable customer living next to a profitable customer can be worth visiting because the additional cost of the visit becomes very low. As a general conclusion, this dissertation argues that all phases of the analytical methods are important, but that maximum value is created when the different phases are well integrated.
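
The intuition behind the integrated decision can be illustrated with a toy calculation: a customer is worth adding to a route whenever the predicted contribution margin exceeds the marginal cost of inserting that customer between two stops that are already served. The coordinates and cost parameter below are invented purely for illustration.

```python
# Toy illustration of the integration argument: compare a customer's predicted
# contribution margin with the marginal (insertion) cost of visiting them.
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def insertion_cost(prev_stop, next_stop, candidate, cost_per_km=1.5):
    """Extra cost of detouring via the candidate between two existing stops."""
    detour = dist(prev_stop, candidate) + dist(candidate, next_stop) - dist(prev_stop, next_stop)
    return detour * cost_per_km

# A low-margin customer right next to an existing route can still be worth a visit:
# predicted_margin = 4.0                                   # from the predictive model
# extra_cost = insertion_cost((0, 0), (10, 0), (5, 0.2))   # tiny detour
# visit = predicted_margin > extra_cost
```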

Dr. Steven Hoornaert

(Public Defense on March 28, 2018)

His work is titled The value of social media for marketing and innovation.

Summary of Dr. Steven Hoornaert’s work:

Social media in 2018 are more than a communication channel for firms to interact with customers; they can also provide firms with tools to track their brand health, explore opportunities for innovation, and track customer satisfaction, among other things. Yet, faced with the volume and variety of the unstructured content (i.e., text) of social media, firms are hesitant to explore the value of these data and find limited academic and practitioner support on how to unlock this value for their business. The aim of the current dissertation is to show how social media data can be used for two firm-specific outcomes that directly influence the firm’s future and current bottom line: idea selection for innovation and product consumption. This dissertation responds to this gap in academic research in three essays: the first essay proposes an idea selection support system that uses latent semantic indexing to capture information from idea suggestions on crowdsourcing platforms, while the second and third essays capture information through a lexicon-based approach on the social networking site Twitter.

In Chapter II, Identifying New Product Ideas: Waiting for the Wisdom of the Crowd or Screening Ideas in Real Time, we investigate consumer ideas from an online crowdsourcing community called Mendeley. Unlike one-off initiatives (e.g., idea contests), the design of the platform makes it possible to capture information beyond the text of the idea suggestion (content): the feedback of community members on the suggestion (crowd) and information on the community member suggesting the idea (contributor). We label these three components jointly as the “3Cs” of idea selection. Whereas content and contributor information is available immediately after the idea suggestion, it takes time and effort of the crowd to evaluate these ideas. This raises the question of whether the firm gains sufficient value in waiting for the crowd’s evaluation to assess which ideas it should implement. Across multiple methods, our results show that including crowd feedback improves the identification of implemented ideas by between 31.7% and 61.0% over ranking by votes, between 16.6% and 42.5% over ranking by comments, and between 48.6% and 81.6% over random idea selection. Additionally, we find that the predictive performance of crowd feedback is much higher than that of content and contributor information. We find that ideas should surpass an initial threshold in order to facilitate implementation: an idea that receives at least one vote substantially increases its odds of implementation, but receiving more than one vote does not increase this likelihood further. For idea content, ideas very similar to previous ideas (less distinctive) and ideas very dissimilar to previous ideas (more distinctive) are more likely to be implemented, whereas ideas stuck in the middle are less likely to be implemented.

In Chapter III, The Dynamics between Social Media Engagement, Firm-generated Content, and Live and Time-shifted TV Viewing, we investigate customer engagement as a dynamic, iterative process in the context of the TV industry. We propose a theoretical framework involving the central constructs of brand actions, customer engagement behaviors (CEBs), and consumption. Brand actions of TV shows include advertising and firm-generated content on social media (FGC). CEBs include the volume, valence, and richness of user-generated content on social media (UGC). Consumption comprises live and time-shifted TV viewing.
Using a sample of new TV shows introduced in 2015, we estimate a simultaneous equations model to operationalize our framework. We find that advertising efforts initiated by the TV show have a positive effect on time-shifted viewing, but a negative effect on live viewing. Tweets posted by the TV show (FGC) have a negative effect on time-shifted viewing and no effect on live viewing. Finally, negative sentiment in tweets posted by viewers (UGC) reduces time-shifted viewing, but increases live viewing.

In Chapter IV, The Effects of Program Scheduling and Paid, Owned, and Earned Media on TV Viewing, we investigate how paid (advertising), owned (firm-generated content; FGC), and earned media (user-generated content; UGC) influence live ratings after controlling for program scheduling characteristics: the competitive program scheduled at the same time as the focal program, the lead-in program scheduled before the focal program, and the lead-out program scheduled after the focal program. Our paper is the first to investigate the effects of paid/owned/earned media combined with lead-in, lead-out, and competition ratings. We estimate a dynamic panel model using the Generalized Method of Moments framework. Across these models, our results show a consistently significant positive effect of the rating of the lead-in program on the rating of the focal program. Interestingly, we find no significant effect of social media or advertising after controlling for the dynamic effect of a program’s rating. However, after controlling for lead-in, lead-out, and competition, we observe a significant negative effect of positive valence. We discuss the implications of these results.
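
To illustrate the content features used in Chapter II, the sketch below implements latent semantic indexing as TF-IDF followed by truncated SVD and combines the resulting topic scores with simple crowd features to rank ideas. The field names, the logistic regression ranker and the number of topics are illustrative assumptions rather than the dissertation’s actual method.

```python
# LSI content features (TF-IDF + truncated SVD) combined with crowd features
# (votes, comments) to score ideas for likely implementation.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

def rank_ideas(ideas: pd.DataFrame, n_topics: int = 50):
    """ideas needs hypothetical columns: 'text', 'votes', 'comments', 'implemented' (0/1)."""
    tfidf = TfidfVectorizer(min_df=2, stop_words="english").fit_transform(ideas["text"])
    content = TruncatedSVD(n_components=n_topics, random_state=3).fit_transform(tfidf)  # LSI
    X = np.hstack([content, ideas[["votes", "comments"]].to_numpy()])
    clf = LogisticRegression(max_iter=1000).fit(X, ideas["implemented"])
    return clf.predict_proba(X)[:, 1]  # higher scores = ideas to prioritise
```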

Dr. Matthias De Beule

(Public Defense on January 11, 2018, co-advisor: Prof. Dr. Nico Van de Weghe)

His work is titled Advanced geomarketing modeling techniques for retail networks.

Summary of Dr. Matthias De Beule’s work:

In the last decades, physical retail has come under severe pressure due to growth of both physical and e-retailing supply. While retailers have to compete more intensely for consumers with shortening decision windows, their understanding of those very same consumers has also grown. This improved understanding is facilitated by the rise of Spatial Decision Support Systems, which combine sets of high-quality data on both customers and non-customers with the computation power to turn these data into actionable insights. Location planners can now use transactional, socio-demographic and store-related data to make better decisions on store openings, modifications or closings. A popular geomarketing technique to predict the financial outcome of such decisions is Spatial Interaction Modeling (SIM). SIM models spatial consumer behaviour as monetary expenditure flows from geo-referenced, aggregated consumer origins towards stores. These stores compete for the consumer spending potential by exerting a gravity-like attraction on consumers, with the magnitude of the attraction depending on store and consumer attributes and the geographical distance between the two. The parameters of a SIM are optimized based on observed but partial expenditure flows (e.g., from loyalty cards) and on store and enterprise turnovers.

This dissertation aims, firstly, at improving the fundamental understanding of two specific aspects of spatial consumer behaviour: the choice dynamics of consumers for whom two or more stores of the same brand spatially compete (sales cannibalization) and the impact of the features and format of the superordinate retail area on store (area) choice. Secondly, this dissertation aims at ensuring that a SIM is highly applicable for location planners in practice. Chapter 2 constructs a predictive spatial interaction model for the Belgian grocery market that is based on basic data sets and that yields robust predictions thanks to result validation on several levels of observed performance data. The incumbent model formulation is extended to incorporate sales cannibalization dynamics, and it is shown that this contributes to the overall predictive power of the model. Chapter 3 looks at sales cannibalization dynamics beyond the grocery market and makes a multi-retailer comparative study of store trade areas where sales cannibalization is likely to be present. Varying degrees of sales cannibalization are detected across product types and expansion strategies. Moreover, a varying impact of the superordinate retail area on sales cannibalization is found. Chapter 4 elaborates further on retail areas and links the commercial success for shopping-oriented goods within two shopping area formats in Flanders (city centers and out-of-town shopping strips) to different attributes. These attributes are based on qualitative input from a consumer survey and on quantitative spatial configuration metrics. City center commercial success mainly depends on ambient and social elements, while commercial success for the same goods in shopping strips depends on accessibility by car. Finally, it is shown that the findings of this dissertation are useful for different retail stakeholders: retailers, retail real estate developers and managers, and government urban planners and policy makers.
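
To make the SIM mechanism concrete, the sketch below shows a simplified Huff-type gravity formulation of the expenditure flows described above. The attractiveness measure and the parameter values are placeholders; in practice they would be calibrated against observed flows and store turnovers, as the dissertation does on a far richer model.

```python
# Simplified Huff-type spatial interaction model: each origin zone splits its
# spending over stores in proportion to attractiveness^alpha * exp(-beta * distance).
import numpy as np

def predict_store_turnover(spending, attractiveness, distance, alpha=1.0, beta=0.1):
    """
    spending:       (n_origins,) consumer spending potential per origin zone
    attractiveness: (n_stores,) store attractiveness (e.g., sales floor area)
    distance:       (n_origins, n_stores) distance matrix in km
    Returns predicted turnover per store.
    """
    utility = attractiveness[None, :] ** alpha * np.exp(-beta * distance)
    probs = utility / utility.sum(axis=1, keepdims=True)   # share of each origin's budget
    flows = spending[:, None] * probs                       # expenditure flows origin -> store
    return flows.sum(axis=0)

# Calibration would tune alpha and beta so that predicted turnovers match observed
# store turnovers and partial flows (e.g., loyalty-card data).
```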