Centralizing and Mastering Social Media Data

By Hmrizin, Mastech Infotrellis · April 1, 2013

Yes, social media is important for business.  We can thank the analysts, advocates, industry experts and the zillion articles and blog posts on the topic for making that case.

Now the question is where to start and how to go about consuming social media data.  What are the steps involved?  Are there any best practices, frameworks or patterns to follow?  This blog tries to answer some of these questions, so keep reading…

Within the general IT community we’ve learned many lessons over the years that have led us to the conclusion that data is an enterprise asset and that we need proper controls and governance over it.  Only then can we trust the results of analytics, use the data in real-time processes with confidence and gain operational efficiencies.

Social media data, or external data in general, is no exception to this.  We can and must apply techniques we commonly apply in master data management, data quality management and enterprise data warehousing disciplines so that we can draw the most value possible out of social media data.  On top of that there are additional techniques to apply given the unique nature of this type of data.

This article proposes the concept of a centralized and managed hub of social media data, or “social media master”.  It addresses the following topics:

  • Justification
  • Data Acquisition
  • Enrichment
  • Data Quality and “Data Quality Measures”
  • Relevance (i.e., finding the signal in the noise)
  • Consumption (i.e., use of the data)
  • Integration with other data and “Social Media Resolution”
  • Governance


This article does not go into much detail on why you should centralize and manage social media data; I would assume information management practitioners accept this as the right thing to do.  Instead the question is “when is the right time to do it?”  It is natural to take a project-based and simple approach to managing social media data when starting out with only one or two initiatives.  But you do not want to fall into the trap of having multiple initiatives on the go, each with its own silo of data that is inconsistent, incomplete and of unknown quality.  That is reminiscent of the proliferation of project- and department-based data warehouses in the 1990s that many organizations are still trying to address in their enterprise data warehousing strategies.

Data Acquisition

The first and foremost task is to collect the social media data of interest.  There are of course different ways this can be done such as:

  • Subscribing to the social media site’s streaming API (i.e., data is pushed to you).
  • Using the social media site’s API (i.e., you pull data from them).
  • Purchasing data from a third party provider.

All of the popular social media sites like Twitter and Facebook have well-documented APIs, schemas, and terms and conditions.

The method you choose will depend on the criteria and volume of data you expect.  For example, if you simply want all Tweets that mention your company name (or some other keywords) then perhaps subscribing to the social media site’s streaming API may be sufficient.  If you expect a large volume of Tweets and you want to go back in time then you will likely have to get that data from a third party provider.
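To make the streaming option concrete, here is a minimal Python sketch of filtering a pushed stream by keyword.  The newline-delimited JSON records and the `text` field name are assumptions for illustration, not any specific site’s API.

```python
import json

def filter_stream(json_lines, keywords):
    """Keep only stream records whose text mentions one of our keywords.

    `json_lines` stands in for the newline-delimited JSON a streaming
    API typically delivers; the `text` field name is illustrative.
    """
    keywords = [k.lower() for k in keywords]
    matched = []
    for line in json_lines:
        record = json.loads(line)
        text = record.get("text", "").lower()
        if any(k in text for k in keywords):
            matched.append(record)
    return matched

# A stand-in for a few records pushed over a streaming connection.
stream = [
    '{"id": 1, "text": "Just bought a grill at Acme Hardware"}',
    '{"id": 2, "text": "Lovely weather today"}',
]
hits = filter_stream(stream, ["acme hardware"])
print([r["id"] for r in hits])  # → [1]
```

In practice the keyword filtering often happens server-side (you tell the streaming API which terms to track), but the same shape of processing applies once the data lands on your side.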


Enrichment

All popular social media sites have a well-defined schema that describes the content of the data.  And the content for many includes the same basic data such as a user id (or handle), a name, location information, timestamp data and of course the actual social content such as the 140 characters of Tweet text.

This is raw data and should not be considered ready for consumption: you first need to apply quality functions to the data, measure the quality and also create relevance measures so you can find the “signal in the noise”.  There is also an opportunity to enrich the data, which not only helps with quality and relevance measures but also provides additional data that can be very useful in analytics.  Let’s look at a few examples.

The first example is enriching a Twitter user’s profile with gender information.  By analyzing the user’s name and handle it is possible to derive gender along with a confidence level.  It is not possible in all cases but is possible in many.  Gender is, of course, a very important dimension in analyzing data for many organizations.
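A toy sketch of how name-based gender derivation with a confidence level might look.  The reference table and confidence values here are purely illustrative; real systems use large census-derived name lists.

```python
# A toy reference table: first name -> (gender, confidence).
# These entries and confidence values are illustrative only.
NAME_GENDER = {
    "john": ("male", 0.99),
    "mary": ("female", 0.99),
    "pat": ("unknown", 0.50),  # ambiguous names get low confidence
}

def derive_gender(display_name):
    """Guess gender from the first token of a user's display name."""
    cleaned = display_name.strip()
    first = cleaned.split()[0].lower() if cleaned else ""
    return NAME_GENDER.get(first, ("unknown", 0.0))

print(derive_gender("John Smith"))   # → ('male', 0.99)
print(derive_gender("xX_gamer_Xx"))  # → ('unknown', 0.0)
```

Carrying the confidence value alongside the derived gender is what lets downstream consumers decide whether the enrichment is good enough for their purpose.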

A second example is analyzing the text of the social content.  For example, are there any mentions of brand names, product names or competitors?  A simple yet effective way is to use reference files of keywords and simple string matching to pull out this information.  Another way is to use more advanced natural language processing (NLP) and machine learning techniques, which are better suited for enriching the raw data with things such as sentiment and categories.
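The reference-file-and-string-matching approach can be sketched in a few lines; the brand and competitor lists here are made up for illustration.

```python
# Illustrative reference lists; in practice these would be loaded
# from curated reference files.
BRANDS = {"acme grill", "acme drill"}
COMPETITORS = {"widgetco"}

def tag_mentions(text):
    """Tag a post with any brand/competitor names it mentions."""
    lowered = text.lower()
    return {
        "brands": sorted(b for b in BRANDS if b in lowered),
        "competitors": sorted(c for c in COMPETITORS if c in lowered),
    }

print(tag_mentions("Loving my new Acme Grill, way better than WidgetCo!"))
# → {'brands': ['acme grill'], 'competitors': ['widgetco']}
```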

A final example is Foursquare check-ins (over Twitter), which broadcast a user’s location such as “I’m at Lowe’s Home Improvement (Mississauga, ON)”.  Different check-in services have different formats, but Foursquare check-ins usually follow the format “I’m at <Store-Name> (<Place>, <State/Province>)”.  This brief message is packed full of good information.  You can pull out not only city and state/province level information but also store-level information that can be matched to a reference file of stores and used as a dimension in analytics.
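A sketch of parsing that check-in format with a regular expression, assuming straight apostrophes and the exact layout described above; real check-in text varies, so a production parser would be more forgiving.

```python
import re

# Pattern for the check-in format described above:
# "I'm at <Store-Name> (<Place>, <State/Province>)"
CHECKIN = re.compile(r"I'm at (?P<store>.+) \((?P<place>[^,]+), (?P<region>[^)]+)\)")

def parse_checkin(text):
    """Extract store, place and state/province from a check-in message."""
    m = CHECKIN.search(text)
    return m.groupdict() if m else None

print(parse_checkin("I'm at Lowe's Home Improvement (Mississauga, ON)"))
# → {'store': "Lowe's Home Improvement", 'place': 'Mississauga', 'region': 'ON'}
```

The extracted store name can then be matched against a reference file of store locations, exactly as described above.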

Enrichment techniques give you the ability to augment the raw data with additional data that is very useful both in downstream use (analytics and real-time processes) and in subsequent activities for mastering the data.

Data Quality and Data Quality Measures

Data quality is an activity that cannot be ignored in any data management/integration exercise and the same applies to social media data.

Different quality functions can be applied depending on what data is available to you.  One simple example is analyzing free-form location information that can come in formats such as “Toronto”, “Toronto, Ontario”, “Toronto, On”, “Toronto, ONT”, “T.O.”, “Toronto Ontario”.  This data can be put into a standard format so it is consistent and uniform across the data set.
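One way to sketch this standardization is a variant-to-canonical lookup table.  The table below is hand-built for illustration; real implementations would draw on postal or geographic reference data rather than a hard-coded list.

```python
# Illustrative variant -> canonical mapping for one city.
CITY_VARIANTS = {
    "toronto": ("Toronto", "ON"),
    "toronto, ontario": ("Toronto", "ON"),
    "toronto, on": ("Toronto", "ON"),
    "toronto, ont": ("Toronto", "ON"),
    "t.o.": ("Toronto", "ON"),
    "toronto ontario": ("Toronto", "ON"),
}

def standardize_location(raw):
    """Normalize free-form location text to (city, province) if known."""
    key = " ".join(raw.lower().split())  # lowercase, collapse whitespace
    return CITY_VARIANTS.get(key)        # None when the variant is unknown

print(standardize_location("Toronto, ONT"))  # → ('Toronto', 'ON')
print(standardize_location("Narnia"))        # → None
```

Returning `None` for unrecognized values, rather than guessing, is what makes the quality measurable in the next step.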

Given this is external data that is not under your control, it is very important not just to apply quality functions but also to measure the quality.  For example, what is your level of confidence that the city data refers to an actual city?  Your confidence in whether or not the user actually lives in that city is a different matter altogether.

When you create and manage a hub of social media data you can expect that multiple consumers will have different uses for the data.  They will therefore pick data that is appropriate for them and that they believe is “good enough” for their purpose.  This is why measuring the quality of the data is important, if not critical.


Relevance

One important quality measure is “relevance”.  Ultimately, relevance is contextual: a set of data that is relevant to one user may not be relevant to another.  However, it is important to create a relevance, or qualification, score as a basis.

By definition, “Qualification = Quantifying the confidence you have in the quality of the data”.  In simple terms, the question that needs to be answered is “how confident are you that this particular piece of social media content, such as a Tweet, is relevant for you?”  As an example, a Tweet that you’ve acquired that has nothing to do with your company, competitors, products or brands may have a low (or zero) relevance score and can therefore be filtered out from downstream processing.
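As an illustration, a qualification score could combine a few such signals into a single 0-to-1 number.  The weights and field names below are assumptions made for the sketch, not a prescribed formula.

```python
def qualification_score(record, keywords):
    """Toy relevance score combining a few enrichment signals.

    Weights and field names are illustrative; each deployment would
    tune these to its own data and tolerance levels.
    """
    score = 0.0
    text = record.get("text", "").lower()
    if any(k in text for k in keywords):
        score += 0.6                      # on-topic content matters most
    if record.get("location_standardized"):
        score += 0.2                      # usable, standardized location
    if record.get("gender_confidence", 0) >= 0.8:
        score += 0.2                      # confident demographic enrichment
    return round(score, 2)

tweet = {"text": "My Acme grill broke again", "location_standardized": True}
print(qualification_score(tweet, ["acme"]))  # → 0.8
```

A consumer can then apply its own threshold, e.g. only use records scoring 0.5 or higher, which is the “filtering out” described above.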

This is finding the signal in the noise and ensuring consumers use data that matters, which leads to better business outcomes.


Consumption

Once the data has been acquired, enriched, quality-processed and measured, the “managed” social media data can be consumed from a centralized hub.


Just like an MDM hub, the consumers can be analytical or operational (real-time) in nature.  And just like an MDM hub, best practices should be followed in terms of security, audit, setting SLAs and having the right infrastructure components in place.

One major difference between an MDM hub and a social media hub is the level of trust and confidence in the data.  This is not a topic that can be ignored with MDM hubs but we are in a different game since we are dealing with external data versus internal data.  That is why enriching, measuring quality and measuring relevance of the data is critical.  It provides the ability for consumers to work with the data that is appropriate for their needs and tolerance levels.

It is also important to have a well-defined schema in place.  Much of the actual social media content is unstructured; however, there is structured data around it, and the enrichments and quality measurements are structured.  Just because the schema is well-defined doesn’t mean it has to be normalized into a fully typed relational model or object model.  What is most important is that there are basic structures in place to aid in consuming the data.


Integrating with other data and Social Media Resolution

Social media data, just like reference data, master data and other types of data, is not an island.  It is when you combine qualified and relevant social media data with your internal data that you have huge business potential.

In some cases an organization may have Twitter handles or other social account identifiers in their MDM hub that they can use to join to a social media hub to see relevant activity.  But for most this is not the case and instead they would need to look at “social media resolution” as a technique to match and merge data.

Social media resolution comes in two forms:

  • Resolving identities/accounts across social media services (e.g., a Twitter user to a Facebook user).
  • Resolving social media identities/accounts to internal data (e.g., a Twitter user to a customer in an MDM hub or enterprise data warehouse).

This is a very different problem from matching “customers to customers”, and different data points, techniques and technologies are required to make it happen.
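To give a flavour of the matching involved, here is a toy resolution sketch that prefers an exact identifier match and falls back to fuzzy name similarity.  The fields, threshold and scoring are illustrative only; production matching would weigh many more data points (handles, locations, activity patterns) with far more sophistication.

```python
from difflib import SequenceMatcher

def resolve(social_profile, customers, threshold=0.85):
    """Sketch of resolving a social account to an internal customer record."""
    best, best_score = None, 0.0
    for cust in customers:
        # An exact shared identifier (here, email) wins outright.
        if social_profile.get("email") and social_profile["email"] == cust.get("email"):
            return cust, 1.0
        # Otherwise fall back to fuzzy similarity of normalized names.
        score = SequenceMatcher(
            None, social_profile["name"].lower(), cust["name"].lower()
        ).ratio()
        if score > best_score:
            best, best_score = cust, score
    return (best, best_score) if best_score >= threshold else (None, best_score)

customers = [{"id": 7, "name": "Jonathan Smith"}, {"id": 8, "name": "Mary Jones"}]
match, score = resolve({"name": "Jonathan Smith", "handle": "@jsmith"}, customers)
print(match["id"])  # → 7
```

Note that resolution results should themselves carry a confidence score, for the same reason quality and relevance measures do: consumers need to decide how much trust to place in the link.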


Governance

All enterprise assets need to be properly governed, and to govern something you first need to measure it.  Therefore it is important to capture and analyze key measures of the social media hub such as:

  • New data acquired (how the data is changing)
  • Quality of the data and how it is trending (possible because we measure the quality)
  • Success in enriching the data
  • Who is using the data and how are they using it
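These measures can be rolled up from per-record data.  A small sketch, with made-up field names and records, might look like:

```python
from statistics import mean

# A handful of illustrative hub records carrying the measures
# described above; the field names are assumptions for this sketch.
records = [
    {"day": "2013-03-01", "quality": 0.70, "enriched": True},
    {"day": "2013-03-01", "quality": 0.90, "enriched": False},
    {"day": "2013-03-02", "quality": 0.95, "enriched": True},
]

def hub_measures(records):
    """Roll per-record measures up into governance metrics."""
    return {
        "new_records": len(records),
        "avg_quality": round(mean(r["quality"] for r in records), 2),
        "enrichment_rate": round(sum(r["enriched"] for r in records) / len(records), 2),
    }

print(hub_measures(records))
# → {'new_records': 3, 'avg_quality': 0.85, 'enrichment_rate': 0.67}
```

Tracking these numbers over time (per day, per source) is what makes quality and enrichment trends visible to the hub's governors.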

Below is an example of a Twitter Dashboard that is used to get insights into the key measures of a social media hub containing Twitter accounts and Tweets.



The past has taught us that we need to be proactive and properly manage data as an enterprise asset if we want to get the most out of it and have confidence in what we get.  Social media data is no different, and hopefully this article has provided you some insight into what a social media hub can look like and what it must do.

If you agree or disagree or want to chat further on the topic then please leave a comment or contact me directly!

