Many organizations are analyzing Tweets for purposes such as sentiment at an aggregate level, for example, "generally, what are people saying about us in the Twitter universe?" This is a good baby step into Big Data analytics, but where organizations want to get to is "what is my customer John Smith saying about us?" Customer-level analytics is much more valuable because it allows the organization to serve the customer better, identify "market of one" opportunities, and so on.
Matching Tweets to customer records is a prerequisite for such analytics, so what are the considerations in doing so? A key capability of MDM hubs is matching customer records together using structured data sourced from internal systems and applying traditional deterministic and/or probabilistic matching techniques. But the problem shifts dramatically when trying to match Big Data, and you need to re-think the solution given that the problem has changed.
Many are familiar with Twitter and Tweets. What some don't know is that a set of metadata is distributed with each Tweet. Some of it is useful for matching purposes, such as the user's name, the Tweet timestamp, high-level location information, and so on. This information, along with information in the text of the Tweet, triangulated with internal information, can yield high-quality matches.
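As a sketch of what that metadata looks like, here is a minimal example of pulling the matching-relevant fields out of a Tweet payload. The field names (`created_at`, `user.name`, `user.screen_name`, `user.location`) follow the classic Twitter REST API v1.1 Tweet object; other API versions differ.

```python
import json

def extract_match_fields(tweet_json: str) -> dict:
    """Pull out the metadata elements most useful for customer matching."""
    tweet = json.loads(tweet_json)
    user = tweet.get("user", {})
    return {
        "screen_name": user.get("screen_name"),  # stable account handle
        "display_name": user.get("name"),        # free-form, user-entered
        "location": user.get("location"),        # free-form, e.g. "Chicago, IL"
        "timestamp": tweet.get("created_at"),
        "text": tweet.get("text", ""),
    }

# Abbreviated sample payload (illustrative values only)
sample = ('{"created_at": "Wed Aug 27 13:08:45 +0000 2014", '
          '"text": "I\'m at Home Depot (Chicago, IL)", '
          '"user": {"name": "Wendi Goodman", "screen_name": "wgoodman", '
          '"location": "Chicago, IL"}}')
fields = extract_match_fields(sample)
```

Note that `display_name` and `location` are the free-form, user-entered fields discussed below, while `screen_name` is system-assigned and therefore far more reliable.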
So below are some considerations in matching Tweets to internal customer records.
Understand the quality:
Each Tweet is delivered with metadata that includes user profile information such as name and location (e.g., city and state). The challenge is that this information is entered by the user in a free-form way. Not only may it come in different formats, it may also be completely invalid. For example, the name field may contain a real name, an alias, or anonymous filler. To make things worse, the best you can do is profile a sample of the data to look for patterns.
Any element extracted from the Tweet (including the metadata) needs to be analyzed, and that insight should then influence matching. Take, for example, "Jade R", "Wendi Goodman", "ID CRE8TIVE" and "Mad Squirrel" (all real examples). Which names do you think would yield the best matching results?
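One way to turn that profiling insight into something actionable is a rough quality score per name. The rules below are purely illustrative, the kind of heuristics you might derive from profiling a sample, not a production-grade classifier:

```python
import re

def name_match_quality(name: str) -> str:
    """Rough heuristic: how promising is a Twitter profile name for matching?"""
    tokens = name.strip().split()
    # Two or more capitalized alphabetic words looks like a real "First Last" name.
    if len(tokens) >= 2 and all(re.fullmatch(r"[A-Z][a-z]+", t) for t in tokens):
        return "high"
    # Alphabetic words plus a bare initial give something, but less, to match on.
    if len(tokens) >= 2 and all(t.isalpha() for t in tokens):
        return "medium"
    return "low"

for example in ["Wendi Goodman", "Jade R", "ID CRE8TIVE", "Mad Squirrel"]:
    print(example, "->", name_match_quality(example))
```

Tellingly, "Mad Squirrel" scores "high" under these rules because it looks like a plausible First Last name, which is exactly why pattern profiling alone is not enough and the score should only weight, not decide, a match.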
Filter out the noise:
Depending on how you obtain the data, you may have a significant amount of "noise Tweets". For example, a Tweet search on "lowe's OR lowes" I just performed returned only about 15%–20% Tweets that are actually about Lowe's Home Improvement. While I find it interesting that Rob Lowe hates decaf coffee, I doubt Lowe's Home Improvement would want to include that Tweet in its analytics.
You therefore have to interrogate each Tweet and either qualify it or tag it as noise.
It is also useful to maintain an "Ignore List" that filters out Tweets that come from a particular user, fit a pattern, or are re-Tweeted often. For example, consider Walgreens, which subscribes to and receives this Tweet from "Grocery Coupons":
“B1G1 50% off Neutrogena Sun Care Items at Walgreens + $1 off Coupon!”
It mentions the brand "Neutrogena", but unless you are interested in who is advertising promotions on Twitter, it won't go very far in matching to customer records.
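An Ignore List can be as simple as a set of user handles plus a few patterns. The rules below are hypothetical examples of what a Walgreens-style subscriber might configure, not a recommended rule set:

```python
import re

# Hypothetical ignore rules: known promotional accounts and coupon-style patterns.
IGNORED_USERS = {"GroceryCoupons"}
IGNORED_PATTERNS = [
    re.compile(r"\bB\d+G\d+\b", re.IGNORECASE),  # "B1G1"-style promo codes
    re.compile(r"\$\d+ off", re.IGNORECASE),     # "$1 off"-style offers
]

def is_noise(screen_name: str, text: str) -> bool:
    """Tag a Tweet as noise if it comes from an ignored user or fits a promo pattern."""
    if screen_name in IGNORED_USERS:
        return True
    return any(p.search(text) for p in IGNORED_PATTERNS)
```

The coupon Tweet above would be tagged as noise twice over: it comes from an ignored user and it matches both promo patterns.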
Get insights by analyzing the text of the Tweet:
While 140 characters isn't a lot of room, there can still be (and often are) gold nuggets in there that are useful for matching purposes. Take this real Tweet found in a simple search for "home depot delivery": I ordered a spa/hot tub from Home Depot 5 weeks ago. We've been trying to determine the delivery status for 3 weeks.
Natural Language Processing (NLP) and text analytics have been advancing in both functionality and usability. Even without advanced processing, there are key insights in this text. For example:
First sentence: "ordered" indicates this person is a customer, given its tense. "spa/hot tub" indicates a type of product. "5 weeks ago" indicates when something (e.g., the order) took place.
Second sentence: "delivery" indicates, well, that there is a delivery of something. "3 weeks" indicates a time, but without proper text analytics it can be difficult to tell whether the delivery is in 3 weeks, was 3 weeks ago, or something else.
These insights pulled from the text are valuable in the matching exercise.
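Even the simple, non-NLP route can be sketched in a few lines. The product vocabulary here is a hand-picked stand-in for a real product catalog, and the time pattern deliberately captures the ambiguity noted above (it finds "3 weeks" but cannot say what the 3 weeks refers to):

```python
import re

# Illustrative vocabulary; a real system would use text analytics plus a product catalog.
PRODUCT_TERMS = ["spa", "hot tub", "delivery", "shelving unit", "a/c unit"]
TIME_PATTERN = re.compile(r"\b(\d+)\s+(day|week|month)s?\s*(ago)?\b", re.IGNORECASE)

def extract_signals(text: str) -> dict:
    """Pull product mentions and time expressions out of Tweet text."""
    lowered = text.lower()
    products = [term for term in PRODUCT_TERMS if term in lowered]
    times = TIME_PATTERN.findall(text)  # (amount, unit, "ago" or "")
    return {"products": products, "time_mentions": times}

tweet = ("I ordered a spa/hot tub from Home Depot 5 weeks ago. "
         "We've been trying to determine the delivery status for 3 weeks.")
signals = extract_signals(tweet)
```

For this Tweet, "5 weeks" comes back anchored with "ago" while "3 weeks" comes back with an empty third group, making the unresolved ambiguity explicit in the output.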
Think beyond demographic matching elements:
This is an extension of the above consideration. Typically, demographic information such as names and addresses is used when matching "party records" together (whether they play the role of customer, prospect, or other). To effectively match Tweets to a customer, you also need to consider information such as transaction data (type of transaction, date, location).
Normally this is a daunting thought when it comes to performance, but new technologies such as Hadoop (and other Big Data platforms) make it feasible.
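To make the idea concrete, here is a minimal sketch of checking whether an internal transaction is consistent with the signals pulled from a Tweet. The field names and the seven-day window are assumptions for illustration, not a real MDM schema:

```python
from datetime import date

def transaction_supports_tweet(tweet_signal: dict, transaction: dict,
                               window_days: int = 7) -> bool:
    """Is this transaction consistent with the Tweet's product, place and time?"""
    if tweet_signal["product"] != transaction["product_category"]:
        return False
    if tweet_signal.get("city") and tweet_signal["city"] != transaction["store_city"]:
        return False
    gap = abs((tweet_signal["inferred_date"] - transaction["date"]).days)
    return gap <= window_days

# "5 weeks ago" resolved against the Tweet timestamp into an estimated order date
signal = {"product": "hot tub", "city": "Chicago", "inferred_date": date(2014, 8, 1)}
txn = {"product_category": "hot tub", "store_city": "Chicago", "date": date(2014, 7, 29)}
```

A positive result here is corroborating evidence, not proof of a match; at Big Data scale the point is that a Hadoop-style platform lets you run such checks across every candidate pair.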
Use a set of tweets:
People who Tweet tend to Tweet often and to use popular services like Foursquare and Yelp, which broadcast their location. You should match a set of a user's Tweets to customer (or product) data instead of matching just a single Tweet. This way you have more matching data to work with, which should lead to higher-quality matches.
Take these two real Tweets sent from a single user, less than 5 hours apart:
“I’m feeling manly today. Coat rod, shelving unit and A/C unit install today. Home depot is my friend. Home depot may be addicting.” (sent 34 minutes ago)
“I’m at Home Depot (Chicago, IL) 4sq.com” (sent 5 hours ago)
The two Tweets combined provide extra information that can be used in matching.
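Folding a user's recent Tweets into one profile before matching can be sketched like this; each input Tweet is a dict of already-extracted signals (the keys are assumptions for illustration):

```python
def build_user_profile(tweets: list) -> dict:
    """Merge one user's extracted Tweet signals into a single matching profile."""
    profile = {"locations": set(), "products": set(), "timestamps": []}
    for t in tweets:
        if t.get("location"):
            profile["locations"].add(t["location"])
        profile["products"].update(t.get("products", []))
        profile["timestamps"].append(t["timestamp"])
    return profile

# The two Home Depot Tweets above, reduced to signals
tweets = [
    {"location": "Home Depot (Chicago, IL)", "products": [],
     "timestamp": "2013-05-04T09:00"},
    {"location": None, "products": ["coat rod", "shelving unit", "a/c unit"],
     "timestamp": "2013-05-04T13:26"},
]
profile = build_user_profile(tweets)
```

Neither Tweet alone gives both a store location and a product list; the merged profile has both, which is exactly the extra leverage the combined Tweets provide.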
Re-think data stewardship:
Data stewardship over metadata, reference data, and master data is achievable because those data sets are usually of manageable size. But given the volume (and velocity) of Big Data, it can be daunting to make many manual decisions, such as confirming a match of a Tweet to a customer record. Data stewardship is still very important for Big Data, but you have to re-think it to a degree. One good example is machine learning: the system needs to learn from the actions the steward takes so it can apply them to other data sets and minimize the need for manual intervention.
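A real system would train a model over match features, but the feedback loop can be shown in miniature: remember the steward's verdict and replay it whenever the same feature pattern recurs, only escalating genuinely new cases. The feature names are hypothetical:

```python
def make_key(features: dict) -> tuple:
    """Canonical, hashable form of a feature set."""
    return tuple(sorted(features.items()))

class StewardMemory:
    """Minimal sketch of learning from steward decisions (lookup, not a trained model)."""

    def __init__(self):
        self.decisions = {}

    def record(self, features: dict, confirmed: bool):
        """Store the steward's verdict for this feature pattern."""
        self.decisions[make_key(features)] = confirmed

    def auto_decide(self, features: dict):
        """True/False if the pattern was seen before; None means a steward must decide."""
        return self.decisions.get(make_key(features))

memory = StewardMemory()
memory.record({"name_quality": "high", "location_match": True}, confirmed=True)
```

Every pattern the steward has already ruled on is decided automatically from then on, so manual effort is spent only on cases the system has never seen.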
Record your knowledge:
Once you've matched a Tweet to a customer record with sufficient confidence, record the outcome. For example, store the user's Twitter account id in the MDM hub, perhaps with summary information such as their social media influence (number of followers, number of Tweets) along with an as-of date, since it is point-in-time data. This way you can automatically match new Tweets from that user using the account id.
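That record and the fast path it enables might look like the following; the attribute names and ids are illustrative, not a specific MDM hub schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class SocialProfileLink:
    """A confirmed Tweet-to-customer match, stamped with an as-of date."""
    customer_id: str
    twitter_user_id: str
    followers: int
    tweet_count: int
    as_of: date  # point-in-time: influence figures go stale

links = {}  # stand-in for the MDM hub's social-link store

def record_match(customer_id, twitter_user_id, followers, tweet_count, as_of):
    links[twitter_user_id] = SocialProfileLink(
        customer_id, twitter_user_id, followers, tweet_count, as_of)

def match_by_account(twitter_user_id):
    """New Tweets from a known account match automatically by id."""
    link = links.get(twitter_user_id)
    return link.customer_id if link else None

record_match("CUST-1001", "123456789", followers=420, tweet_count=3100,
             as_of=date(2014, 9, 1))
```

Once the link exists, every subsequent Tweet from that account resolves to the customer by id alone, with no re-matching against names, locations, or transactions.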