Bots won’t buy you any fan engagement – or will they?

Fan Engagement in the MLS

We’re about three months into the MLS season – time for another edition of the highly unofficial Hype-O-Meter. With two important changes:

  1. How “real” are your followers: When Atlanta United entered the scene, there was much debate about how much they had bought their way into their “from-zero-to-hero” status on Twitter. An expansion team suddenly having the most followers in the league smells fake. And it seems to be exactly that. Inspired by the press coverage of Trump’s Twitter exploits, I used “Twitter Audit” to analyze a sample of Twitter followers for each team. Although this is not a perfect representation of the exact makeup of each team’s followers (e.g., I did not pay for a pro account and therefore had to rely on data from earlier analyses, and only sub-samples were analyzed), it can serve as a decent proxy.

“Each audit takes a sample of up to 5000 (or more, if you subscribe to Pro) Twitter followers for a user and calculates a score for each follower. This score is based on number of tweets, date of the last tweet, and ratio of followers to friends. We use these scores to determine whether any given user is real or fake. Of course, this scoring method is not perfect but it is a good way to tell if someone with lots of followers is likely to have increased their follower count by inorganic, fraudulent, or dishonest means.”

What I found was quite interesting. As some commentators had suspected earlier: Atlanta seems to have added quite a bit to their follower count. In fact: Only 48% of their 517,134 followers are deemed real. Interestingly, though, that does not seem to hurt them. Usually, simple bots or fake accounts won’t buy you any fan engagement (as measured in likes and retweets), but Atlanta still comes in a solid 3rd. However, they are the exception. Several other teams in the MLS seem to have padded their follower count and, as a result, find themselves mostly towards the bottom of the engagement chart.

Here is the complete list of “fakeness” in the MLS:

Team            Followers   % of “Real” Followers
Vancouver       308005      26
Toronto         290872      42
Atlanta         517134      48
Houston         343049      51
Montreal        284534      61
Orlando         387925      63
San Jose        203446      63
Washington      117203      64
Seattle         407927      66
Portland        285669      73
Los Angeles     391280      74
NY Red Bulls    184798      76
Dallas          134959      77
Colorado        84481       77
Kansas City     297709      78
Chicago         139334      80
Philadelphia    109161      81
NYCFC           332828      83
New England     89855       84
Columbus        143143      85
Salt Lake       127778      87
Minnesota       60619       88

No surprise: all professional / celebrity accounts will draw some noise and attract the occasional bot follower. However, the result for Vancouver was quite shocking. Of their more than 300k followers, only 26% (or about 80k) are active enough to be counted as “real”. I almost hope that this is some form of slip-up, but the engagement numbers match the trend. For the 2nd time in a row, the Whitecaps are among the bottom three of the league, generating only 0.86 favorites and 0.6 retweets per 10,000 followers. They were also voted the least appealing Twitter account by Howler Magazine. Houston and Montreal, the only two teams with less fan engagement, also rank in the top quarter of the “fake followers” list. On the other side of the spectrum: Minnesota United deserves a big shout-out. The expansion franchise seemingly chose the slow(er) route of organic growth on social media and now tops the Engagement Ranking for the 2nd time in a row, by a whopping margin.

2. Bye, bye – Facebook. I decided to leave Facebook out of the analysis. Twitter and Facebook are very different platforms, and lumping them together in one analysis is likely to confound the results. Instead, I decided to focus on Twitter, which became even richer from a data perspective given the addition of “Twitter Audit” and some planned further analyses.

Some interesting findings / thoughts.

  • Overall engagement and tweet frequency are up from the pre-season analysis. Which makes perfect sense given that game days are expected to a) see more tweets, and b) get fans much more involved.
  • Chicago Fire and L.A. Galaxy upped their game. Not sure if it is the “Schweinsteiger effect” for Chicago or getting quite a bit of TV time for Los Angeles, but both teams jump significantly in the ranking. They might have also simply upped their social media efforts for the season: L.A. was just voted as having the best memes in the MLS.
  • Can’t buy me love – or can I? Atlanta is somewhat of a conundrum. I just complained about their (presumably) artificially bolstered follower count and how that should diminish their fan engagement scores, and yet they rank among the top three for the 2nd time in a row. How can that be? We can’t be sure, but there are some possible explanations:
    • 1) Even without the suspected bots, United would still sport almost 250k followers – the 4th most in the MLS (when all teams are adjusted to their true follower count based on Twitter Audit data). So: There is quite some buzz surrounding the team — and maybe making the follower count look nice early on kick-started overall engagement. When we only look at this “core” group of followers and calculate engagement based on them, Atlanta comes in first. By far.
    • 2) I don’t want to suggest anything here – I really like how Atlanta has kick-started their campaign on the pitch as well as online – but one could also suspect that they invested in smart bots that automatically like and share content instead of “dead” fake accounts. In the end, though, any brand engaging in such behavior would shoot itself in the foot. No matter how well a bot is programmed, unless you also train it to buy your merchandise and sit in the stands, the ROI simply won’t be there. Instead, you’ll have to explain why you seem to have a gazillion fans that never buy anything.
  • Love thy fans! There is quite a variance in the amount of interactivity among teams and their fans – at least when taking replies (and retweets) on Twitter as a proxy. Seattle leads the reply charts: more than 28% of all original tweets were in reply to another Twitter account — compared to 4% for Orlando and Montreal. Looking at retweets, Salt Lake is king. Almost 39% of all tweets are re-tweets. On the other end of the spectrum: Philadelphia posts the most original content with “only” 9% of RTs.
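The “true follower count” adjustment mentioned for Atlanta can be reproduced from the table above. The following is a minimal sketch, assuming the adjustment is simply followers multiplied by the share of “real” followers; the function and variable names are my own and not part of the original analysis.

```python
# (followers, % "real") per team, copied from the Twitter Audit table above.
MLS = {
    "Vancouver": (308005, 26), "Toronto": (290872, 42), "Atlanta": (517134, 48),
    "Houston": (343049, 51), "Montreal": (284534, 61), "Orlando": (387925, 63),
    "San Jose": (203446, 63), "Washington": (117203, 64), "Seattle": (407927, 66),
    "Portland": (285669, 73), "Los Angeles": (391280, 74), "NY Red Bulls": (184798, 76),
    "Dallas": (134959, 77), "Colorado": (84481, 77), "Kansas City": (297709, 78),
    "Chicago": (139334, 80), "Philadelphia": (109161, 81), "NYCFC": (332828, 83),
    "New England": (89855, 84), "Columbus": (143143, 85), "Salt Lake": (127778, 87),
    "Minnesota": (60619, 88),
}


def real_followers(followers, pct_real):
    """Estimated number of real followers, rounded to whole accounts."""
    return round(followers * pct_real / 100)


# Adjust every team, then rank by the adjusted ("core") follower count.
adjusted = {team: real_followers(f, p) for team, (f, p) in MLS.items()}
ranking = sorted(adjusted, key=adjusted.get, reverse=True)

for place, team in enumerate(ranking[:5], start=1):
    print(place, team, adjusted[team])
```

Under this assumption Atlanta keeps roughly 248k followers and lands in 4th place, which is consistent with the numbers quoted above.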


[UPDATE] Following some further analyses and great feedback, I have adjusted the engagement formulas to include adjustments for playing games on national television, as well as market size.

[Figure: Most engaging NBA teams on social media (Twitter and Facebook)]

1. Getting the Data

1.1. Twitter

I use “R” to access Twitter’s REST API, which provides programmatic access to tweets, user profiles, follower data, etc.* Instead of pulling all the information manually, I use a script that downloads up to roughly 3,200 of the most recent tweets made by the selected accounts (in this case: all NBA teams) including the info I am interested in (retweets, favorites, replies, etc.). Once the data is downloaded, I save it in individual .json files for further analysis.

* The REST API does not provide access to real-time data (we would need the Streaming API), but since I’m interested in accounts instead of ongoing conversations, the REST API works better.

1.2. Facebook

Again, I use R to access the API. The fantastic “Rfacebook” library allows downloading information about public posts from public pages. Instead of downloading a set number of posts, I restrict my data collection to a certain timeframe. In this case: the NBA season so far. More specifically, I include all posts made by the official Facebook pages of all NBA teams between the beginning of September 2016 (Naismith Memorial Basketball Hall of Fame Enshrinement) and the end of January 2017 (time of data collection).

2. Cleaning & Normalizing the Data

2.1. Followers / Fans

When looking at engagement on social media**, there are several confounding variables that need to be taken into account before starting any analysis. One of the most obvious (and also the easiest to fix) is the number of followers each team has. A team with more than 5.5 million followers (such as the Lakers) will naturally elicit more favorites and retweets than a team with ~ 560k followers (can people please start following the Utah Jazz), simply by having each tweet shown to a bigger audience. The same logic, of course, applies to Facebook, where the official Lakers page has ~ 22 million fans and Utah’s has about 1.2 million.

** The term “engagement” is used rather vaguely in both academia and the industry. For this analysis, I refer only to the behavioral component of engagement. Or rather: a crude proxy of it  — favorites and retweets for Twitter and Likes, Comments, and Shares for Facebook. To be perfectly clear here: This measure is rather a measure of the breadth than the depth of engagement because it does not tell us anything about “why” a Twitter user liked or favorited a tweet.

To create a level playing field, I need to control for the number of followers/fans each team has. As a first step, I divide the number of favorites, retweets (for Twitter), as well as likes, comments, and shares (Facebook) by the number of followers/fans each official team account/page has (I downloaded that information as part of the data collection process). However, since the resulting number (favorites/retweets/likes/comments/shares per single follower) is minuscule and meaningless in any practical sense, I multiply it by 10,000. Given the range of followers most NBA teams have, the resulting “per 10,000 followers”-variable provides a good starting point to compare fan reactions to tweets and Facebook posts across teams.
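The normalization itself is a one-liner. Here is a minimal sketch in Python (the original analysis was done in R); the function name and the average favorite counts are hypothetical and chosen only to illustrate the idea, while the follower counts are the rough figures quoted above.

```python
def per_10k_followers(interactions, followers):
    """Scale an interaction count (favorites, retweets, likes, ...) to 10,000 followers."""
    if followers <= 0:
        raise ValueError("followers must be positive")
    return interactions / followers * 10_000


# Hypothetical average favorites per tweet; follower counts as quoted in the text.
lakers = per_10k_followers(interactions=550, followers=5_500_000)
jazz = per_10k_followers(interactions=56, followers=560_000)

print(round(lakers, 2), round(jazz, 2))
```

In this made-up example the Lakers collect about ten times as many favorites per tweet, yet both teams end up with the same normalized value, which is exactly the point of the adjustment.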

2.2. Replies

Another potentially confounding variable: replies. When a tweet starts with a @username (aka is a reply), the only users who will see it in their timeline (other than the sender and the recipient) are those who follow both the sender and the recipient. This reduces the potential audience (and therefore the potential for engagement) quite a bit. In other words: Teams with more replies in their data would be disadvantaged in any subsequent calculation of engagement (as measured in favorites and retweets – see below). Therefore, I separate the replies from the “original” content and only analyze the latter.

However, I don’t want to throw away that information. How teams reply to tweets – and therefore directly interact with followers – is a great separate indicator of fan engagement. Even though it is hard to quantify (and therefore not included in the engagement calculation), interacting directly with fans can be seen as a proxy for the effort/manpower each team puts into their social media strategies. Here are the most interactive NBA teams on Twitter:

  1. Portland Trail Blazers — 29.4%
  2. Memphis Grizzlies — 23.1%
  3. Sacramento Kings — 22.6%
  4. Denver Nuggets — 19.4%
  5. Miami Heat — 16.5%
  6. Atlanta Hawks — 13.9%
  7. New Orleans Pelicans — 10.7%
  8. Orlando Magic — 10.2%
  9. Philadelphia 76ers — 10.2%
  10. Utah Jazz — 9.2%
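Separating replies from original content, and computing the share of replies shown in the list above, could look roughly like this. It is a sketch that assumes each tweet is available as a plain text string and treats any tweet starting with “@” as a reply; the sample tweets and function names are made up for illustration.

```python
def is_reply(text):
    """Treat a tweet as a reply when it starts with an @username."""
    return text.lstrip().startswith("@")


def split_replies(tweets):
    """Return (original_tweets, replies) from a list of tweet texts."""
    originals = [t for t in tweets if not is_reply(t)]
    replies = [t for t in tweets if is_reply(t)]
    return originals, replies


def reply_share(tweets):
    """Percentage of tweets that are replies."""
    if not tweets:
        return 0.0
    _, replies = split_replies(tweets)
    return 100 * len(replies) / len(tweets)


# Hypothetical timeline of four tweets, one of which is a reply.
timeline = [
    "Game day! Tip-off at 7pm.",
    "@some_fan Thanks for coming out last night!",
    "Final score: 101-98.",
    "Tickets for Friday are on sale now.",
]
originals, replies = split_replies(timeline)
print(len(originals), len(replies), reply_share(timeline))
```

Only `originals` would go into the engagement calculation; `reply_share` is the separate interactivity indicator.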

2.3. All-Star Voting

To vote for their favorite player, users were encouraged to tweet, retweet or reply with a player’s first and last name or Twitter handle, along with the hashtag #NBAVOTE. Teams — as a way to promote their players — would then post tweets containing “#NBAVOTE” and encourage fans to retweet. As a result, teams with more popular players would likely receive more retweets. To reduce the potential effect of the All-Star Voting, I eliminated all tweets containing the “#NBAVOTE” hashtag from the dataset.***

*** I retained that information for the Facebook posts, though.
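Dropping the All-Star voting tweets is a simple text filter. A minimal sketch, assuming tweets are plain strings and matching the hashtag case-insensitively (the original post does not say how case was handled); the sample tweets are invented.

```python
VOTE_TAG = "#nbavote"


def without_vote_tweets(tweets):
    """Remove every tweet that contains the #NBAVOTE hashtag (any capitalization)."""
    return [t for t in tweets if VOTE_TAG not in t.lower()]


# Hypothetical sample: two voting tweets and one regular tweet.
sample = [
    "Retweet to vote! Firstname Lastname #NBAVOTE",
    "Back at it tomorrow night.",
    "One more day to vote #NBAVote",
]
kept = without_vote_tweets(sample)
print(kept)
```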

2.4. On-Field Success [update]

A team’s success is often the most powerful predictor of fan engagement. From a logical perspective, it’s much easier (and more pleasant) to create content for a winning team and get people to like it than it is to pick up the pieces after losses (you don’t really “like” a loss, right?). In fact, I ran a regression model predicting fan engagement from a range of variables, and the current season record of a team emerged as the most powerful predictor. Why does this matter? Well, I’m mostly interested in “who does the best job on social media?” (as in: which team has the best social media folks) – and not in “whose fans are most excited for some other reason?”. As a result, I need to control for the effect of on-field success. To do so, I calculated how much each win contributes to the different engagement indicators. For example, each win is (on average) worth 148 likes on Facebook.

2.5. Television [UPDATE]

Social media and TV go well together. Twitter is often considered the premier 2nd screen medium in the realm of sport: teams promote their Twitter handles on their courts, Twitter promotes specific hashtags for events. As a logical consequence, teams with a greater presence on national television (ESPN, ABC, TNT) should by default generate more engagement. If all teams were to get equal TV time, we could just neglect this factor — but they don’t. At the time of data collection, the average team had been on national TV (excl. NBA TV & League Pass) about 7 times. However, while teams like the Warriors (22 games) and Clippers (17) have had plenty of time to “promote” themselves on air, the Nuggets and Nets (each 1), and Magic (0) don’t get that chance very often (thankfully, the NBA provides that type of data). Long story short: I ran a series of models to compute how much each TV appearance on ESPN, ABC, and TNT contributes to the overall engagement — and being on the telly matters quite a lot. For example: Every time your team plays on ESPN, you get an additional ~900 likes for your Facebook content.

2.6. Market Size [UPDATE]

This is a tough one. One of the biggest (theoretical) advantages of social media for all sorts of businesses is that it creates a somewhat level playing field. A small, family-owned business in Buford, Wyoming, can (theoretically) reach the same worldwide audience as a major corporation in New York City. The reality, however, looks a bit different. While social media has certainly opened up new avenues for smaller-market teams to flourish and reach fans beyond their traditional market, the sometimes dramatic differences in the home markets of NBA teams still matter. For example, teams in New York or Los Angeles (the two biggest media markets in the NBA) will have a very different “baseline media exposure” than the Memphis Grizzlies or New Orleans Pelicans in the two smallest TV markets in the league. Overall, the effect is not dramatic, but can certainly make a difference for some teams. For example: The Knicks will automatically get ~100 more comments on Facebook than the Portland Trail Blazers just because of the market.

3. Calculating Engagement

What is a like worth? Or a retweet? Assigning values to user behavior is complicated. How do we know why an individual likes a piece of content? Well, we don’t. One might like a tweet because it is interesting. One might like a tweet to archive it. Or in hopes of being recognized by the creator of the content. Or because someone we care about cares about the content and we want to show that we care, too. In other words: We often can’t know if a user really cares about our content – or if (s)he is using our content as a relationship-building token or a virtual currency for social attention.

In any case, though, the general consensus is that fan engagement on social media matters. Some of my own research, for example, has shown that increased interactivity in the form of comments on Facebook relates to traffic being referred to an organization’s website. And even a “like” represents an individual’s engagement with the creator of the content. Even though a user might have liked it for some other reason, (s)he must have a) been exposed to it, and b) not been too appalled by it to have it associated with their online identity.

Building on that argument, we can then start thinking about different degrees of engagement. A comment, for example, represents greater psychological (one has to think about what to comment) and physical (one has to actually type it out) effort than simply clicking the like button. As a consequence: One who comments must care more about the content when willing to exert this additional effort. Therefore, a comment should be “worth” more than a like when calculating fan engagement on social media. Finally, a share not only often represents an endorsement of the underlying content, but also expands the reach of the original post beyond the initial audience (connections of the one who shares might not follow the content creator) and should, therefore, be of even greater value to the content creator.

Based on this logic, I can assign weights to the individual proxies of fan engagement and calculate a single score across platforms. Is that score going to be a perfect representation of “how well” a team is doing on social media? No. Certainly not. The actual numbers are arbitrary and the resulting final score has no deeper practical meaning (you can’t buy anything for let’s say 90 Engagement), but they allow a normalized comparison across teams. People like — and often need — a simplified (key) performance indicator to evaluate their performance and allow a (crude) comparison with their competitors. This is what this score does. At least I hope it does. I call it:

Win-adjusted Normalized Engagement Score (or: WANE Score).

And this is what it looks like [UPDATE]:

In the first step, I adjust each individual engagement indicator by the major control variables identified above to adjust the scores for teams’ appearances on three major TV channels, their market size, and winning. For the TV and success adjustments, I take the league average as a standard and adjust every team towards that mean. For example, a team like the Warriors will lose likes, comments, etc. for each game they are over the league average for TV games and wins, and a team like the Nets will have points added.
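The “adjust towards the league mean” step can be sketched as follows. This is only an illustration of the idea described in the paragraph above, assuming a simple linear correction; it ignores any further team-specific scaling. The 148 likes per win come from section 2.4, while the win totals and raw like counts are hypothetical.

```python
from statistics import mean


def adjust_to_league_mean(raw, team_value, league_values, per_unit):
    """Remove (or add) the engagement attributed to being above (or below) the league mean.

    raw           -- unadjusted engagement indicator (e.g. Facebook likes)
    team_value    -- the team's value on the control variable (e.g. wins, TV games)
    league_values -- that control variable for every team in the league
    per_unit      -- estimated engagement contributed by one unit (e.g. 148 likes per win)
    """
    return raw - per_unit * (team_value - mean(league_values))


# Hypothetical three-team league: win totals and identical raw like counts.
wins = {"Team A": 40, "Team B": 30, "Team C": 20}
raw_likes = {"Team A": 5000, "Team B": 5000, "Team C": 5000}

adjusted_likes = {
    team: adjust_to_league_mean(raw_likes[team], wins[team], list(wins.values()), per_unit=148)
    for team in wins
}
print(adjusted_likes)
```

A team above the league average loses likes, a team below it gains some, and a team exactly at the average is left unchanged. The same function would apply to TV appearances with a different `per_unit` value.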

However, not all teams can be assumed to benefit from these factors equally. A team putting relatively few resources into the creation of engaging social media content on a daily basis won’t get as big of a boost from an additional win than a team that is constantly developing new formats. To adjust for that (unknown) factor, I created an adjustment based on the baseline social media engagement ranking for each team and each channel:

[Figure: Social media engagement adjustment formula (NBA)]

Once I have adjusted all the individual indicators (Twitter = favorites, retweets; Facebook = likes, comments, shares ) based on this formula, I can use it to further calculate the overall engagement:

[Figure: NBA social media engagement formula calculations]

What this formula does is normalize the adjusted average number of favorites and retweets (for Twitter) and likes, comments, and shares (for Facebook) by the number of followers each team has, then assign weights to them following the logic explained above, and finally add them up. In the final step, I combine the values for Twitter and Facebook and then normalize the score to engagement per 10,000 followers.
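To make these steps concrete, here is a rough sketch of a weighted score. The weights below are placeholders I picked to respect the ordering argued for above (like < comment < share); they are not the weights used in the actual WANE formula, and summing the two platforms is likewise an assumption. All input numbers are invented.

```python
# Hypothetical weights: a comment counts more than a like, a share/retweet most.
WEIGHTS = {"favorites": 1.0, "retweets": 3.0, "likes": 1.0, "comments": 2.0, "shares": 3.0}


def platform_score(avg_counts, followers):
    """Weighted sum of (already adjusted) average interactions per single follower."""
    return sum(WEIGHTS[name] * value / followers for name, value in avg_counts.items())


def combined_score(twitter, twitter_followers, facebook, facebook_fans):
    """Add both platforms and express the result per 10,000 followers."""
    per_follower = platform_score(twitter, twitter_followers) + platform_score(facebook, facebook_fans)
    return per_follower * 10_000


# Made-up team: adjusted averages per tweet / per post.
score = combined_score(
    twitter={"favorites": 100, "retweets": 50}, twitter_followers=1_000_000,
    facebook={"likes": 2000, "comments": 100, "shares": 200}, facebook_fans=2_000_000,
)
print(round(score, 2))
```

As with the real score, the absolute number means little on its own; it only becomes useful when compared across teams computed the same way.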


4. So what does all that mean?

Good question. Although we can’t take the WANE Score as an absolute value and measure of success, the calculations provide at least a starting point for comparing fan engagement on social media across teams. The results show dramatic differences in fan engagement within the NBA and might give us an idea where to look for successful social media strategies.

Here are some high-level observations:

  • Posting frequency varies quite a bit across teams, on both Twitter and Facebook. While the Orlando Magic only sent out 1530 eligible tweets since September 2016, several teams tweeted more than 3200 times (which was the maximum I could collect). On Facebook (where I could get all data independent of the number of posts), the average team published 563 posts (~ 3-4 posts per day). Still, there was quite some variance in the data. Memphis published the most content with 809 posts, the Lakers the least with 358 posts over the course of the season so far.
  • Some teams are very likeable – others not so much. The Warriors get about 26 likes per 10,000 Fans on Facebook and 3 favorites per 10,000 Followers on Twitter, which makes them the “most likeable” team in the league. By far. They lead the Cavaliers by about 9 points on the combined scale. Milwaukee comes in third, just ahead of Philadelphia, Houston, and San Antonio. On the other end of the scale: The Mavericks and Pistons on Facebook (with fewer than 3 likes per 10,000 fans), and the Pelicans and Magic on Twitter (with less than half a favorite per 10,000 followers).
  • “Most Viral” content. It’s the Warriors, again. Golden State generates about 2 shares per 10,000 fans on Facebook, followed by Philadelphia (1.53) and Atlanta (1.35). On Twitter, Toronto stands out (3.66 retweets per 10,000 followers) – with the Cavs (2.80) and Sixers (2.68) to follow. Combined, the Warriors produce the “most viral” content, followed by the Sixers, Cavs, and Raptors. On the other end of the scale: The Nets, Heat, and Nuggets for Facebook, and the Magic (again), Heat, and Pistons (again) on Twitter. Combined, the Heat rank last. Just behind the Nuggets and Magic. All of the numbers above are “pure” (not adjusted for wins/TV time).
  • Content matters! Despite including a variety of variables in my calculations, a good portion of the variance has not been explained. My estimate right now is that roughly 20-30% of engagement depends on the actual content teams produce.
  • Average? On average, an NBA team generates 1.41 favorites and 1.31 retweets per 10,000 followers on Twitter. On Facebook, the league average is about 7.82 likes per 10,000 followers per post. Comments are much harder to get: on average, only one in about 200,000 followers will comment. Finally, per 10,000 followers, about 0.67 shares are generated.

The Language of Engagement

Figure 1. Average number of favorites and retweets across FCB Twitter accounts

Following my analysis of the languages spoken by #Copa100 fans on Twitter, somebody asked me: Does it even make sense to have language destinations if most people flock to the major account anyways? In other words: My resources are limited – so why put effort into crafting language-specific content when the majority of fans does not seem to care?

Good question.

The answer is: yes, language destinations make sense. A lot of sense.

And here is why: Although we don’t reach as many people with the additional accounts (the average “foreign language” account has about 63% fewer followers), the ones that we reach are usually more committed. And greater commitment means more engagement with our content — and ultimately a stronger bond with our brand. At least that’s the theory.

Are “international” fans really more engaged?

Take Bayern Munich, for example. Their main Twitter account (@FCBayern) has about 2.85 million followers. However, given the popularity and social significance of Bayern Munich in German society (games and player signings often serve as a token for conversation), many followers are likely to be less committed (read: average sports fans that just want to stay up-to-date) and therefore consume information rather passively. For many followers, Bayern Munich might only be their 2nd or 3rd favorite club that they turn to when the club plays internationally. Following (the entertaining) @FCBayernUS, on the other hand, requires more commitment to soccer in general and Bayern in particular, as the sport and club are not “mainstream topics” in the US. As a result, a more active audience should be expected. Similarly, fans of Chicharito Hernandez following the Spanish-language account of Bayer Leverkusen (@bayer04_es) should be more inclined to interact with content that is specifically tailored towards their interests.

Figure 2. FC Barcelona provides 9 language-specific Twitter accounts

But is that really the case? Testing my hypothesis, I compared a total of 14 language destinations, including those of two leagues (Bundesliga, MLS) and three clubs (Bayern Munich, Bayer Leverkusen, FC Barcelona). This is by no means a representative sample, but rather a purposive one. I chose Bayern mainly because of the “unusual” way they run @FCBayernUS. To engage fans in the US, the account features more entertaining content (informal language, GIFs, emojis, retweeting of user-generated content) than most “traditional” team accounts. In theory, this should result in greater engagement. Similarly, the Spanish Leverkusen account (started in 2015 after signing Chicharito) provides content tailored to his fans. Furthermore, I chose the official Bundesliga accounts (German and English) to assess how the expanded international TV deals (especially in the US) affect engagement. Similarly, I was interested in potential differences between the English and Spanish accounts of the @MLS. Finally, I added three @FCBarcelona accounts, just because the club is probably the most extreme example of creating language destinations (see Figure 2). Also: The club’s main account is in English rather than Spanish (all other clubs and leagues in the sample use their “native” language for the main account). And: In contrast to most other entities, all Barcelona accounts tweet the exact same content (with very few exceptions). In other words: They do not tailor content towards specific audience segments, which might reduce the benefit of language destinations. Here is what I found:

Language destinations show more engagement


  • Teams get more engagement than leagues. Fans identify with their favorite club – not necessarily the league the club plays in.
  • Language destinations out-perform the “original” account. For all entities in the sample, the language-specific accounts received more favorites and retweets per 10,000 followers. The most impressive numbers come from @FCBayernUS (7x more favorites and 10x more retweets than @FCBayern) and Leverkusen’s international destinations.
  • It is easier to like than to share: All accounts received more favorites than retweets. This supports the argument that a retweet/share should be valued higher than a favorite/like when evaluating social media metrics. Favoriting a tweet involves less commitment and effort than retweeting (and thereby endorsing) a tweet, and might be done for a different reason (e.g., archiving function, social token).
  • Content matters: Language-specific channels yield the biggest benefits when their content is specifically tailored towards the targeted audience segment. In other words, simply translating the “original” content is not enough. Language destinations designed around a specific purpose (e.g., a player, cultural engagement) tend to generate the most engagement.

Method: Some detail on the analysis

Data Collection: I accessed the Twitter API using the userTimeline function of the twitteR package in “R” to call up the timelines of the selected accounts. Using this method, Twitter limits the search to a relatively short period of time (usually between 1 and 3 weeks; however, I was able to go back to November 2015 for @Bayer04_es). Other methods (such as Pablo Barbera’s getTimeline function) allow downloading up to 3200 tweets, but showed inconsistencies for key variables during data collection. Therefore, I chose data quality over sample size and deferred the larger-scale analysis until later. Overall, I collected 6556 tweets across 14 accounts. The number of tweets per account ranged from a low of 88 (@MLS) to a high of 1639 (@FCBarcelona).

Analysis: Twitter provides two metrics that are commonly used as a proxy for user engagement by both industry and academia: favorites and retweets. Despite questions about the validity of these measures (e.g., does a favorite on Twitter really mean somebody engaged with your tweet – or is it a social currency acknowledging your relationship?) and uncertainties about their value (how much is a favorite worth – and how much more value should be attached to a retweet that actually increases your audience?), they a) still seem to be accepted as the industry standard, and b) are the ones I can easily measure automatically. To allow for direct comparison of all analyzed accounts, I normalized both engagement measures as averages per 10,000 followers. By doing so, @Bayer_EN (18k followers) and @FCBarcelona (17.8m followers) have a level playing field to compete on.

You can find some descriptive statistics here.

The #Copa100 tweets…

Figure 1. Top 5 languages spoken by individuals tweeting about the Copa America

…in Spanish. But Tweets are likely to come from U.S.

With the #CopaAmerica well underway and the #Euro2016 kicking off as I type, I had no choice but to dedicate this week’s episode of Professional Shenanigans to major international soccer tournaments and their representation on social media (i.e., Twitter).

After focusing on tweets about MLS teams and predicting their frequency by looking at characteristics of the state they originated in, I wanted this project to explore one of the most interesting aspects of major international sporting events: language. From a media perspective, managing the preferences of a multi-lingual audience is a challenge.

Knowing that soccer is a truly global sport, and realizing that social media is a low-cost but effective in-house solution for reaching this global audience, many teams have started internationalizing their Twitter content. For example, 14 out of 16 Bundesliga teams had at least two Twitter accounts, with one providing information in a language other than German (mostly: English). Some clubs even added further channels. The most notable examples are the fantastic @FCBayernUS, which popularizes the club in the US, and Bayer Leverkusen’s @bayer04_es (with more than 75k fans following updates on Chicharito).

Teams often have a clear rationale for creating specific language destinations (English as the accepted “lowest common denominator”, plus a language spoken by the fans of a major star playing for the team) — but how does a major tournament navigate this space? Just look at the #Euro2016: Almost every participating country has its own official language (in fact: I counted 19 languages for 24 participating teams).

The situation at the #Copa100 is far less dramatic. 13 out of 16 participating nations speak Spanish, so the priority should be clear. Then you add some English (for the USMNT as the host), as well as Portuguese (Brazil) — and you’re done. And that is exactly what the organizers did by creating three separate Twitter accounts. Interestingly, though, the main account (@CA2016 with more than 208k followers) is in English. Both “language destinations” trail considerably in followers (ES = 23k, PT = 1.5k), even though they were all established in April 2015 and tweet similar amounts of content. This is somewhat surprising, especially when looking at my language analysis (see Figure 1).

So the question becomes: Are Spanish-speaking Twitter users for some reason choosing the English account, or are the English-speaking Copa followers a silent majority? To find out, I looked at a subset of the @CA2016 followers (n=25,000; see Data Collection) and their language settings. Interestingly, the results mirror the distribution within tweets. The most frequently used language was Spanish (45%), followed by English (38%), French (2%), and Portuguese (2%). Arabic (1%) remained in the top 5. Overall, then, even though language destinations for Spanish and Portuguese exist, fans chose the English account.
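Tallying the language settings of a follower sample boils down to counting and converting to percentages. The sketch below assumes each follower’s language setting is available as a short code; the input list is synthetic and was constructed only to mirror the shares reported above, with everything else lumped into “other”.

```python
from collections import Counter


def language_shares(languages):
    """Return {language: rounded percentage}, ordered from most to least frequent."""
    if not languages:
        return {}
    counts = Counter(languages)
    total = len(languages)
    return {lang: round(100 * n / total) for lang, n in counts.most_common()}


# Synthetic sample of 100 followers mirroring the reported distribution.
followers = ["es"] * 45 + ["en"] * 38 + ["fr"] * 2 + ["pt"] * 2 + ["ar"] * 1 + ["other"] * 12
shares = language_shares(followers)
print(shares)
```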

Why: Identity? Quality? Originality?

Figure 2. Country of origin for subset of tweets including geo-location (n=280)

There are several potential explanations for fans’ preference for following a major sporting event in English rather than the language that they self-identify as native (at least based on their Twitter account). First, fans following international sporting events on social media (and especially in soccer) might be more likely to be fluent in English – at least enough to feel comfortable consuming English-language content. Many fans are probably following their national teams’ stars in foreign leagues (MLS, Premier League, Bundesliga), and therefore consume English-language content to stay up to date. In addition, these highly engaged fans might also perceive the content on the “bigger” @CA2016 account to be more original and of higher quality (as the tournament is played in the U.S.).

At the same time, many fans with Spanish as their main Twitter language might actually live in the U.S. To test that hypothesis, I analyzed a subset of #Copa100 tweets that included geo-location (unfortunately, only 2.5% of tweets did). Still, the largest share of these tweets (~27%) originated in the U.S. (see Figure 2), lending some support to the argument that the Spanish-speaking population in the U.S. might follow the Copa in English.

As always: There is much more to analyze and I will use the #Copa100 and #Euro2016 as my playground for the weeks to come. Ideas, suggestions, and feedback are welcome!

How: This is how I got the data

Data collection: I accessed Twitter’s Streaming API using the streamR package in R*. Search was limited to tweets mentioning either one of the two most popular hashtags (#Copa100, used by the official Copa America account; #CopaAmerica, pushed by Twitter) or the term “Copa America”. I ran three rounds of initial data collection, limiting each search to tweets published within a 30-minute time frame. Data was collected during the early afternoon hours, trying to avoid live updates during games (that’ll be a different study). The three searches returned a combined 10,388 tweets.
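To make the selection rule concrete, here is a rough local approximation in Python. It is only a sketch: the Streaming API does its own matching on the server side and its rules may differ, and the function name and example tweets are made up.

```python
# The three search terms from the data collection, lower-cased for matching.
SEARCH_TERMS = ("#copa100", "#copaamerica", "copa america")

def matches_search(text, terms=SEARCH_TERMS):
    """Return True if the tweet text contains any search term (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)

# Made-up example tweets.
tweets = [
    "Vamos! #Copa100",
    "Watching the Copa America tonight",
    "Great weather in Philadelphia today",
]
print([t for t in tweets if matches_search(t)])
```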

Of note: Even though data collection took place in the early afternoon while no games were played, the results are likely to reflect the scheduling of the tournament to a certain extent. Fans of the teams that had played the night before (Brazil vs. Haiti & Ecuador vs. Peru) or later that day (Uruguay vs. Venezuela & Mexico vs. Jamaica) should be more likely to talk about the tournament on Twitter during data collection.

For the second part of the analysis, I used the twitteR package to access information about the 25k most recent followers of the @CA2016 account and evaluated the most frequently set language within the returned accounts. Most followers (92.8%) were not verified, and therefore likely not media or athlete accounts, with an average of 838 followers.

* If you’re interested in learning more about the package and how to use it to scrape tweets, I strongly recommend the work(shops) of Pablo Barbera

This is where the MLS gets the most Twitter love – and why!

MLS Tweets per State
Figure 1. Number of tweets about MLS teams per state.

The first question we have to answer on our way to MLS wisdom is: How do you measure the popularity of the MLS in a given state? There is a plethora of approaches, but given my research focus on Twitter, my recent experiments with data collection via R, and my desire to relate social media and real world data, I settled on Twitter mentions of MLS teams. No, this is not the perfect measure. Not even the best one, to be honest. Twitter users are not a representative sample of the overall population, and not even of the average MLS fan (YouTube, Facebook, and Instagram have higher adoption rates among MLS fans). Still, it is a measure that I am interested in. Twitter is highly relevant in sports. In fact, Twitter has the highest growth rate among social media platforms for MLS fans (not counting Snapchat where I didn’t have data). As a result, Twitter is extremely relevant for marketing purposes and users are actively pursued and engaged by the industry.

Figure 2. Where MLS teams get their Twitter engagement (adjusted for population)?

Data: So tweets it is. But how do we get them? To collect a sample suitable for analysis, I accessed the Twitter API via R and pulled all tweets using the @username of any of the 20 MLS teams for 300 seconds per team. I went with a team-centered approach here, because I assume greater engagement with teams compared to leagues. You’d rather say “I’m a fan of PhilaUnion” instead of saying “I’m a fan of the MLS”, right? We attach to people and teams more than we do to leagues – especially when we’re trying to interact. Similarly, using @usernames as a selection criterion instead of simple team mentions not only helped to reduce unwanted data (somebody mentioning the city or team name in an unrelated context), but also ensured that I reached a highly involved audience (you need to care about the team to know and use the @username). This resulted in ~250,000 tweets overall, varying slightly across teams. However, given my interest in comparing MLS teams’ popularity on Twitter across states, I only retained those tweets that contained geolocation. This turned out to be around 10% of all tweets and reduced the final sample to 25,307 tweets. As expected, the number of tweets per state varied heavily, from a low of 24 in Wyoming to 3,255 in California (see Figure 1), and between teams (see Figure 2).
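The filtering step can be sketched in a few lines of Python. This is an illustration only, not the R code I used: the `state` key, the function name, and the toy records are assumptions, and a real tweet would first need its coordinates mapped to a state.

```python
from collections import Counter

def tweets_per_state(tweets):
    """Count tweets per state, keeping only tweets that carry a state label."""
    located = [t for t in tweets if t.get("state")]
    counts = Counter(t["state"] for t in located)
    share = len(located) / len(tweets) if tweets else 0.0
    return counts, share

# Made-up records; `state` is missing or None when a tweet has no geolocation.
tweets = [
    {"team": "@PhilaUnion", "state": "PA"},
    {"team": "@TeamA", "state": "CA"},
    {"team": "@TeamA", "state": None},
    {"team": "@TeamB", "state": "CA"},
    {"team": "@PhilaUnion"},
]
counts, share = tweets_per_state(tweets)
print(counts.most_common(), f"{share:.0%} geolocated")
```

The returned share corresponds to the roughly 10% of tweets that survived the geolocation filter in the real sample.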

Analysis: I ran an Ordinary Least Squares (OLS) regression model predicting tweets per state from a set of variables (see discussion below) in SPSS. I won’t go into all of the details here, but the model worked well, explaining about 99% of the variance in tweet volume per state (F = 469.61, df = 11, p < .001).
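For readers who want to see what such a model computes, here is a minimal OLS sketch in plain Python. It is not the SPSS procedure: it solves the normal equations directly, the two predictors and all numbers in the toy data are made up, and the actual model used more variables.

```python
def solve(a, b):
    """Solve the linear system a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(rows, y):
    """Fit y = b0 + b1*x1 + ... by least squares; return coefficients, R^2 and F."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    df_model, df_resid = k - 1, len(y) - k
    f_stat = (r2 / df_model) / ((1 - r2) / df_resid)
    return beta, r2, f_stat

# Made-up toy data: (population in millions, share of professionals in %) per state.
predictors = [(0.6, 35), (2.9, 40), (6.8, 45), (12.8, 38), (19.7, 44), (39.1, 41)]
tweet_counts = [30, 150, 420, 700, 1250, 3200]
beta, r2, f_stat = ols(predictors, tweet_counts)
print(f"R^2 = {r2:.3f}, F = {f_stat:.1f}")
```

R² is the share of variance in tweet counts that the predictors account for, and the F statistic compares that explained share to the unexplained remainder, each scaled by its degrees of freedom.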

Results. Or: What predicts the popularity of the MLS?

To create a (somewhat) level playing field for our analysis, we need to account for some differences among states that would otherwise skew the results. First and foremost: population size. The more people live in a given state, the more people can (at least theoretically) tweet about the MLS. Take Wyoming, for example. Each of its 586,107 residents would have to be much more active on Twitter to reach the same number of tweets produced by the 39,144,818 people living in California. To address this issue, I entered population size (obtained from the U.S. Census) as a variable in the analysis.
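A quick back-of-the-envelope calculation shows how much population matters. The sketch below uses the tweet counts and population figures quoted in this post; note that dividing by population is only an illustration, since in the model itself population enters as a predictor rather than as a divisor.

```python
def tweets_per_100k(tweets, population):
    """Return the number of tweets per 100,000 residents."""
    return 100_000 * tweets / population

# (geolocated tweets, residents) as quoted in this post.
states = {"Wyoming": (24, 586_107), "California": (3_255, 39_144_818)}
for name, (tweets, population) in states.items():
    print(f"{name}: {tweets_per_100k(tweets, population):.1f} tweets per 100k residents")
```

In raw counts California has more than 130 times as many tweets as Wyoming; per resident, the gap shrinks to roughly a factor of two.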

Population size alone, however, is not enough. Just imagine a scenario in which all residents of California, for some hypothetical reason, had no access to the Internet. Then Wyoming would suddenly look pretty active on social media, right? So I looked up the percentage of residents in each state that has high-speed Internet access (again, the U.S. Census Bureau thankfully provides this information). As expected, Internet access matters. States with lower Internet penetration (e.g., Mississippi and Alabama with ~65%) have fewer tweets mentioning MLS teams than states with higher Internet penetration (e.g., New Hampshire and Massachusetts with more than 85%). It was not the strongest predictor in the model, but it surely matters, especially when we consider that tweeting is an inherently mobile activity (83% of users are active via mobile devices) and that mobile Internet adoption statistics might be even more relevant in a potential follow-up.

Liverpool fans celebrate at Anfield
Figure 3. European soccer culture: Liverpool F.C. fans celebrate at Anfield

Having these two potentially confounding variables accounted for, it is time to move on. When thinking about the people behind the tweets, you also think about the history and current state of soccer itself. Soccer is often regarded as a somewhat “foreign” sport to many Americans that attracts more of an international audience. And when I say “international”, I’m thinking of two regions in particular: Europe and Latin America. Both are generally regarded as soccer-crazy (dominant soccer culture, all World Cup winners are either from Europe or South America), and relevant to the U.S. market based on immigration patterns. So might a state with more immigrants from one or both of these regions tweet more about the MLS? Yes and no. I went back to the Census data and looked for the number of people residing in each state who were not U.S. citizens at birth but came from either Europe or Latin America (South America, Central America, Mexico, and the Caribbean). Then, I calculated the proportion of the overall state population that is made up of Europeans (ranging from ~0% to ~4%) and Latinos (ranging from ~0% to ~15%). Despite both regions’ undeniable soccer craze, the number of Europeans living in a given state does not matter for the number of MLS tweets. Latin America’s love for the game, though, definitely translates to the U.S. The more individuals from South America, Central America, Mexico, and the Caribbean reside in a given state, the more tweets about the MLS originate in that state. But why is this the case? Maybe Europeans are hanging on to their “home clubs” (and have more opportunities to follow them on TV), while the Latin American community has started embracing the MLS (due to a greater influx of players from the region?). But that’s just a first guess.

Another indicator that might explain the MLS’ popularity on Twitter in a given state is the percentage of students playing soccer in high school. My rationale here goes something like this: If soccer becomes an integral part of young people’s lives (and I take playing soccer in high school as a very crude proxy for this process), then it might lead to greater interest in the game later in life (and presumably some interest in the MLS that would be measurable in terms of teams’ Twitter mentions). To put a number on this, I went to the National Federation of State High School Associations and downloaded their complete “participation statistics”. Then I calculated the proportion of soccer players among all active high school athletes in each state. For example, in Virginia and Massachusetts about 16% of high school athletes play soccer, while in South Dakota that number drops to about 4%. Sounds interesting, right? Right. But it has no relation to the number of tweets mentioning MLS teams whatsoever.

Figure 4. Overall, only 2 percent of high school athletes are awarded some form of athletics scholarship to compete in college. Source: NCAA.

Let’s move on. Anecdotal evidence suggests that soccer is becoming a “white collar” sport in the U.S. (as strange as this may sound). In fact, 39% of MLS fans have a reported household income of more than $100,000, compared to 25.2% of NFL fans. Similarly, kids that keep playing soccer (throughout high school and college) tend to come from higher socioeconomic backgrounds. At least that’s what youth coaches keep telling me. According to their accounts, children from lower socioeconomic backgrounds drop out of soccer at a higher rate when other, more competitive options (especially football and basketball in high school) arise. The rationale behind this is two-fold. First, soccer is still considered the less lucrative career choice by many Americans. However, this is (by most accounts) false. Yes, there are more football scholarships than soccer scholarships at the university level – and they usually “pay” better. So this is true. But: The chances of getting one of these scholarships and moving from high school to NCAA competition are equal for both sports (about 6%). The same is true for the chance of moving from NCAA to the professional leagues (2%). So there is no difference (see Figure 4). And this only includes the U.S. When you look at the global picture (and whenever you talk about soccer, you should), there are far more professional soccer players than football players. The second argument for the potentially skewed demographics of soccer players is closely related, but taps more into the cultural dimension of football: The higher the socioeconomic status, the greater the concerns about health/safety of children and the willingness to give up the culture of football. No matter if you personally agree with this, adding a variable tapping into this dynamic seems to make sense.
Actually, I used two: The average household income per state and the proportion of “professionals and managers” among the workforce per state (both based on Census data and neatly compiled by the Kaiser Family Foundation). Both measures are obviously related (managers tend to earn more than service workers), but they are still different enough to be in the same model. Average household income ranged from ~ $35,000 in Mississippi to ~ $75,000 in Maryland; with 33% professionals and managers in (again) Mississippi to 57% in D.C. Long story short: Results of the regression model indicate further support for the “soccer = more wealth”-hypothesis. Even though household income itself failed to reach statistical significance, the proportion of professionals and managers in a given state was one of the most powerful predictors of MLS team mentions on Twitter. This is an interesting dynamic that certainly warrants further investigation — especially considering the cultural and market-related implications. We have to consider, though, that Twitter users, in general, tend to be more educated and wealthy. So it remains to be seen if this dynamic is unique to soccer or a general pattern.

When you talk about sports, you might also talk about the physical constitution of fans. Why? Well, you could argue that people who love soccer would also want to play soccer and therefore need to be in (somewhat) decent physical condition. As a result, a state with a higher number of “fit” residents could tweet more about the MLS. Unfortunately, you can also argue the exact opposite. If you believe all soccer fans look like THIS, you might also believe that they only have the time to tweet about the MLS in the first place because they’re not capable of playing themselves. To find out which argument (if any) finds support, I went to the Centers for Disease Control and Prevention (CDC) website and pulled state-by-state data on obesity (percentage of residents with a BMI of 30 or higher) and physical exercise (percentage of residents who participated in any physical activities or exercise during the past month). The results are somewhat mixed. Unfortunately for me as a soccer fan, we seem to look more like THIS. The less fit a given state, the more MLS tweets are sent. Even though obesity rate by itself did not reach statistical significance (p = .07, with p < .05 being the accepted standard for significance), physical activity did (big time): The more “active” people there are in a state, the less they tweet about the MLS. To be honest, I am not sure what this means. However, it tells me to continue my research into the relationship between sport, social media, and health (two studies underway).

Local tweets about MLS team Philadelphia Union
Figure 5. Geo-information of “local” tweets about MLS team Philadelphia Union

Finally, I turned to geographical proximity. More specifically, I asked: Does having an MLS team in “your” state make you more likely to tweet about the team? It makes sense to believe that a greater percentage of core fans (assumed to be the most active) would live in somewhat close proximity to their team. This might be especially true if the soccer team is the only major sports franchise around. For example, a sports fan in the state of California has to decide between 18 Big Six (NFL, MLB, NBA, NHL, MLS, CFL) teams, whereas a sports fan in Oklahoma only has one major franchise to focus on. Looking into this dynamic, I compiled a list of Big Four teams per state as well as the number of MLS teams per state and entered both as variables into the model. Unfortunately, neither mattered for the number of MLS tweets. This could mean that a) MLS fans don’t care about proximity or competition, or that b) my measure is off. I’d probably go with b). I have to admit that the “franchise per state” variable is not the best measure of either proximity or competition. Take Kansas City, for example. Just because the Chiefs, Royals, and Sporting KC are on the Missouri side of the city does not mean that no one in Kansas cares about them. In fact, many people in Kansas (such as residents of Topeka) live in closer proximity to these teams than lots of people living in Missouri. The same is true for competition. A more detailed analysis is in order. Some initial investigations (see Figure 5) indicate at least a certain concentration around the teams’ locations.

So: What do we learn from this?

We know that California tweets a lot about the MLS. And we know that this is largely due to the state’s large population and high Internet adoption. These two factors are more or less given and don’t provide many actionable insights when looking at this from a marketing standpoint (teams can hardly shove more residents into a given state or provide high-speed Internet access in rural areas). However, we also know that the influence of Latin American fans is huge. Reaching out to this demographic and fostering opportunities to connect (maybe even intensifying the recruitment of players from Latin America or – even better – Latin American immigrants) could be highly rewarding. In addition, there seems to be a strong connection between the “white collar” population of a state and soccer tweets. Exploring this relationship further might also yield potential angles for marketing campaigns directly aimed at this demographic. On the flip side, this also means that “educational” efforts highlighting the “benefits” of soccer to currently less affectionate demographics might be a way of increasing fandom. However, as a final disclaimer, all of these results and interpretations are speculative/exploratory in nature and should only serve as a source of inspiration and foundation for further inquiry. Due to the nature of the underlying data, causation cannot be inferred.

What else? Your feedback here! I am always looking for feedback and ideas to investigate further. So if you have suggestions, please let me know.

Shenanigans? Shenanigans!

Shenanigans [SHəˈnanəɡənz], professional. The Oxford Dictionary describes shenanigans as “silly or high-spirited behavior; mischief”. Although the activities described in this section are certainly not meant to be “secret or dishonest”, they are in most cases not what I would usually publish as an academic. Instead, they often are the result of a high-spirited or slightly mischievous idea, a general curiosity, or questions about sport, (social) media, and society that have come up at one point or another. If you have ideas/suggestions or would like to see any shenanigans applied to your commercial or academic endeavor, please don’t hesitate to contact me.