Counting your followers. How Medium counts your followers and… | by An Vu | Nov, 2020


How Medium counts your followers, and the backfills that disrupted it

Photo: Crissy Jarvis on Unsplash

recent wave of users have questioned the validity of the displayed number of followers on their profiles. At one point, their follower number increased, some even by the thousands, but then a few weeks later went back down to around where the user originally remembered. Some users were concerned: Were their profiles hacked? Did a wave of users suddenly follow and then unfollow them? The answer is no—it was a data problem.

When Medium started to shift toward more relational content, we noticed that some users were missing authors in their “Latest From Following” section, which should show the followed collections and authors that had recently published. Since this wasn’t true for all users, we had to look into the data that drives this section of Medium.

Upon investigation, we noticed a discrepancy in a key column that the “Latest From Following” section uses that the rest of Medium does not. In turn, a backfill had to be done so the key column would properly reflect a user’s followings. After the backfill finished, users started reporting an uptick in followers and the number of people they have followed, even though they had not followed any new users. That’s when we noticed that the original backfill had done its intended purpose while also breaking it.

A backfill is supposed to be the process in correcting data or making the data whole without modifying its existing core components. The first backfill did fix the key column needed for the table that “Latest From Following” depended on, but it also ended up modifying another table that contained the displayed number of followers on a user’s profile. Due to this error, another backfill had to be done to bring the numbers back to the users’ real amount of followers and followings.

To get some context, let’s do a deep dive of what currently happens when you follow a user, the error that prompted the need for the backfill, what was done for the first backfill and why it failed, and the second backfill that fixed it.

Three things ultimately happen when you press that green button:

  1. You create or recreate a Medium relationship with that user. In this context, a Medium relationship is an indication of whether or not a user is following another, regardless of which platform the follow originated from. (More on this later.)
  2. You increase the number of users you follow.
  3. You increase the number of followers for the followed user.

Behind the scenes

There are two tables in AWS DynamoDB that formulate the data: the user relationship table and the user stats table.

The user relationship table has the following columns:

  • user_id (the user doing the following)
  • target_user_id(the writer who was followed)
  • twitter_followed_at(the timestamp of import from Twitter)
  • facebook_followed_at(the timestamp of import from Facebook)
  • medium_followed_at(the timestamp of follow on Medium platform)
  • medium_unfollowed_at(the timestamp of unfollow on Medium)
  • latest_followed_at (the aggregation of all the timestamp columns)

The user stats table has the following columns:

  • user_id (the user in question)
  • followers (the number of users the given user follows)
  • followees (the number of users who follow the given user)

Current scenario

When a user chooses to follow an author, an entry is created for that relationship, if it doesn’t exist already, in the user relationship table. This relationship would fill the current time into the medium_followed_at column, as well as the latest_followed_at column. Additionally, in the user stats table, the number of followers for the author and the number of follows for the user who clicked the follow button is incremented by one.

When a user chooses to unfollow an author, the existing entry for the relationship is updated. In the user relationship table, latest_followed_at is set to zero, while medium_unfollowed_at is set to the time the user unfollowed. In the user stats table, the number of followers for the author and the number of follows for the user who clicked the follow button is decremented by one.

The code for the logic looks something like the following:

putUserRelationship(
source_user_id: string,
target_user_id: string,
social_type: string
created_at: number
): Promise<boolean> {

// setting up the data
const relationship: Object = {
user_id: user_id,
target_user_id: target_user_id,
}
switch (social_type) {
case "TWITTER":
relationship.latest_followed_at = created_at
relationship.twitter_followed_at = created_at
case "FACEBOOK":
relationship.latest_followed_at = created_at
relationship.facebook_followed_at = created_at
case "MEDIUM":
relationship.latest_followed_at = created_at
relationship.medium_followed_at = created_at
case "UNFOLLOW":
relationship.latest_followed_at = 0
relationship.medium_unfollowed_at = created_at

// This function puts the data into the user relationship table
return this._socialData
.putUserRelationship(
relationship,
)
//this part updates the user stats table
.thenBound(this._updateUserCounts, this)
.then((): boolean => {
return true
})
}

The function takes in the user_id, target_user_id, the social_type (Twitter, Facebook, Medium, or Unfollow), and the current time as an epoch timestamp, which is essentially the number of seconds elapsed since January 1, 1970, at midnight (00:00:00) in the UTC time zone. The function then ascertains which user relationship table columns to fill based on the social type and places the data into that table. Afterward, it takes the same user_id and target_user_id to updates the users’ followees and followers, respectively.

Prior to 2016, users had the ability to import their Facebook friends and Twitter follows to Medium. Upon import, a relationship is created for the two users and the related columns of twitter_followed_at and facebook_followed_at were updated to the time of import. The user stats were also incremented by the number of imports. At this point in time, the latest_followed_at column did not exist, and Medium relationships were calculated in the backend by grabbing the largest timestamp from twitter_followed_at, facebook_followed_at, and medium_followed_at, and making sure the aggregated timestamp is greater than medium_unfollowed_at. The code to calculate whether or not the user of the given user_id is following the given target_user_id in the relation looks like the following:

isFollowingRelation(relation) {  
const followedAt = Math.max(relation.medium_followed_at,
relation.twitter_followed_at,
relation.facebook_followed_at)
|| 0
return !relation.medium_followed_at ||
followedAt > relation.medium_unfollowed_at
}

This function takes in a relation that has all the data for a single Medium relationship in the user relationship table (user_id, target_user_id, twitter_followed_at, facebook_followed_at, twitter_followed_at, medium_unfollowed_at) and returns True if the user in question (user_id) is following the given author (target_user_id) or False if they aren’t. To this day, Medium still uses the code above to aggregate the user relationships that are shown on your follow pages. The top number, indicating number of followers, is taken from the user stats table while the list of followers is taken from the user relationship table.

Ev’s Medium Followers

Following the removal of Twitter and Facebook imports, the engineering team needed a faster way to keep track of Medium relationships. Thus, latest_followed_at was born and is supposed to be the source of truth for a Medium relationship; the single column would make it easier to ascertain whether or not the relationship between two users exist.

With this new column, a backfill was ran to populate it. latest_followed_at was supposed to be populated with the max of the followed_at timestamps (twitter_followed_at, facebook_followed_at, medium_followed_at) or zero. latest_followed_at is zero only if the max of the followed_at timestamp is less than medium_unfollowed_at.

Some examples:

  • John imported their following of Jane from Twitter on November 21, 2020, at 5:30PM UTC. This means their twitter_followed_at is set to November 21, 2020, at 5:30PM UTC. Their latest_followed_at would be on November 21, 2020, at 5:30PM UTC.
  • Joe imported their following of John from Facebook on November 20, 2020, at 10:34AM UTC but then unfollows John on November 21, 2020, at 3:12PM UTC. This means their facebook_followed_at is set to November 20, 2020, at 10:34AM UTC, and medium_unfollowed_at is set to November 21, 2020, at 3:12PM UTC. With the new column, latest_followed_at is set to zero since twitter_followed_at and medium_followed_at are zero by default and medium_followed_at is greater than facebook_followed_at.
  • Jane imported their following of Joe from Twitter on November 10, 2020, at 8:45AM UTC, then unfollows Joe on November 20, 2020 at 10:50AM UTC, and finally refollows Joe on November 20, 2020 at 12:00PM UTC. This means their twitter_followed_at is set to November 10, 2020, at 8:45AM UTC; medium_unfollowed_at is set to November 20, 2020, at 10:50AM UTC; and their medium_followed_at is set to November 20, 2020, at 12:00PM UTC. With the new column, latest_followed_at is set to the value of medium_followed_at. The max of the followed_at timestamps is medium_followed_at, and since medium_followed_at is greater than medium_unfollowed_at, latest_followed_at is set to medium_followed_at.

While developing the new interface, we noticed that “Latest From Following” was missing authors or had authors that shouldn’t be there. With the new relational Medium, we became heavily reliant on the user relationship table and the latest_followed_at field and realized that’s where the problem was. While the relationships existed, the data was not properly consolidated; some of the Twitter, Facebook, and Medium followed_at fields were not ported over to the latest_followed_at field, and some unfollows were not properly reflected. This may have been because the backfill to fill the column was incomplete, there might have been some migration issue where the backfill was run before the social media imports were turned off, and maybe there was a backend code hiccup where the unfollows didn’t zero out the latest_followed_at field. Regardless, with this finding, a backfill was in order to set the column straight.

To start the backfill, we needed to query the existing data and find the Medium relationships that needed to be fixed. Given that the data is found within DynamoDB, querying millions of entries on nonindexed columns is next to impossible. As such, we have hourly batch jobs that port the data over into Snowflake, our data warehouse. With the data in Snowflake, querying the data becomes less of a burden.

Getting the data

First, we queried the database for all user/author pairs that didn’t have the Twitter, Facebook, or Medium follows ported over into the latest_followed_at column. We were looking for any relationships where latest_followed_at is set to zero even though the max of the followed_at timestamps is greater than medium_unfollowed_at. The query looked like the following:

with new_follow_updates as (
select
user_id,
target_user_id,
medium_unfollowed_at,
GREATEST(
twitter_followed_at,
facebook_followed_at,
medium_followed_at) as followed_at
from user_relations
where latest_followed_at = 0
and followed_at > 0
)
select user_id, target_user_id from new_follow_updates where followed_at > medium_unfollowed_at;

We ran a second query to find all the user/author pairs that didn’t have the proper unfollow relationships. The erroneous relationship would be one where medium_unfollowed_at is greater than the max of the followed_at timestamps but latest_followed_at is not zero. The query looked like the following:

select 
user_id,
target_user_id
from user_relations
where
medium_unfollowed_at > 0
and latest_followed_at > 0
and medium_unfollowed_at > latest_followed_at;

The queries returned more than 180M rows combined, and the resulting datasets of the user_id and target_user_id were sent to S3 as comma-separated value (CSV) files that we will then use for backfilling the data.

Backfill workflow.

Each row in the CSV files is sent to Amazon’s Simple Queuing Service (SQS) as an individual message. The messages are then picked up by our event processors. The event processor takes in theuser_id and target_user_id and grabs the relation from the database. The processor does another computation of the latest_followed_at to make sure what we found in Snowflake is still the case in Dynamo. Based on the value of the calculatedlatest_followed_at, the processor utilized the existing Medium codebase and called the function mentioned above (putUserRelationship) to update the user relationship table appropriately.

Problem

The function we used in the backfill had the following return statement:

return this._userSocialData
.putUserRelationships(
relationships,
)
.thenBound(this._updateUserCounts, this) <--- Problem call
.then((): boolean => {
return true
})

This updates the user relationship table and subsequently takes the user_id and target_user_id, which we had just performed on, and updates their follow counts via the function call this._updateUserCounts.

This means that for the 180M relations we had updated, we had also accidentally incremented that many follow stats for Medium users.

Even though the user relationship table originally had discrepancies in data, the user stats table did not. With the completion of this backfill, we had corrected the errors in user relationship table but introduced errors in the user stats table. We had flipped the tables, so to speak.

With the newly created issue, we had to run another backfill to rectify the data. Since the user relationship table in Snowflake was updated hourly to properly reflect the table in DynamoDB and it had been a couple weeks since the initial backfill, the list of the original 180M relations that we had fixed were lost.

So, what did we do? We couldn’t take the number of followers and followees from the user relationship table at face value since they can be ephemeral; it does not account for any possible new follows or unfollows that may happen between the batch job into Snowflake and the time the backfill updated the data in Dynamo. If we had taken the numbers from the user relationship table as is, we would have potentially created more errors in the user stats table. Instead, we took the difference of followers and followees between the user relationship table and user stats table and added the difference to the user stats table in the backfill. By adding the difference to the data to the user stats table, we will undo the accidental increments from the first backfill. To simplify, it would look something like:

// This data will go into s3
followees_diff = user_stats_followees - user_relationships_followees
followers_diff = user_stats_followers - user_relationships_followers
// This calculation is done in the processor and goes into the database
final_followers = user_stats_followers + followers_difference
final_followees = user_stats_followees + followees_difference

The query to find the difference looked something like the following:

with followees as (
select
user_id,
count(*) as followees
from user_relations
where
latest_followed_at > 0
group by 1
),
followers as (
select
target_user_id as user_id,
count(*) as followers
from user_relations
where
latest_followed_at > 0
group by 1
),
differences as (
select
user_id,
fe.followers as followers,
fr.followees as followees,
fe.followers - us.followers as diff_followers,
fr.followees - us.followees as diff_followees
from user_stats us
join followers fe using(user_id)
join followees fr using(user_id)
where
abs(diff_followers) > 0
or abs(diff_followees) > 0
order by diff_followers, diff_followees desc
)
select
user_id,
diff_followers,
diff_followees
from differences

Essentially this query finds the true number of followers and followees for a user from the user relationship table based on the fixed latest_followed_at and subtracts the current user stats from the user stats table. The results for a user who had their number of followers increased from the first backfill might look like this:

user_id | diff_followers | diff_followees
------------------------------------------
abcde123| -10324| 0

This means when the user is passed to the second backfill, their number of followers in the user stats table will decrement by 10,324.

Similar to the initial backfill, this one goes through the same workflow. For this backfill, the user_id, number_of_followers, andnumber_of_followees are sent as a message, and the processor will take the three fields and update the user stats table. Again, we used existing code that would update the table with the delta. We made sure there were no follow-up callbacks this time around.

With the completion of the second backfill, all numbers on a user’s profile now properly reflect their relationship with other users.

The main learnings from this:

  • Thoroughly look through your code to make sure your intent is going to be executed.
  • Save the data prior to the backfill as a keepsake of your adventures (and in case anything goes awry).
  • Double check that your backfill fixed/completed the data as intended.

Go through your draft; check your code. Double check that your intent is clear and well executed. If the existing function putUserRelationship was not used the way it did, the false increase in the user stats table would not have happened (beware of callbacks!) and the second backfill would not have been warranted.

Save the shitty first draft, the data prior to the backfill. This data is what you worked from and what you’re trying to improve. Keeping it for after the backfill completion allows for data comparison; we want to make sure all the rows in the first draft are modified after the backfill. With our backfill endeavor, the initial data would have been helpful with the second backfill and to calculate the difference without pulling in data from the user stats table.

Do a final look through; make sure the backfill has completed/fixed the data as intended. One reason we had to run the backfill that conked the numbers was because the initial backfill to complete the “new column” latest_followed_at was incomplete. There were some relationships where the latest_followed_at column did not properly reflect the aggregation of the other timestamp columns.

Treat your backfill like you do your Medium posts: Write, check, save, and check one last time. Make sure to hit the “follow” button and watch the logic in action.



Source link