Real-World Production Issues with Recommendations
Most books and articles on recommendation focus purely on the technical aspects of recommendation systems. This book is about pragmatism, so there are some practical issues to discuss when it comes to running recommendation systems in production. A few of these topics are covered in this section: performance, ETL, user experience (UX), and shills/bots.
One of the most popular algorithms, as discussed, is O(n_samples^2 * n_features), that is, quadratic in the number of samples. This makes it very difficult to train a model in real time and get an optimal solution. Therefore, in most cases training a recommendation system needs to occur as a batch job, unless you use tricks such as a greedy heuristic and/or creating recommendations only for a small subset: active users, popular products, etc.
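To make the quadratic cost concrete, here is a minimal sketch that counts the pairwise comparisons needed for a given number of samples and shows how restricting the work to active users shrinks it. The helper names, the seven-day activity rule, and the user counts are assumptions chosen for illustration only.

```python
def pairwise_comparisons(n):
    """Number of distinct pairs among n samples: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def select_active(days_since_last_visit, max_days=7):
    """Keep only users seen within max_days (hypothetical activity rule)."""
    return sorted(user for user, days in days_since_last_visit.items()
                  if days <= max_days)


all_users = 10_000
active_users = 1_000  # assumed: 10% of users were active this week

print(pairwise_comparisons(all_users))     # 49995000
print(pairwise_comparisons(active_users))  # 499500
```

Cutting the candidate set by a factor of 10 cuts the number of pairs by roughly a factor of 100, which is why limiting recommendations to active users or popular products can move a job from hours toward minutes.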
When I created a user follow recommendation system from scratch for a social network, I found many of these issues came front and center. Training the model took hours, so the only realistic solution was to run it nightly. Additionally, I later created an in-memory copy of our training data, so the algorithm was bound only by CPU, not I/O.
Performance is a nontrivial concern in creating a production recommendation system in both the short term and the long term. It is possible that the approach you initially use may not scale as your company adds users and products. Perhaps a Jupyter Notebook, Pandas, and scikit-learn were acceptable when you had 10,000 users on your platform, but that setup may quickly turn out not to be a scalable solution.
Instead, a PySpark-based support vector machine training algorithm (http://spark.apache.org/docs/2.1.0/api/python/pyspark.mllib.html) may dramatically improve performance and decrease maintenance time. Later still, you may need to switch to dedicated ML chips like TPU or the NVIDIA Volta. The ability to plan for this capacity while still shipping initial working solutions is a critical skill for implementing pragmatic AI solutions that actually make it to production.
Real-World Recommendation Problems: Integration with Production APIs
I found that many real-world problems surface in production at startups that build recommendations. These problems are not as heavily discussed in ML books. One such problem is the “cold-start problem.” In the examples using the Surprise framework, there is already a massive database of “correct answers.” In the real world, you may have so few users or products that it doesn’t make sense to train a model. What can you do?
A decent solution is to have the recommendation engine follow three phases. For phase one, take the most popular users, content, or products and serve those out as recommendations. As more user-generated content (UGC) is created on the platform, for phase two, use similarity scoring (without training a model). Here is some “hand-coded” code I have used in production a couple of different times that did just that. First we have a Tanimoto score, also known as the Jaccard similarity coefficient.
"""Data Science Algorithms"""


def tanimoto(list1, list2):
    """tanimoto coefficient

    In : list2=['39229', '31995', '32015']
    In : list1=['31936', '35989', '27489',
        '39229', '15468', '31993', '26478']
    In : tanimoto(list1,list2)
    Out: 0.1111111111111111

    Uses intersection of two sets to determine numerical score
    """
    intersection = set(list1).intersection(set(list2))
    # Divide by the size of the union: |A| + |B| - |A and B|
    return float(len(intersection)) / (
        len(list1) + len(list2) - len(intersection))
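The function above takes the lengths of the input lists, so repeated IDs in either list would inflate the denominator. As a hedged alternative, the sketch below (a hypothetical `jaccard` helper, not part of the original module) deduplicates both inputs first and guards against two empty inputs.

```python
def jaccard(items_a, items_b):
    """Jaccard similarity of two collections, ignoring duplicate entries."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        # Two empty collections share nothing; treat the score as zero
        return 0.0
    return len(set_a & set_b) / len(union)


list1 = ['31936', '35989', '27489', '39229', '15468', '31993', '26478']
list2 = ['39229', '31995', '32015']
print(jaccard(list1, list2))  # one shared ID out of nine distinct IDs
```

For inputs without duplicates, such as the docstring example, both versions give the same score.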
Next is HBD: Here Be Dragons. Follower relationships are downloaded and converted into a Pandas DataFrame.
import os

import pandas as pd

from .algorithms import tanimoto


def follows_dataframe(path=None):
    """Creates Follows Dataframe"""
    if not path:
        path = os.path.join(os.getenv('PYTHONPATH'), 'ext', 'follows.csv')
    df = pd.read_csv(path)
    return df


def follower_statistics(df):
    """Returns counts of follower behavior

    In : follow_counts.head()
    Out:
    followerId
    581bea20-962c-11e5-8c10-0242528e2f1b    1558
    74d96701-e82b-11e4-b88d-068394965ab2      94
    d3ea2a10-e81a-11e4-9090-0242528e2f1b      93
    0ed9aef0-f029-11e4-82f0-0aa89fecadc2      88
    55d31000-1b74-11e5-b730-0680a328ea36      64
    Name: followingId, dtype: int64
    """
    follow_counts = (df.groupby(['followerId'])['followingId']
                     .count().sort_values(ascending=False))
    return follow_counts


def follow_metadata_statistics(df):
    """Generates metadata about follower behavior

    In : df_metadata.describe()
    Out:
    count    2145.000000
    mean        3.276923
    std        33.961413
    min         1.000000
    25%         1.000000
    50%         1.000000
    75%         3.000000
    max      1558.000000
    Name: followingId, dtype: float64
    """
    dfs = follower_statistics(df)
    df_metadata = dfs.describe()
    return df_metadata


def follow_relations_df(df):
    """Returns a dataframe of follower with all relations"""
    df = df.groupby('followerId').followingId.apply(list)
    dfr = df.to_frame("follow_relations")
    dfr.reset_index(level=0, inplace=True)
    return dfr


def simple_score(column, followers):
    """Used as an apply function for dataframe"""
    return tanimoto(column, followers)


def get_followers_by_id(dfr, followerId):
    """Returns a list of followers by followerID"""
    followers = dfr.loc[dfr['followerId'] == followerId]
    relations = followers['follow_relations'].tolist()
    # One row per follower, so unwrap the single list (empty if ID is unknown)
    return relations[0] if relations else []


def generate_similarity_scores(dfr, followerId, limit=10, threshold=.1):
    """Generates a list of recommendations for a followerID"""
    followers = get_followers_by_id(dfr, followerId)
    scores = dfr['follow_relations'].apply(simple_score, args=(followers,))
    # Keep scores above the threshold, then take the `limit` highest
    recs = scores.where(scores > threshold).dropna().sort_values()[-limit:]
    return recs


def return_similarity_scores_with_ids(dfr, scores):
    """Returns Scores and FollowerID"""
    dfs = pd.DataFrame(dfr, index=scores.index.tolist())
    dfs['scores'] = scores[dfs.index]
    dfs['following_count'] = dfs['follow_relations'].apply(len)
    return dfs
To use this API, follow this sequence.
In : from follows import *
In : df = follows_dataframe()
In : dfr = follow_relations_df(df)
In : dfr.head()
In : scores = generate_similarity_scores(dfr,
         "00480160-0e6a-11e6-b5a1-06f8ea4c790f")
In : scores
Out:
2144    0.000000
713     0.000000
714     0.000000
715     0.000000
716     0.000000
717     0.000000
712     0.000000
980     0.333333
2057    0.333333
3       1.000000
Name: follow_relations, dtype: float64

In : dfs = return_similarity_scores_with_ids(dfr, scores)
In : dfs
Out:
                                followerId
980   76cce300-0e6a-11e6-83e2-0242528e2f1b
2057  f5ccbf50-0e69-11e6-b5a1-06f8ea4c790f
3     00480160-0e6a-11e6-b5a1-06f8ea4c790f

                                       follow_relations    scores
980   [f5ccbf50-0e69-11e6-b5a1-06f8ea4c790f, 0048016...  0.333333
2057  [76cce300-0e6a-11e6-83e2-0242528e2f1b, 0048016...  0.333333
3     [f5ccbf50-0e69-11e6-b5a1-06f8ea4c790f, 76cce30...         1

      following_count
980                 2
2057                2
3                   2
This “phase 2” similarity score-based recommendation with the current implementation would need to be run as a batch API. Additionally, Pandas will eventually run into some performance problems at scale. Ditching it at some point for either PySpark or Pandas on Ray (https://rise.cs.berkeley.edu/blog/pandas-on-ray/?twitter=@bigdata) is going to be a good move.
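The hand-off between phase one and phase two can be sketched as a simple fallback: serve precomputed similarity results when a user has them, and otherwise serve the most popular accounts. This is only an illustration in plain Python; the function names, the `(follower, followed)` tuple layout, and the `similarity_recs` dictionary are assumptions, not part of the module shown earlier.

```python
from collections import Counter


def most_popular(follows, limit=3):
    """Phase one: rank accounts by how many followers they have."""
    counts = Counter(followed for _, followed in follows)
    return [account for account, _ in counts.most_common(limit)]


def recommend(user, follows, similarity_recs, limit=3):
    """Serve phase-two results if present, else fall back to phase one."""
    recs = similarity_recs.get(user)
    if recs:
        return recs[:limit]
    # Cold start: no similarity results yet, so use popularity instead,
    # skipping the user and anyone the user already follows
    already = {followed for follower, followed in follows if follower == user}
    popular = most_popular(follows, limit=len(follows))
    return [a for a in popular if a != user and a not in already][:limit]


# Hypothetical (follower, followed) pairs
follows = [("ann", "dave"), ("bob", "dave"), ("cat", "dave"),
           ("ann", "erin"), ("bob", "erin"), ("cat", "fred")]
similarity_recs = {"ann": ["fred"]}  # assumed output of a nightly batch job

print(recommend("ann", follows, similarity_recs))
print(recommend("zed", follows, similarity_recs, limit=2))
```

A brand-new user such as "zed" gets the popularity list, while "ann" gets the batch-computed similarity results.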
For “phase 3,” it is finally time to pull out the big guns and use something like Surprise and/or PySpark to train an SVD-based model and evaluate model accuracy. In the first part of your company’s history, though, why bother when there is little to no value in doing formal ML model training?
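To show roughly what training an SVD-style model and checking its accuracy involves, here is a pure-Python sketch of matrix factorization trained with stochastic gradient descent and scored by RMSE on held-out ratings. It is a simplified stand-in, not the Surprise or PySpark API: it omits the per-user and per-item bias terms that fuller implementations typically add, and the toy ratings, hyperparameters, and function names are all assumptions.

```python
import math
import random


def train_factors(ratings, n_factors=2, n_epochs=300, lr=0.01, reg=0.02,
                  seed=0):
    """Fit user and item factors with SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    mean = sum(r for _, _, r in ratings) / len(ratings)
    users = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)]
             for u in sorted({u for u, _, _ in ratings})}
    items = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)]
             for i in sorted({i for _, i, _ in ratings})}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            p, q = users[u], items[i]
            err = r - (mean + sum(a * b for a, b in zip(p, q)))
            for f in range(n_factors):
                p_f, q_f = p[f], q[f]
                # Gradient step with L2 regularization on both factors
                p[f] += lr * (err * q_f - reg * p_f)
                q[f] += lr * (err * p_f - reg * q_f)
    return mean, users, items


def predict(model, user, item):
    """Predicted rating; unknown users or items fall back to the mean."""
    mean, users, items = model
    p, q = users.get(user), items.get(item)
    if p is None or q is None:
        return mean
    return mean + sum(a * b for a, b in zip(p, q))


def rmse(model, ratings):
    """Root mean squared error of the model on (user, item, rating) triples."""
    squared = [(r - predict(model, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


# Toy data: u1 and u2 like i1/i2, while u3 and u4 like i3/i4
train = [("u1", "i2", 5), ("u1", "i3", 1), ("u1", "i4", 1),
         ("u2", "i1", 5), ("u2", "i2", 5), ("u2", "i4", 1),
         ("u3", "i1", 1), ("u3", "i3", 5), ("u3", "i4", 5),
         ("u4", "i1", 1), ("u4", "i2", 1), ("u4", "i3", 5)]
held_out = [("u1", "i1", 5), ("u2", "i3", 1), ("u3", "i2", 1),
            ("u4", "i4", 5)]

model = train_factors(train)
baseline = (model[0], {}, {})  # predicts the global mean for every pair
print("baseline RMSE:", rmse(baseline, held_out))
print("factor model RMSE:", rmse(model, held_out))
```

Comparing the held-out RMSE against a predict-the-mean baseline is a quick check of whether formal training is adding any value yet.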
Another production API issue is how to deal with rejected recommendations. There is nothing more irritating to users than to keep getting recommendations for things they don’t want or already have. So, yet another sticky production issue needs to be solved. Ideally, the user is given the ability to click “do not show again” for a list of recommendations; otherwise, your recommendation engine quickly becomes garbage. Additionally, the user is telling you something, so why not take that signal and feed it back into your recommendation engine model?
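One way to handle this, sketched below under stated assumptions, is to record each “do not show again” click and filter candidates against both the rejections and what the user already follows before serving them. The in-memory dictionaries and function names are hypothetical; a production system would persist this state.

```python
def record_rejection(rejections, user, item):
    """Remember that `user` clicked "do not show again" on `item`."""
    rejections.setdefault(user, set()).add(item)


def filter_recommendations(user, candidates, following, rejections):
    """Drop candidates the user already follows, has rejected, or is."""
    blocked = set(following.get(user, ())) | rejections.get(user, set())
    blocked.add(user)
    return [c for c in candidates if c not in blocked]


following = {"ann": ["bob"]}  # hypothetical existing follow relations
rejections = {}
record_rejection(rejections, "ann", "cat")

candidates = ["bob", "cat", "dave", "ann"]
print(filter_recommendations("ann", candidates, following, rejections))
# ['dave']
```

The same `rejections` mapping could later be exported as negative feedback when the model is retrained.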