Using AI to Interact with User-Generated Content

By Noah Gift
Sep 19, 2018


This chapter is from the book 

Pragmatic AI: An Introduction to Cloud-Based Machine Learning

Learn More Buy

Real-World Production Issues with Recommendations

Most books and articles on recommendation focus purely on the technical aspects of recommendation systems. This book is about pragmatism, so there are some practical issues to discuss as well. This section covers a few of them: performance, ETL, user experience (UX), and shills/bots.

One of the most popular algorithms, as discussed, has a time complexity of O(n_samples^2 * n_features), which is quadratic in the number of samples. This makes it very difficult to train a model in real time and get an optimal solution. In most cases, therefore, training a recommendation system has to occur as a batch job, unless you use tricks such as a greedy heuristic and/or generating recommendations only for a small subset: active users, popular products, etc.
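To make the quadratic cost concrete, the following sketch counts the pairwise comparisons a similarity-style computation would need. The user counts are illustrative assumptions, not figures from a real system, and `pairwise_comparisons` is a hypothetical helper.

```python
def pairwise_comparisons(n_samples):
    """Number of unordered pairs among n_samples items: n * (n - 1) / 2."""
    return n_samples * (n_samples - 1) // 2


# Hypothetical sizes: the full user base versus an "active users" subset.
all_users = 10_000
active_users = 500

full_cost = pairwise_comparisons(all_users)
subset_cost = pairwise_comparisons(active_users)

print(full_cost)    # 49995000
print(subset_cost)  # 124750
```

Restricting the work to the active subset cuts the number of pairs by a factor of roughly 400 in this example, which is why subsetting is such an effective trick.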

When I created a user follow recommendation system from scratch for a social network, I found that many of these issues came front and center. Training the model took hours, so the only realistic solution was to run it nightly. Additionally, I later created an in-memory copy of our training data, so the algorithm was bound only by CPU, not I/O.

Performance is a nontrivial concern in creating a production recommendation system in both the short term and the long term. It is possible that the approach you initially use will not scale as your company adds users and products. Perhaps a Jupyter Notebook, Pandas, and scikit-learn were acceptable when you had 10,000 users on your platform, but that setup may quickly turn out not to be scalable.

Instead, a PySpark-based support vector machine training algorithm (http://spark.apache.org/docs/2.1.0/api/python/pyspark.mllib.html) may dramatically improve performance and decrease maintenance time. Later still, you may need to switch to dedicated ML chips like TPU or the NVIDIA Volta. Being able to plan for this capacity while still shipping initial working solutions is a critical skill for implementing pragmatic AI solutions that actually make it to production.

Real-World Recommendation Problems: Integration with Production APIs

I found that many real-world problems surface in production at startups that build recommendation systems. These problems are not heavily discussed in ML books. One of them is the “cold-start problem.” In the examples using the Surprise framework, there is already a massive database of “correct answers.” In the real world, you may start with so few users or products that it doesn’t make sense to train a model. What can you do?

A decent solution is to make the path of the recommendation engine follow three phases. For phase one, take the most popular users, content, or products and serve those out as a recommendation. As more UGC is created on the platform, for phase two, use similarity scoring (without training a model). Here is some “hand-coded” code I have used in production a couple of different times that did just that. First we have a Tanimoto score, also known as the Jaccard similarity coefficient.

"""Data Science Algorithms"""


def tanimoto(list1, list2):
    """tanimoto coefficient

    In [2]: list2=['39229', '31995', '32015']
    In [3]: list1=['31936', '35989', '27489',
        '39229', '15468', '31993', '26478']
    In [4]: tanimoto(list1,list2)
    Out[4]: 0.1111111111111111

    Uses intersection of two sets to determine numerical score

    """

    intersection = set(list1).intersection(set(list2))
    # Tanimoto: |A ∩ B| / (|A| + |B| - |A ∩ B|)
    return float(len(intersection)) / (
        len(list1) + len(list2) - len(intersection))
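The docstring example can be checked by hand: the two lists share one ID ('39229'), so the score is 1 / (7 + 3 - 1) = 1/9. The following is a minimal set-based sketch of the same coefficient. `tanimoto_sets` is a hypothetical name; unlike a version based on list lengths, it ignores duplicate IDs and returns 0.0 when both inputs are empty.

```python
def tanimoto_sets(items1, items2):
    """Size of the intersection divided by size of the union."""
    a, b = set(items1), set(items2)
    union = a | b
    if not union:
        # Two empty inputs: define the score as 0.0 rather than dividing by zero.
        return 0.0
    return len(a & b) / len(union)


list1 = ['31936', '35989', '27489', '39229', '15468', '31993', '26478']
list2 = ['39229', '31995', '32015']

score = tanimoto_sets(list1, list2)
print(round(score, 4))  # 0.1111
```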

Next is HBD: Here Be Dragons. Follower relationships are downloaded and converted into a Pandas DataFrame.

import os
import pandas as pd

from .algorithms import tanimoto

def follows_dataframe(path=None):
    """Creates Follows Dataframe"""

    if not path:
        path = os.path.join(os.getenv('PYTHONPATH'),
          'ext', 'follows.csv')

    df = pd.read_csv(path)
    return df

def follower_statistics(df):
    """Returns counts of follower behavior

    In [15]: follow_counts.head()
        Out[15]:
        followerId
        581bea20-962c-11e5-8c10-0242528e2f1b    1558
        74d96701-e82b-11e4-b88d-068394965ab2      94
        d3ea2a10-e81a-11e4-9090-0242528e2f1b      93
        0ed9aef0-f029-11e4-82f0-0aa89fecadc2      88
        55d31000-1b74-11e5-b730-0680a328ea36      64
        Name: followingId, dtype: int64

    """


    follow_counts = (
        df.groupby(['followerId'])['followingId']
        .count()
        .sort_values(ascending=False)
    )
    return follow_counts

def follow_metadata_statistics(df):
    """Generates metadata about follower behavior
    
    In [13]: df_metadata.describe()
        Out[13]:
        count    2145.000000
        mean        3.276923
        std        33.961413
        min         1.000000
        25%         1.000000
        50%         1.000000
        75%         3.000000
        max      1558.000000
        Name: followingId, dtype: float64

    """

    dfs = follower_statistics(df)
    df_metadata = dfs.describe()
    return df_metadata

def follow_relations_df(df):
    """Returns a dataframe of follower with all relations"""

    df = df.groupby('followerId').followingId.apply(list)
    dfr = df.to_frame("follow_relations")
    dfr.reset_index(level=0, inplace=True)
    return dfr

def simple_score(column, followers):
    """Used as an apply function for dataframe"""

    return tanimoto(column,followers)

def get_followers_by_id(dfr, followerId):
    """Returns a list of followers by followerID"""

    followers = dfr.loc[dfr['followerId'] == followerId]
    fr = followers['follow_relations']
    return fr.tolist()[0]

def generate_similarity_scores(dfr, followerId,
          limit=10, threshold=.1):
    """Generates a list of recommendations for a followerID"""

    followers = get_followers_by_id(dfr, followerId)
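    # Score every user's follow list against the target user's list
    scores = dfr['follow_relations'].apply(simple_score, args=(followers,))
    # Keep the `limit` highest scores, then drop any at or below the threshold
    recs = scores.sort_values()[-limit:]
    filters_recs = recs.where(recs > threshold).dropna()
    return filters_recs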

def return_similarity_scores_with_ids(dfr, scores):
    """Returns Scores and FollowerID"""

    dfs = pd.DataFrame(dfr, index=scores.index.tolist())
    dfs['scores'] = scores[dfs.index]
    dfs['following_count'] = dfs['follow_relations'].apply(len)
    return dfs

To use this API, you would engage with it by following this sequence.

In [1]: from follows import *

In [2]: df = follows_dataframe()

In [3]: dfr = follow_relations_df(df)

In [4]: dfr.head()

In [5]: scores = generate_similarity_scores(dfr,
         "00480160-0e6a-11e6-b5a1-06f8ea4c790f")

In [5]: scores
Out[5]:
2144    0.000000
713     0.000000
714     0.000000
715     0.000000
716     0.000000
717     0.000000
712     0.000000
980     0.333333
2057    0.333333
3       1.000000
Name: follow_relations, dtype: float64

In [6]: dfs = return_similarity_scores_with_ids(dfr, scores)

In [6]: dfs
Out[6]:
                                followerId  \
980   76cce300-0e6a-11e6-83e2-0242528e2f1b  
2057  f5ccbf50-0e69-11e6-b5a1-06f8ea4c790f  
3     00480160-0e6a-11e6-b5a1-06f8ea4c790f  

                                       follow_relations    scores  \
980   [f5ccbf50-0e69-11e6-b5a1-06f8ea4c790f, 0048016...  0.333333  
2057  [76cce300-0e6a-11e6-83e2-0242528e2f1b, 0048016...  0.333333  
3     [f5ccbf50-0e69-11e6-b5a1-06f8ea4c790f, 76cce30...         1  

      following_count  
980                 2  
2057                2  
3                   2
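The same flow can be traced on a tiny in-memory dataset. The sketch below is dependency-free and uses made-up user IDs; `build_follow_relations` and `similar_users` are hypothetical helpers written for illustration, not part of the module above. One deliberate difference: it excludes the target user from the results, whereas the session above includes that user with a score of 1.

```python
from collections import defaultdict


def tanimoto(list1, list2):
    """Intersection size divided by union size of the two ID lists."""
    intersection = set(list1) & set(list2)
    union_size = len(list1) + len(list2) - len(intersection)
    return len(intersection) / union_size if union_size else 0.0


def build_follow_relations(pairs):
    """Group (follower, following) pairs into follower -> list of followed IDs."""
    relations = defaultdict(list)
    for follower, following in pairs:
        relations[follower].append(following)
    return dict(relations)


def similar_users(relations, follower_id, limit=10, threshold=0.1):
    """Rank other users by how much their follow lists overlap with follower_id's."""
    target = relations[follower_id]
    scored = [
        (other, tanimoto(target, follows))
        for other, follows in relations.items()
        if other != follower_id
    ]
    # Keep only scores above the threshold, best first, capped at `limit`.
    kept = [(other, score) for other, score in scored if score > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:limit]


# Made-up follow events: (followerId, followingId)
pairs = [
    ("alice", "carol"), ("alice", "dave"),
    ("bob", "carol"), ("bob", "erin"),
    ("frank", "gina"),
]
relations = build_follow_relations(pairs)
results = similar_users(relations, "alice")
print(results)  # [('bob', 0.3333333333333333)]
```

Here alice and bob each follow two accounts and share one, giving 1 / (2 + 2 - 1) = 1/3, while frank shares nothing with alice and falls below the threshold.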

This “phase 2” similarity score-based recommendation with the current implementation would need to be run as a batch API. Additionally, Pandas will eventually run into some performance problems at scale. Ditching it at some point for either PySpark or Pandas on Ray (https://rise.cs.berkeley.edu/blog/pandas-on-ray/?twitter=@bigdata) is going to be a good move.

For “phase 3,” it is finally time to pull out the big guns and use something like Surprise and/or PySpark to train an SVD-based model and figure out model accuracy. In the first part of your company’s history, though, why bother when there is little to no value in doing formal ML model training?
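The three phases can be sketched as a simple dispatch on how much data the platform has. The thresholds, the `choose_phase` and `most_popular` names, and the sample data below are all illustrative assumptions; the right cutoffs depend on the product.

```python
from collections import Counter


def choose_phase(n_follow_events, similarity_min=1_000, training_min=100_000):
    """Pick a recommendation strategy from the amount of available data."""
    if n_follow_events < similarity_min:
        return "popularity"      # phase 1: serve the most popular items
    if n_follow_events < training_min:
        return "similarity"      # phase 2: similarity scoring, no training
    return "trained_model"       # phase 3: train and evaluate a model


def most_popular(follow_pairs, limit=3):
    """Phase 1 fallback: the most-followed IDs, most popular first."""
    counts = Counter(following for _, following in follow_pairs)
    return [user_id for user_id, _ in counts.most_common(limit)]


# Made-up (followerId, followingId) events for a brand-new platform.
follow_pairs = [
    ("a", "x"), ("b", "x"), ("c", "y"),
    ("d", "x"), ("e", "y"), ("f", "z"),
]

print(choose_phase(len(follow_pairs)))      # popularity
print(most_popular(follow_pairs, limit=2))  # ['x', 'y']
```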

Another production API issue is how to deal with rejected recommendations. There is nothing more irritating to users than continually getting recommendations for things they don’t want or already have. So, yet another sticky production issue needs to be solved. Ideally, the user is given the ability to click “do not show again” for a list of recommendations; otherwise, your recommendation engine quickly becomes garbage. Additionally, the user is telling you something, so why not take that signal and feed it back into your recommendation engine model?
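One way to honor a “do not show again” click is to filter candidates against what the user has dismissed or already follows before serving them. This is a minimal sketch; the function name and the in-memory sets are assumptions, and a production system would persist dismissals and could also feed them back into training as a negative signal.

```python
def filter_recommendations(candidates, already_following, dismissed, limit=10):
    """Drop candidates the user already follows or has dismissed, keeping order."""
    excluded = set(already_following) | set(dismissed)
    kept = [item for item in candidates if item not in excluded]
    return kept[:limit]


candidates = ["u1", "u2", "u3", "u4", "u5"]  # ranked, best first
already_following = {"u2"}
dismissed = {"u4"}                           # "do not show again" clicks

shown = filter_recommendations(candidates, already_following, dismissed, limit=2)
print(shown)  # ['u1', 'u3']
```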
