by Jennifer Wei
figures by Jeep Veerasak Srisuknimit
Ever wonder how music-streaming services such as Spotify and Pandora find songs that you like? Or how Facebook and Google find stories that are interesting to you? Many technology companies use machine learning algorithms to give personalized product suggestions; these algorithms can be found everywhere on the internet. One such algorithm may have even led you to the Science in the News article that you are now reading.
Essentially, an algorithm is a set of instructions detailing how to complete a certain task. As an example, consider the task of sorting a list of words alphabetically. One approach would be to move all the words starting with ‘A’ to the beginning of the list, then the B’s, the C’s, and so on, working your way through the alphabet. Next, you might sort the words that share a first letter by their second letter, then by their third, and so on until the list is alphabetized. This set of instructions for sorting words is an algorithm; programmers write similar sets of instructions to direct the tasks and functions of machines, from our cell phones to our computers.
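The sorting steps described above can be written out as a short program. This is only a minimal sketch in Python: the function name `alphabetize` and the example words are made up for illustration, and real programs would normally call a built-in sorting routine instead.

```python
from collections import defaultdict

def alphabetize(words, position=0):
    """Sort words by grouping on the letter at `position`, then repeating
    the same steps on each group using the next letter."""
    if len(words) <= 1:
        return list(words)
    finished = []                # words with no letter left at this position
    buckets = defaultdict(list)  # letter -> words that have it at this position
    for word in words:
        if len(word) <= position:
            finished.append(word)
        else:
            buckets[word[position].lower()].append(word)
    ordered = finished           # shorter words come first ("an" before "and")
    for letter in sorted(buckets):  # work through the alphabet: a, b, c, ...
        ordered.extend(alphabetize(buckets[letter], position + 1))
    return ordered

print(alphabetize(["pear", "apple", "fig", "apricot", "peach"]))
```

Each call handles one letter position, which mirrors the "first letter, then second letter, then third" instructions in the text.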
In today’s technology-driven world, machine learning algorithms are commonly used to direct computers to process and analyze data in an adaptive way. In other words, true to its name, machine learning is a subset of computer science in which algorithms “learn” patterns from datasets and improve their predictions about the data over time. In the case of many tech companies, the task of these algorithms is to take data collected from millions of internet users and generate personalized recommendations for each person. So how do these algorithms give good suggestions for everyone? This article will give a brief explanation of how this learning process works and, specifically, how companies like Netflix, Spotify, and Pandora come up with their recommendations.
Data and Goals are Key to Machine Learning Predictions
Two things are necessary for a company to do machine learning: a large amount of user data and a well-specified goal that the company wants to achieve. The user data comes from information that users provide both on and off the company’s website (Figure 1). The more data that machine learning algorithms have to tune and test their mathematical models, the better their predictions of user behavior will be, and thus, the higher the quality of their recommendations. Equally important is having a well-specified goal, meaning a goal with a concrete number attached to it, preferably a number that can be easily and accurately measured from the available data. Setting this goal correctly is critical; not only will it ensure that the data analysis is unambiguous, but it will also set priorities for the algorithm when it makes recommendations.
To get a sense of what data websites can collect from their users, try visiting the Bloomberg Businessweek article “What is Code?” a couple of times. Every time you visit, the webpage will take you back to where you left off, and a little blue bot on the side will tell you how many times you’ve visited the website and how long you’ve spent reading the article. This sort of behavioral information from millions of users is the main source of data for machine learning algorithms. Many tech companies collect your browsing history as input for their algorithms, which then generate outputs such as ads and news stories predicted to be the most relevant to you.
Now let’s consider what kinds of goals a company such as Netflix can set. If Netflix wanted to ‘provide the most satisfying movie/TV service to its customers’, how would Netflix measure customer satisfaction? User ratings would be one way, but many users don’t provide ratings for everything they watch. By using this ratings metric, Netflix would have a lot of unknowns for user satisfaction, making performance towards this goal difficult to measure accurately.
If, instead, the goal of Netflix was to ‘maximize the number of hours the user spends per week on their service,’ the data needed to measure this target is readily available to the company. This means that performance towards this second goal will be a lot easier to measure. However, this goal implies that Netflix prioritizes having its users spend lots of time on its service, which isn’t identical to the previous goal of providing the most satisfying streaming service. Netflix would probably aim for a balance between these goals, possibly by setting multiple easily-measured targets to represent its overall goal.
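The difference between these two goals can be seen with a toy calculation. The sketch below assumes a made-up one-week viewing log; the user names, titles, and numbers are invented for illustration and do not come from Netflix. It shows that hours watched can be measured for every viewing, while ratings are missing for most of them.

```python
from collections import defaultdict

# Hypothetical one-week viewing log: (user, title, hours_watched, rating or None).
# None means the user watched the title but never rated it.
viewing_log = [
    ("ana", "Show A", 3.0, 5),
    ("ana", "Show B", 1.5, None),
    ("ben", "Show A", 2.0, None),
    ("ben", "Movie C", 2.0, None),
    ("cam", "Movie C", 0.5, 2),
]

def rating_coverage(log):
    """Fraction of viewings for which a rating is available."""
    rated = sum(1 for _, _, _, rating in log if rating is not None)
    return rated / len(log)

def hours_per_user(log):
    """Total hours watched by each user during the week."""
    totals = defaultdict(float)
    for user, _, hours, _ in log:
        totals[user] += hours
    return dict(totals)

print(rating_coverage(viewing_log))  # share of viewings with a rating
print(hours_per_user(viewing_log))   # measurable for every user
```

In this invented log, only two of the five viewings carry a rating, so a ratings-based goal leaves most of the data blank, whereas the hours-based goal has a number for every user.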
Recommender Algorithms: Machine learning for tailored suggestions
Once a measurable goal has been decided upon, and there is enough data from users, machine learning algorithms can be trained to give personalized suggestions to each user. This type of algorithm is known as a recommender algorithm. These algorithms make suggestions using one of two approaches: A) the user-based approach, which suggests products that are favored by other people with similar tastes to the user, or B) the product-based approach, which directly compares products and finds other products that are similar to what the user likes.
Spotify’s Discover Weekly playlist is an example of the user-based approach. To get an idea of how this works, imagine you and your friend both like classic rock songs such as Queen’s ‘Bohemian Rhapsody’. Knowing that you and your friend have similar tastes, you might ask your friend for music suggestions. Spotify’s algorithm takes a similar approach, except with their user database, they’re able to ask thousands of other users that share your taste in music for suggestions (Figure 2a). Their algorithm determines which users have similar music interests based on the contents of their playlists. So, to make a suggestion for you, they would look for other users who also have ‘Bohemian Rhapsody’ in one of their playlists. Then, they would take the other songs those users put into those playlists and use them as suggestions for you. You might imagine that many users who have ‘Bohemian Rhapsody’ in a playlist may also have some Beatles songs in the same playlist; these Beatles songs would then show up in your Discover Weekly playlist as recommendations.
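The playlist idea described above can be sketched in a few lines of Python. This is a simplified illustration of the user-based approach, not Spotify’s actual algorithm; the playlists, user names, and the function `recommend_from_playlists` are all made up for the example.

```python
from collections import Counter

# Hypothetical playlists from other users.
playlists = {
    "user_1": ["Bohemian Rhapsody", "Hey Jude", "Let It Be"],
    "user_2": ["Bohemian Rhapsody", "Hey Jude", "Dream On"],
    "user_3": ["Shake It Off", "Blank Space"],
}

def recommend_from_playlists(liked_song, playlists, top_n=3):
    """Suggest the songs that most often share a playlist with `liked_song`."""
    counts = Counter()
    for songs in playlists.values():
        if liked_song in songs:
            # Count each co-occurring song once per playlist.
            counts.update(s for s in set(songs) if s != liked_song)
    # Most frequent first; ties are broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [song for song, _ in ranked[:top_n]]

print(recommend_from_playlists("Bohemian Rhapsody", playlists))
```

Songs that appear alongside ‘Bohemian Rhapsody’ in more playlists rank higher, and songs from playlists that never contain it are not suggested at all.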
Meanwhile, Pandora’s algorithms use the product-based approach. Such algorithms do not rely on the behavior of other users to make suggestions; instead, they find songs that have similar characteristics. Pandora’s characterization of songs is handled by their Music Genome Project: every song is characterized according to 450 features. Then, the algorithm searches through their music database to find other songs that share many of the same features (Figure 2b). For example, when I created a ‘Bohemian Rhapsody’ station on Pandora, the algorithm told me that some of the characteristics of this song include strong male vocals and guitar solos. I then received recommendations for other songs with similar traits: Led Zeppelin’s ‘Stairway to Heaven’ and Aerosmith’s ‘Dream On’.
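One way to picture the product-based approach is to describe each song as a list of feature scores and then look for the songs whose lists are closest. The sketch below uses cosine similarity, a common way to compare such lists; it is an assumption made for illustration, not a description of how Pandora compares songs. The four features and all of the scores are invented, standing in for the 450 features mentioned above.

```python
import math

# Hypothetical feature scores between 0 and 1, in this order:
# [male vocals, guitar solo, piano, electronic beat]
song_features = {
    "Bohemian Rhapsody": [0.9, 0.8, 0.7, 0.0],
    "Stairway to Heaven": [0.9, 0.9, 0.1, 0.0],
    "Dream On": [0.9, 0.6, 0.6, 0.0],
    "Synth Track (made up)": [0.1, 0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Close to 1.0 when two feature lists point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(seed, features, top_n=2):
    """Return the `top_n` songs whose features are closest to the seed song."""
    seed_vector = features[seed]
    scored = [
        (cosine_similarity(seed_vector, vector), title)
        for title, vector in features.items()
        if title != seed
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:top_n]]

print(most_similar("Bohemian Rhapsody", song_features))
```

Note that no other listener’s data is involved here: the suggestions depend only on how the songs themselves are described.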
How Companies Improve their Recommendations
The development of recommender algorithms is still an active area of research. There are many ways to blend the approaches, and depending on the nuances of the goal, one approach might be more favorable than another. One such consideration is the issue of exploration versus exploitation. Is it better for the algorithm to give ‘safe’ suggestions that are very similar to the user’s favorites? Or is it better for the algorithm to reach out into areas that are less similar, and come up with fresh suggestions that might be more interesting to the user? Another issue to consider is how to make recommendations to brand new users without a lot of input data. What are the best ways to guess what products a new user will like based on limited data? Machine learning practitioners are always experimenting with methods; there isn’t one algorithm that works best for all scenarios.
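One simple way to balance exploration and exploitation is a rule sometimes called ‘epsilon-greedy’: most of the time, give a safe suggestion, but every so often, try something fresh. The sketch below is one possible illustration of that rule, not a method attributed to any of the companies above; the function name, the song lists, and the 10% exploration rate are all assumptions.

```python
import random

def pick_suggestion(safe_picks, fresh_picks, epsilon=0.1, rng=random):
    """With probability `epsilon`, explore by choosing a fresh pick;
    otherwise exploit by choosing one of the safe picks."""
    if fresh_picks and rng.random() < epsilon:
        return rng.choice(fresh_picks)
    return rng.choice(safe_picks)

# Hypothetical candidate lists for one user.
safe = ["Stairway to Heaven", "Dream On"]  # very similar to known favorites
fresh = ["Take Five", "Clair de Lune"]     # less similar, possibly more interesting

print(pick_suggestion(safe, fresh, epsilon=0.1))
```

Raising `epsilon` makes the suggestions more adventurous, and lowering it makes them safer; choosing that number is exactly the kind of trade-off described above.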
Despite small differences in recommendation methods, these algorithms have made an undeniably large impact in giving all internet users personalized suggestions. As more of our lives and activities are handled through the internet, it is likely that the role played by machine learning algorithms will increase as well. However, the convenience of on-demand, tailored recommendations comes at the price of releasing an increasing amount of information about ourselves to tech companies, and to the internet at large.
Jennifer Wei is a fifth-year graduate student in Chemical Physics at Harvard University. Her research uses machine learning algorithms to predict organic chemistry.
This article is part of a Special Edition on Artificial Intelligence.
For more information:
For a more technical yet general overview of various machine learning techniques
The winner of a machine learning competition hosted by Netflix
Technical Information for Recommender Algorithms
A Conference on Recommender Systems
Machine Learning and Data Privacy