This site runs best with JavaScript enabled.

NLP Based Recommender System Without User Preferences

Adam Louly

March 23, 2019

adam, data science, python

Recommender systems (RS) have evolved into a fundamental tool for helping users make informed decisions and choices, especially in the era of big data in which customers have to make choices from a large number of products and services. A lot of RS models and techniques have been proposed and most of them have achieved great success. Among them, the content-based RS and collaborative filtering RS are two representative ones. Their efficacy has been demonstrated by both research and industry communities.

We will be building a content-based recommender system based on a natural language processing approach, This approach is very useful when we are dealing with products that are featured by a description or a title (text data in general), like news, jobs, books etc ..

Let’s Start!

The Data

We will be using a dataset of projects from some freelancing websites, that were parsed from RSS feed for educational purposes only.

we will try to recommend interesting projects to the freelancers based on the projects that they have done previously.

1import pandas as pd
2projects = pd.read_csv("projects.csv")
3projects_done= pd.read_csv("projects_done_by_users_history.csv")
4projects.head()

Dateframe of projects

The projects done history table comprises 2 rows, user_id and project_id.

Now let’s prepare the data and clean it.

1projects = projects.drop_duplicates(subset="Title", keep='first', inplace=False)
2projects["Category"] = projects["Category"].fillna('')
3projects["Description"] = projects["Description"].fillna('')
4projects["SubCategory"] = projects["SubCategory"].fillna('')
5projects['Title']= projects['Title'].fillna('')

Let's combine the whole rows as one row for each document

1projects["text"] = projects["Category"] + projects["Description"] + projects["SubCategory"] + projects["Title"]

Now we will calculate TF-IDF “Term Frequency — Inverse Data Frequency”

1from sklearn.feature_extraction.text import TfidfVectorizer
2
3tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0, stop_words='english')
4tfidf_matrix = tf.fit_transform(projects['text'])

Its time to generate the cosine sim matrix

1from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
2
3cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

Now we’re done, our freelancers are free for open projects .. let’s recommend some interesting projects for them :))

getUserDone function will return a list that contains all the ids of done projects by the user.

get_recommendations will return a list with top similar projects to the projects done by a specific user, the list will be sorted ascending. titles = projects['Title'] indices = pd.Series(projects.index, index=projects['Title'])

1def getUserDone(user_id):
2 ProjectsDone = projects_done[users_done['userid'] == user_id]
3 print(ProjectsDone["projectid"])
4 return ProjectsDone["projectid"].values.tolist()
5
6def get_recommendations(user_id):
7 id = getUserDone(user_id)
8 sim_scores = []
9 for idx in id:
10 sim_scores = sim_scores + list(enumerate(cosine_sim[idx]))
11 sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
12 project_indices = [i[0] for i in sim_scores]
13 return titles.iloc[project_indices][len(id):]

Let's recommend top 25 projects that user with id 20 may be interested on.

1get_recommendations(20).head(25)

Top 25 Recommendations

Conclusion

In this article, we have built a content-based recommender using an NLP approach that is very useful when it comes to products with text data.

Without the need for users preferences, we can recommend for them quality recommendations, just from their completed projects.

This approach is also having pros, that's why it is always recommended to use multiple approaches, and combine them for accurate results.

Share article 👉

Liked this article? Make sure to join my Newsletter.