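Simple Product Recommendation with a Data-Driven Approach
Background
When a customer buys an item, the retailer has a chance to learn what else that customer wants to buy, so that items most customers are willing to buy together can be sold side by side to lift sales. Once enough data has been collected, the retailer can run an affinity analysis on it to determine which products are suited to being sold together.
What is affinity? Put simply, it is the similarity, or correlation, between items. For example, someone shopping at a store buys apples and also buys bananas. If many people buy both apples and bananas, displaying the two together will often increase sales. The idea behind this is that items people frequently buy together are likely to be bought together again next time. Simple as it looks, this idea underlies many online and offline product recommendation services.
In the past, product recommendation was usually done by hand, offline, which was slow, labor-intensive, and not very precise. Today it can be automated in a data-driven way, which cuts cost and improves efficiency. Let's see how.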
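Data preparation and overview

    import numpy as np

    dataset_filename = "affinity_dataset.txt"
    X = np.loadtxt(dataset_filename)
    n_samples, n_features = X.shape
    print("This dataset has {0} samples and {1} features".format(n_samples, n_features))

The output is:

    This dataset has 100 samples and 5 features

Let's look at the data and see what was bought in the first five transactions:

    print(X[:5])

    [[ 0. 0. 1. 1. 1.]
     [ 1. 1. 0. 1. 0.]
     [ 1. 0. 1. 1. 0.]
     [ 0. 0. 1. 1. 1.]
     [ 0. 1. 0. 0. 1.]]

Read vertically, each column records whether one product was bought. The columns are, in order, bread, milk, cheese, apples, and bananas. Each row is one purchase by a customer. For example, the first row is a customer who bought cheese, apples, and bananas and nothing else.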
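To make the row and column reading concrete, here is a minimal sketch that turns each of the five rows printed above into a list of product names. The names `item_names`, `first_rows`, and `basket` are made up for this illustration and are not part of the original code.

```python
# Column order as described in the text: bread, milk, cheese, apples, bananas.
item_names = ["bread", "milk", "cheese", "apples", "bananas"]

# The first five transactions printed above, copied as plain lists.
first_rows = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
]


def basket(row, names):
    """Return the names of the items whose flag in this row is 1."""
    return [name for name, flag in zip(names, row) if flag == 1]


baskets = [basket(row, item_names) for row in first_rows]
for number, items in enumerate(baskets, start=1):
    print("Transaction {0}: {1}".format(number, ", ".join(items)))
```

The same helper should work on rows of the NumPy array `X` as well, since it only compares each entry with 1.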
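Data processing
We give the features names to make later processing easier:

    # The names of the features, for your reference.
    features = ["bread", "milk", "cheese", "apples", "bananas"]

Next we compute the support and confidence of the rule "a customer who buys apples also buys bananas". Support is the number of samples in the whole dataset for which the rule holds. Confidence is the support divided by the number of samples to which the rule applies. For this rule, that denominator is the number of people who bought both apples and bananas plus the number who bought apples but not bananas. In other words, anyone who bought apples counts toward it.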
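As a small illustration of these two definitions, the sketch below computes them with NumPy boolean masks on only the five rows printed earlier, not the full dataset. The names `X_small`, `APPLES`, `BANANAS`, `rule_support`, `premise_count`, and `rule_confidence` are assumptions introduced for this example.

```python
import numpy as np

# First five rows of the dataset as printed earlier; columns are
# bread, milk, cheese, apples, bananas.
X_small = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

APPLES, BANANAS = 3, 4  # column indices of the premise and the conclusion

# Rows where the premise holds (apples were bought).
bought_premise = X_small[:, APPLES] == 1
# Rows where the whole rule holds (apples and bananas were bought).
bought_both = bought_premise & (X_small[:, BANANAS] == 1)

rule_support = int(bought_both.sum())           # rows where the rule holds
premise_count = int(bought_premise.sum())       # rows where the rule applies
rule_confidence = rule_support / premise_count  # share of apple buyers who also bought bananas

print("Support: {0}".format(rule_support))
print("Confidence: {0:.3f}".format(rule_confidence))
```

On these five rows, four transactions contain apples and two of those also contain bananas, so the support is 2 and the confidence is 0.5.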
    # How many of the cases that a person bought Apples involved the people purchasing Bananas too?
    # Record both cases where the rule is valid and is invalid.
    rule_valid = 0
    rule_invalid = 0
    for sample in X:
        if sample[3] == 1:  # This person bought Apples
            if sample[4] == 1:
                # This person bought both Apples and Bananas
                rule_valid += 1
            else:
                # This person bought Apples, but not Bananas
                rule_invalid += 1
    print("{0} cases of the rule being valid were discovered".format(rule_valid))
    print("{0} cases of the rule being invalid were discovered".format(rule_invalid))

The output is:

    21 cases of the rule being valid were discovered
    15 cases of the rule being invalid were discovered

So this rule has a support of 21 and a confidence of 21 / (21 + 15), which is about 0.583.
From combinatorics, choosing 2 of the 5 products gives $C_5^2 = 10$ pairs. We compute the confidence for every pair, taking the lower-indexed product as the premise, and print the top three:
    import numpy as np
    from collections import defaultdict

    dataset_filename = "affinity_dataset.txt"
    X = np.loadtxt(dataset_filename)
    n_samples, n_features = X.shape
    print("This dataset has {0} samples and {1} features".format(n_samples, n_features))

    # The names of the features, for your reference.
    features = ["bread", "milk", "cheese", "apples", "bananas"]

    # Now compute for all possible rules
    valid_rules = defaultdict(int)
    invalid_rules = defaultdict(int)
    num_occurences = defaultdict(int)  # how many transactions contain each premise item

    for sample in X:  # each sample is one purchase record
        for premise in range(n_features):
            if sample[premise] == 0:
                continue
            # Record that the premise was bought in another transaction
            num_occurences[premise] += 1
            # Pair each premise only with the items that come after it
            # (0 with 1,2,3,4; 1 with 2,3,4; 2 with 3,4; 3 with 4).
            # That covers all 10 pairs in 10 comparisons, never measures
            # a rule X -> X, and never runs past the last feature.
            for conclusion in range(premise + 1, n_features):
                if sample[conclusion] == 1:
                    # This person also bought the conclusion item
                    valid_rules[(premise, conclusion)] += 1
                else:
                    # This person bought the premise, but not the conclusion
                    invalid_rules[(premise, conclusion)] += 1

    support = valid_rules
    confidence = defaultdict(float)
    for premise, conclusion in valid_rules.keys():
        confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]

Finally we sort the rules and print the top three. First, let's see what the processed results look like:
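    # pprint pretty-prints Python data structures. It is handy on the command line
    # because the output is neatly laid out and easy to read.
    from pprint import pprint
    pprint(list(support.items()))

    [((0, 1), 14),
     ((1, 2), 7),
     ((3, 2), 25),
     ((1, 3), 9),
     ((0, 2), 4),
     ((3, 0), 5),
     ((4, 1), 19),
     ((3, 1), 9),
     ((1, 4), 19),
     ((2, 4), 27),
     ((2, 0), 4),
     ((2, 3), 25),
     ((2, 1), 7),
     ((4, 3), 21),
     ((0, 4), 17),
     ((4, 2), 27),
     ((1, 0), 14),
     ((3, 4), 21),
     ((0, 3), 5),
     ((4, 0), 17)]

Each key is a (premise, conclusion) pair of column indices and each value is that rule's support. Note that this listing contains both orderings of every pair (20 entries), as produced by a loop over all ordered pairs, whereas the loop shown above stores only the 10 pairs whose premise index is lower than the conclusion index.
We define a function for the output so that it is easy to print rules later: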
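The definition of `print_rule` is not included in the listing here. A minimal sketch, reconstructed from the way the function is called and from the output format shown below, could look like this. The exact format strings are an assumption.

```python
def print_rule(premise, conclusion, support, confidence, features):
    """Print one rule with its confidence and support in readable form.

    premise and conclusion are column indices; support and confidence are
    mappings keyed by (premise, conclusion); features holds the item names.
    """
    # Look up the human-readable names through the features list.
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(
        premise_name, conclusion_name))
    # Format strings below are assumed from the sample output.
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
```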
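Because we created the features list earlier, a product's name can be looked up directly by its index. A single list is enough, with no need to define a dictionary (a neat approach).
Sample output:

    premise = 1
    conclusion = 3
    print_rule(premise, conclusion, support, confidence, features)

Rule: If a person buys milk they will also buy apples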
- Confidence: 0.196
- Support: 9
Then we sort the rules by confidence, in descending order:

    # sort and print the first three results
    from operator import itemgetter

    sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
    for index in range(3):
        print("Rule #{0}".format(index + 1))
        (premise, conclusion) = sorted_confidence[index][0]
        print_rule(premise, conclusion, support, confidence, features)

The results are:
Rule #1
Rule: If a person buys cheese they will also buy bananas
- Confidence: 0.659
- Support: 27
Rule #2
Rule: If a person buys bread they will also buy bananas
- Confidence: 0.630
- Support: 17
Rule #3
Rule: If a person buys cheese they will also buy apples
- Confidence: 0.610
- Support: 25
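Judging from the sorted results, the rules "a customer who buys cheese also buys bananas" and "a customer who buys cheese also buys apples" both have high support and confidence. A store manager can use rules like these to rearrange where products are displayed. For example, if cheese is on promotion this week, placing bananas and apples next to it may well raise the store's sales.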
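Rules are directional: "cheese, then apples" and "apples, then cheese" share the same support but are divided by different premise counts, so their confidences can differ. The sketch below scores both directions of every pair with `itertools.permutations`, using only the five sample rows printed earlier rather than the full dataset. The names `rule_stats` and `sample_rows` are made up for this example.

```python
from itertools import permutations


def rule_stats(rows, n_features):
    """Return {(premise, conclusion): (support, confidence)} for all ordered pairs."""
    stats = {}
    for premise, conclusion in permutations(range(n_features), 2):
        # Transactions to which the rule applies (premise item was bought).
        premise_count = sum(1 for row in rows if row[premise] == 1)
        if premise_count == 0:
            continue  # the rule never applies, so confidence is undefined
        # Transactions where premise and conclusion were bought together.
        both = sum(1 for row in rows
                   if row[premise] == 1 and row[conclusion] == 1)
        stats[(premise, conclusion)] = (both, both / premise_count)
    return stats


# Columns: bread, milk, cheese, apples, bananas (first five rows of the data).
sample_rows = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
]

stats = rule_stats(sample_rows, 5)
print("cheese -> apples:", stats[(2, 3)])
print("apples -> cheese:", stats[(3, 2)])
```

On these five rows the two directions have the same support (3) but different confidence (1.0 versus 0.75), which is why the direction of a rule matters when deciding which product to place next to which.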
References:
《Python数据挖掘入门与实践》 (Learning Data Mining with Python)
Dataset