Machine Learning Study Notes: Association Rule Algorithms

Association analysis is a classic problem in machine learning: it studies relationships between items, and the "beer and diapers" story is the canonical example. A few concepts first:

1. Support

Support measures the probability that the itemset {X, Y} appears in the full transaction set:

Support(X→Y) = P(X∪Y) = num(X∪Y) / num(I)

where I is the full transaction set and num(·) counts transactions containing a given itemset: num(I) is the total number of transactions, and num(X∪Y) is the number of transactions containing both X and Y.

2. Confidence

Confidence measures the probability that Y occurs given that the antecedent X has occurred — that is, the probability that a transaction containing X also contains Y:

Confidence(X→Y) = P(Y|X) = P(X∪Y) / P(X) = num(X∪Y) / num(X)

3. Lift

Lift is the ratio of the probability of Y given X to the overall probability of Y. A lift above 1 means X and Y are positively correlated; a lift of exactly 1 means they are independent.

Lift(X→Y) = P(Y|X) / P(Y) = Confidence(X→Y) / P(Y)
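These three measures fall out directly from counting transactions. A minimal sketch — the grocery transactions below are invented purely for illustration:

```python
# Toy transaction set (made up for this example)
transactions = [
    {'beer', 'diapers'},
    {'beer', 'diapers', 'milk'},
    {'diapers', 'milk'},
    {'beer', 'milk'},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """P(Y | X) = support(X ∪ Y) / support(X)."""
    return support(X | Y) / support(X)

def lift(X, Y):
    """confidence(X → Y) / support(Y)."""
    return confidence(X, Y) / support(Y)

X, Y = {'diapers'}, {'beer'}
print(support(X | Y))    # 0.5
print(confidence(X, Y))  # 0.5 / 0.75 ≈ 0.667
print(lift(X, Y))        # ≈ 0.889 (< 1: slightly negatively correlated here)
```

Note that lift below 1 says the rule is actually worse than guessing Y at random, which is why confidence alone can be misleading.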

1. The Apriori Algorithm

Apriori is one of the most influential algorithms for mining frequent itemsets for Boolean association rules; it was proposed by Rakesh Agrawal and Ramakrishnan Srikant. Its key idea is that every subset of a frequent itemset must itself be frequent, which lets infrequent candidates be pruned level by level. A Python 3 implementation:

def createC1(dataSet):
    """Build the list of candidate 1-itemsets as frozensets."""
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])
    C1.sort()
    # list() so the result can be iterated repeatedly in Python 3;
    # frozensets are hashable and can serve as dict keys
    return list(map(frozenset, C1))

def scanD(D, Ck, minSupport):
    """Count each candidate in Ck over the transactions D and return
    those whose support is at least minSupport, plus all supports."""
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData

def aprioriGen(Lk, k):
    """Generate candidate k-itemsets by joining frequent (k-1)-itemsets
    that share their first k-2 items."""
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            # sort before slicing, so the comparison is order-independent
            L1 = sorted(Lk[i])[: k - 2]
            L2 = sorted(Lk[j])[: k - 2]
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])  # set union
    return retList

def apriori(dataSet, minSupport=0.5):
    """Return all frequent itemsets (grouped by size) and their supports."""
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))  # list() so D can be scanned repeatedly
    L1, suppData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, minSupport)
        suppData.update(supK)
        L.append(Lk)
        k += 1
    return L, suppData

def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    """Evaluate rules (freqSet - conseq) -> conseq and keep those
    meeting the minimum confidence."""
    prunedH = []
    for conseq in H:
        conf = supportData[freqSet] / supportData[freqSet - conseq]
        if conf >= minConf:
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    """Recursively build rules with progressively larger consequents."""
    m = len(H[0])
    # Test the current consequents first; skipping this step (a common bug
    # in this listing) loses rules with single-item consequents generated
    # from itemsets of size greater than two.
    Hmp1 = calcConf(freqSet, H, supportData, brl, minConf)
    if len(Hmp1) > 1 and len(freqSet) > m + 1:
        Hmp1 = aprioriGen(Hmp1, m + 1)
        rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)

def generateRules(L, supportData, minConf=0.7):
    """Walk the frequent itemsets of size >= 2 and emit association rules."""
    bigRuleList = []
    for i in range(1, len(L)):
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if i > 1:
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList

if __name__ == '__main__':
    myDat = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
    L, suppData = apriori(myDat, 0.5)
    rules = generateRules(L, suppData, minConf=0.7)
    print('rules:\n', rules)
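As a sanity check, the frequent itemsets that apriori returns for the demo data can be compared with a brute-force enumeration of every possible itemset. This helper is not part of the original listing — it is only a slow reference implementation for verification:

```python
from itertools import combinations

def bruteForceFrequent(dataSet, minSupport):
    """Enumerate every possible itemset and keep those meeting minSupport."""
    transactions = [set(t) for t in dataSet]
    items = sorted(set().union(*transactions))
    frequent = set()
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            cand = frozenset(combo)
            sup = sum(cand <= t for t in transactions) / len(transactions)
            if sup >= minSupport:
                frequent.add(cand)
    return frequent

myDat = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
# Should equal the union of all levels returned by apriori(myDat, 0.5)
print(bruteForceFrequent(myDat, 0.5))
```

Brute force scans 2^n candidate itemsets, so it is only feasible for tiny alphabets; Apriori's level-wise pruning is what makes larger inputs tractable.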

2. The FP-growth Algorithm

FP-growth is a faster alternative to Apriori: it compresses the transactions into an FP-tree and mines frequent patterns from the tree directly, without generating candidate sets.

To use it, install the pyfpgrowth library:

sudo pip install pyfpgrowth

# -*- coding: utf-8 -*-
import pyfpgrowth

transactions = [[1, 2, 5],
                [2, 4],
                [2, 3],
                [1, 2, 4],
                [1, 3],
                [2, 3],
                [1, 3],
                [1, 2, 3, 5],
                [1, 2, 3]]

# 2 is the minimum support threshold (a transaction count, not a ratio)
patterns = pyfpgrowth.find_frequent_patterns(transactions, 2)
# 0.7 is the minimum confidence
rules = pyfpgrowth.generate_association_rules(patterns, 0.7)
print(rules)
