Python Deep Learning 12: Sentiment Classification of Chinese Text with an Attention Mechanism (Self-Attention) in Keras (with Detailed Comments), from 阡之尘埃's blog

Preprocessing the Chinese data

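Unlike English, Chinese has no spaces between words, so the text has to be segmented with the jieba library. After segmentation, punctuation and function words that carry little meaning, such as “了”, “的”, “也”, “就”, and “很”, are removed. This requires a stop-word list. Commonly used lists are easy to find online, or you can leave a comment and ask me for one. I have a fairly complete list and a simplified one; this post uses the simplified one.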
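As a minimal sketch of the filtering step described above, assume the sentence has already been segmented into tokens (in this post that job is done by jieba.lcut). The token list and the tiny stop-word set below are made up for illustration and are not the post's actual data.

```python
# Hypothetical tokens, standing in for the output of jieba.lcut on one review
tokens = ["味道", "很", "好", "，", "送餐", "也", "快", "了"]

# Tiny illustrative stop-word set (the post loads its list from a file)
stop_words = {"了", "的", "也", "就", "很", "，"}

def remove_stop_words(tokens, stop_words):
    """Keep only tokens that are not in the stop-word set, joined by spaces."""
    return " ".join(w for w in tokens if w not in stop_words)

cleaned = remove_stop_words(tokens, stop_words)
print(cleaned)  # 味道 好 送餐 快
```

Using a set for the stop words keeps each membership check cheap, which matters when the function is applied to thousands of reviews.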
First, here is what the data looks like:
Import the packages and data, read the stop words, then segment and clean the text with jieba:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['KaiTi']  # default font that can render Chinese (KaiTi here; SimHei is another option)
plt.rcParams['axes.unicode_minus'] = False   # make the minus sign render correctly in saved figures
import jieba
stop_list  = pd.read_csv("stopwords_简略版.txt",index_col=False,quoting=3,
                         sep="\t",names=['stopword'], encoding='utf-8')
stop_words = set(stop_list['stopword'])  # a set makes the membership test below fast
# jieba segmentation function: cut the sentence and drop stop words
def txt_cut(juzi):
    lis=[w for w in jieba.lcut(juzi) if w not in stop_words]
    return " ".join(lis)
df=pd.read_excel('外卖.xlsx')
data=pd.DataFrame()
data['label']=df['label']
data['cutword']=df['review'].astype('str').apply(txt_cut)
With the text segmented, the result looks like this:
Check the distribution of the label y:
data['label'].value_counts().plot(kind='bar') 
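There are nearly 8,000 negative reviews (label 0) and about 4,000 positive reviews (label 1). The classes are imbalanced, so the train/test split should use stratified sampling.
Next, convert the text into arrays with Keras's Tokenizer class, starting by mapping every word to an integer index. The parameter num_words=6000 matters here: it caps the vocabulary at the most frequent words (only indices below num_words are kept when texts are converted), so the model sees at most about 6,000 distinct words.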
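To make the effect of num_words concrete, here is a rough pure-Python sketch of frequency-based indexing. It mimics the idea behind Tokenizer rather than reproducing its implementation; the helper names build_word_index and texts_to_sequences and the three sample reviews are assumptions made for this illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is reserved for padding)."""
    counts = Counter(w for t in texts for w in t.split())
    # sorted() is stable, so ties keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {w: i + 1 for i, (w, _) in enumerate(ranked)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping any word whose index is >= num_words."""
    return [[word_index[w] for w in t.split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]

texts = ["好吃 好吃 快", "好吃 慢", "快 凉"]   # made-up, already segmented reviews
word_index = build_word_index(texts)
seqs = texts_to_sequences(texts, word_index, num_words=3)
print(word_index)  # {'好吃': 1, '快': 2, '慢': 3, '凉': 4}
print(seqs)        # [[1, 1, 2], [1], [2]]
```

With num_words=3 only the two most frequent words survive, so rare words simply disappear from the sequences instead of being mapped to a placeholder.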
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
# Split the space-separated text into words and build the word-index dictionary
tok = Tokenizer(num_words=6000)
tok.fit_on_texts(data['cutword'].values)
print("Number of samples: ", tok.document_count)
print({k: tok.word_index[k] for k in list(tok.word_index)[:10]})
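Reviews differ in length, but training needs tensors of equal length (long ones are truncated, short ones are padded). So we have to pick a maximum length, the max_words parameter. It is the number of time steps for the recurrent networks and also the sequence length the attention layer works over. If max_words is too small, many sentences lose information; if it is too large, the data matrix is mostly padding and the computation grows. Let's look at the frequency distribution of the sequence lengths in X: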
# Convert the texts to sequences of integer indices
X= tok.texts_to_sequences(data['cutword'].values)
# Look at the distribution of sequence lengths in X
length=[]
for i in X:
    length.append(len(i))
v_c=pd.Series(length).value_counts()
print(v_c[v_c>20])   # show only lengths that occur more than 20 times
v_c[v_c>20].plot(kind='bar',figsize=(12,5)) 
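The vast majority of sentences contain no more than 10 words, and a length of 5 is the most common. Here we choose max_words=20, pad or truncate every sentence to a vector of length 20, and extract y: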
# Pad (or truncate) the sequences to the same length
X= sequence.pad_sequences(X, maxlen=20)
Y=data['label'].values
print("X.shape: ", X.shape)
print("Y.shape: ", Y.shape)
#X=np.array(X)
#Y=np.array(Y) 
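The "truncate the long ones, pad the short ones" rule can be sketched in plain Python. The sketch below assumes the pad_sequences defaults (padding and truncating both happen at the front of the sequence); the function name pad_or_truncate and the sample sequences are made up for illustration.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` items of long ones."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = [5, 8, 2]
long = [1, 2, 3, 4, 5, 6, 7]
print(pad_or_truncate(short, 5))  # [0, 0, 5, 8, 2]
print(pad_or_truncate(long, 5))   # [3, 4, 5, 6, 7]
```

Padding at the front keeps the real words next to the final time step, which is the state a recurrent layer passes on to the classifier.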
Then split the data into training and test sets and check the shapes:
X_train, X_test, Y_train, Y_test =  train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=0)
X_train.shape,X_test.shape,Y_train.shape, Y_test.shape 
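What stratify=Y buys us can be illustrated with a small pure-Python sketch: the split is done separately inside each class, so the 2:1 imbalance is preserved in both parts. This is only an illustration of the idea, not scikit-learn's implementation; the function stratified_split and the label list are assumptions for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # shuffle within the class
        n_test = round(len(idx) * test_size)   # same fraction taken from every class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

labels = [0] * 80 + [1] * 40          # made-up labels with a 2:1 imbalance
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))            # 96 24
print(sum(labels[i] for i in test_idx))         # 8 positives among the 24 test samples
```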
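One-hot encode y, keeping a copy of the original test labels (Y_test_original) to make evaluation easier later. Then look at the first three rows of x and y: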
Y_test_original=Y_test.copy()
Y_train = to_categorical(Y_train)
Y_test = to_categorical(Y_test)
print(X_train[:3])
print(Y_test[:3]) 
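For readers unfamiliar with one-hot encoding, the following NumPy sketch shows the transformation on three made-up labels; it illustrates the idea behind to_categorical rather than its actual implementation, and the helper name one_hot is hypothetical.

```python
import numpy as np

def one_hot(y, num_classes):
    """Row i is all zeros except for a 1 in column y[i]."""
    return np.eye(num_classes, dtype="float32")[y]

y = np.array([0, 1, 1])
encoded = one_hot(y, num_classes=2)
print(encoded)
# argmax along the class axis recovers the original integer labels
print(np.argmax(encoded, axis=1))
```

This is also why the evaluation code later applies np.argmax to the predicted probabilities and compares against the untouched copy Y_test_original.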
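Building the neural networks
Keras does not provide this kind of attention-pooling layer out of the box (its built-in attention layers work differently), so we define one ourselves: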
# Custom attention layer
from keras import initializers, constraints,activations,regularizers
from keras import backend as K
from keras.layers import Layer
class Attention(Layer):
    # Returns: not the attention weights, but the sum of the timestep vectors after each is multiplied by its weight.
    # Input: a 3-D tensor (batch_size, timesteps, units); step_dim is the number of RNN timesteps, i.e. the padded sequence length.
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')
        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)
        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)
        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)
    def build(self, input_shape):
        assert len(input_shape) == 3
        self.W = self.add_weight(shape=(input_shape[-1],),initializer=self.init,name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,constraint=self.W_constraint)
        self.features_dim = input_shape[-1]
        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),initializer='zero', name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,constraint=self.b_constraint)
        else:
            self.b = None
        self.built = True
    def compute_mask(self, input, input_mask=None):
        return None     # later layers do not need the mask, so return None
    def call(self, x, mask=None):
        features_dim = self.features_dim    # input_shape[-1], set in build(); step_dim is the argument we passed in and equals input_shape[1], the number of RNN timesteps
        step_dim = self.step_dim
        # After reshaping the input and the weights and taking the dot product, the tensor has shape (batch_size*timesteps, 1).
        # Each sample has to be normalized separately, so reshape to (-1, timesteps).
        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),K.reshape(self.W, (features_dim, 1))), (-1, step_dim))
        if self.bias:
            eij += self.b
        eij = K.tanh(eij)    # tanh is the usual RNN default; the choice matters little here because a softmax-style normalization follows
        a = K.exp(eij)
        if mask is not None:    # if an earlier layer supplied a mask, masked timesteps must not contribute, so their attention weights are set to 0
            a *= K.cast(mask, K.floatx())   # cast converts the mask to the float dtype so it can be multiplied with a
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)      # same as K.expand_dims(a, axis=-1): adds a trailing dimension, e.g. shape (3,) becomes (3, 1)
        # now a.shape = (batch_size, timesteps, 1) and x.shape = (batch_size, timesteps, units)
        weighted_input = x * a
        # weighted_input has shape (batch_size, timesteps, units); each timestep vector has been multiplied by its weight
        # summing weighted_input over axis=1 returns a tensor of shape (batch_size, units)
        return K.sum(weighted_input, axis=1)
    def compute_output_shape(self, input_shape):    # the result is the context vector c with shape (batch_size, units)
        return input_shape[0],  self.features_dim
Don't worry about how complicated this class looks. You can skip the details and simply use it like a function later on.
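For readers who do want to see what the layer computes, here is a small NumPy sketch of the same forward pass: score each timestep, normalize the scores over timesteps, and return the weighted sum. It leaves out masking, regularizers and constraints, and the function name attention_pool and the random inputs are assumptions for this illustration.

```python
import numpy as np

def attention_pool(x, W, b):
    """x: (batch, timesteps, units), W: (units,), b: (timesteps,) -> (batch, units)."""
    e = np.tanh(x @ W + b)                          # one score per timestep: (batch, timesteps)
    a = np.exp(e)
    a = a / (a.sum(axis=1, keepdims=True) + 1e-7)   # normalize over timesteps
    return (x * a[..., None]).sum(axis=1)           # weighted sum of the timestep vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 20, 8))    # made-up batch: 2 samples, 20 timesteps, 8 units
W = rng.normal(size=8)
b = np.zeros(20)
out = attention_pool(x, W, b)
print(out.shape)  # (2, 8)
```

When W is all zeros every timestep gets the same weight, and the layer reduces to plain average pooling over time; training moves W away from that so some timesteps count more than others.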
Next, import the commonly used Keras layers and define some parameters:
from keras.preprocessing import sequence
from keras.models import Sequential,Model
from keras.layers import Dense,Input, Dropout, Embedding, Flatten,MaxPooling1D,Conv1D,SimpleRNN,LSTM,GRU,Multiply
from keras.layers import Bidirectional,Activation,BatchNormalization
from keras.layers import concatenate  # importing from keras.layers also works where keras.layers.merge is unavailable
seed = 10
np.random.seed(seed)  # fix the random seed
# vocabulary capped at 6000 word indices, maximum sentence length 20
top_words=6000
max_words=20
num_labels=2  # binary classification
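Now build the model function. It is fairly long because it defines 12 models in one place so the code can be reused, but the block for each model is clearly separated: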
def build_model(top_words=top_words,max_words=max_words,num_labels=num_labels,mode='LSTM',hidden_dim=[32]):
    if mode=='RNN':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(SimpleRNN(32))  
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='MLP':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(Flatten())
        model.add(Dense(256, activation="relu"))  
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='LSTM':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(LSTM(32))
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='GRU':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(GRU(32))
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='CNN':        # 1-D convolution
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))
        model.add(Conv1D(filters=32, kernel_size=3, padding="same",activation="relu"))
        model.add(MaxPooling1D(pool_size=2))
        model.add(Flatten())
        model.add(Dense(256, activation="relu"))
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='CNN+LSTM':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Dropout(0.25))    
        model.add(Conv1D(filters=32, kernel_size=3, padding="same",activation="relu"))
        model.add(MaxPooling1D(pool_size=2))
        model.add(LSTM(64))
        model.add(Dropout(0.25))   
        model.add(Dense(num_labels, activation="softmax"))
    elif mode=='BiLSTM':
        model = Sequential()
        model.add(Embedding(top_words, 32, input_length=max_words))
        model.add(Bidirectional(LSTM(64)))
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.25))
        model.add(Dense(num_labels, activation='softmax'))
    # The networks below are built with the Functional API
    elif mode=='TextCNN':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
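        ## The embedding is frozen (trainable=False); note that no pretrained vectors are loaded here, so it stays at its random initialization
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        ## Convolution window sizes of 3, 4 and 5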
        cnn1 = Conv1D(32, 3, padding='same', strides = 1, activation='relu')(layer)
        cnn1 = MaxPooling1D(pool_size=2)(cnn1)
        cnn2 = Conv1D(32, 4, padding='same', strides = 1, activation='relu')(layer)
        cnn2 = MaxPooling1D(pool_size=2)(cnn2)
        cnn3 = Conv1D(32, 5, padding='same', strides = 1, activation='relu')(layer)
        cnn3 = MaxPooling1D(pool_size=2)(cnn3)
        # Concatenate the outputs of the three branches
        cnn = concatenate([cnn1,cnn2,cnn3], axis=-1)
        flat = Flatten()(cnn) 
        drop = Dropout(0.2)(flat)
        main_output = Dense(num_labels, activation='softmax')(drop)
        model = Model(inputs=inputs, outputs=main_output)
    elif mode=='Attention':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        attention_probs = Dense(32, activation='softmax', name='attention_vec')(layer)
        attention_mul =  Multiply()([layer, attention_probs])
        mlp = Dense(64)(attention_mul) # plain fully connected layer
        fla=Flatten()(mlp)
        output = Dense(num_labels, activation='softmax')(fla)
        model = Model(inputs=[inputs], outputs=output)  
    elif mode=='Attention*3':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        attention_probs = Dense(32, activation='softmax', name='attention_vec')(layer)
        attention_mul =  Multiply()([layer, attention_probs])
        mlp = Dense(32,activation='relu')(attention_mul) 
        attention_probs = Dense(32, activation='softmax', name='attention_vec1')(mlp)
        attention_mul =  Multiply()([mlp, attention_probs])
        mlp2 = Dense(32,activation='relu')(attention_mul) 
        attention_probs = Dense(32, activation='softmax', name='attention_vec2')(mlp2)
        attention_mul =  Multiply()([mlp2, attention_probs])
        mlp3 = Dense(32,activation='relu')(attention_mul)           
        fla=Flatten()(mlp3)
        output = Dense(num_labels, activation='softmax')(fla)
        model = Model(inputs=[inputs], outputs=output)      
    elif mode=='BiLSTM+Attention':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        bilstm = Bidirectional(LSTM(64, return_sequences=True))(layer)  # return_sequences=True keeps the output 3-D
        bilstm = Bidirectional(LSTM(64, return_sequences=True))(bilstm)
        layer = Dense(256, activation='relu')(bilstm)
        layer = Dropout(0.2)(layer)
        ## Attention mechanism
        attention = Attention(step_dim=max_words)(layer)
        layer = Dense(128, activation='relu')(attention)
        output = Dense(num_labels, activation='softmax')(layer)
        model = Model(inputs=inputs, outputs=output)  
    elif mode=='BiGRU+Attention':
        inputs = Input(name='inputs',shape=[max_words,], dtype='float64')
        layer = Embedding(top_words, 32, input_length=max_words, trainable=False)(inputs)
        attention_probs = Dense(32, activation='softmax', name='attention_vec')(layer)
        attention_mul =  Multiply()([layer, attention_probs])
        mlp = Dense(64,activation='relu')(attention_mul) # plain fully connected layer
        #bat=BatchNormalization()(mlp)
        #act=Activation('relu')
        gru=Bidirectional(GRU(32))(mlp)
        mlp = Dense(16,activation='relu')(gru)
        output = Dense(num_labels, activation='softmax')(mlp)
        model = Model(inputs=[inputs], outputs=output) 
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model 
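The first few single models use the simplest building-block style of definition (Sequential). The more complex models that follow are all built with the Functional API.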
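It may help to see what the Dense(32, activation='softmax') plus Multiply() pair in the 'Attention' modes computes. The NumPy sketch below uses random stand-ins for the embedding output and the Dense weights (all values are made up); it only illustrates the shape logic and is not the trained model.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
emb = rng.normal(size=(1, 20, 32))        # made-up embedding output: (batch, max_words, 32)
kernel = rng.normal(size=(32, 32)) * 0.1  # stand-in for the Dense(32) weights
bias = np.zeros(32)

probs = softmax(emb @ kernel + bias, axis=-1)   # what Dense(32, activation='softmax') computes
gated = emb * probs                              # what Multiply() computes

print(probs.sum(axis=-1)[0, :3])  # the 32 weights of each word sum to 1
print(gated.shape)                # (1, 20, 32)
```

Because Dense acts on the last axis, the softmax here normalizes over the 32 embedding features of each word rather than over the 20 word positions, so this variant reweights features within a word. The custom Attention class, by contrast, normalizes over timesteps.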
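Next, define functions that plot the loss and accuracy curves and compute the confusion matrix and related evaluation metrics:
# Loss/accuracy plots, confusion matrix and related metrics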
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import cohen_kappa_score
def plot_loss(history):
    # Plot the training and validation loss
    plt.subplots(1,2,figsize=(10,3))
    plt.subplot(121)
    loss = history.history["loss"]
    epochs = range(1, len(loss)+1)
    val_loss = history.history["val_loss"]
    plt.plot(epochs, loss, "bo", label="Training Loss")
    plt.plot(epochs, val_loss, "r", label="Validation Loss")
    plt.title("Training and Validation Loss")
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.legend()  
    plt.subplot(122)
    acc = history.history["accuracy"]
    val_acc = history.history["val_accuracy"]
    plt.plot(epochs, acc, "b-", label="Training Acc")
    plt.plot(epochs, val_acc, "r--", label="Validation Acc")
    plt.title("Training and Validation Accuracy")
    plt.xlabel("Epochs")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.tight_layout()
    plt.show()
def plot_confusion_matrix(model,X_test,Y_test_original):
    # Predicted probabilities
    prob=model.predict(X_test) 
    # Predicted classes
    pred=np.argmax(prob,axis=1)
    # Cross-tabulation of actual vs. predicted labels, i.e. the confusion matrix
    table = pd.crosstab(Y_test_original, pred, rownames=['Actual'], colnames=['Predicted'])
    #print(table)
    sns.heatmap(table,cmap='Blues',fmt='.20g', annot=True)
    plt.tight_layout()
    plt.show()
    # Metrics derived from the confusion matrix
    print(classification_report(Y_test_original, pred))
    # Cohen's kappa
    print("Cohen's kappa: "+str(cohen_kappa_score(Y_test_original, pred)))
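Cohen's kappa corrects accuracy for the agreement expected by chance, which is useful with imbalanced classes like these. As a sketch of the formula kappa = (p_o - p_e) / (1 - p_e), the NumPy function below computes it from a confusion matrix; the helper name cohen_kappa and the eight labels are made up for illustration.

```python
import numpy as np

def cohen_kappa(y_true, y_pred, num_classes=2):
    """kappa = (p_o - p_e) / (1 - p_e), computed from the confusion matrix."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    p_o = np.trace(cm) / n                                  # observed agreement (accuracy)
    p_e = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n ** 2  # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

y_true = [0, 0, 0, 0, 1, 1, 1, 1]   # made-up labels
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
print(cohen_kappa(y_true, y_pred))  # 0.5
```

Here the accuracy is 0.75, but half of that agreement would be expected from chance alone, so kappa comes out at 0.5.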
Define the training function:
# Training function
def train_fuc(max_words=max_words,mode='BiLSTM+Attention',batch_size=32,epochs=10,hidden_dim=[32],show_loss=True,show_confusion_matrix=True):
    # Build the model
    model=build_model(max_words=max_words,mode=mode)
    model.summary()  # summary() prints the table itself; wrapping it in print() would also print None
    history=model.fit(X_train, Y_train,batch_size=batch_size,epochs=epochs,validation_split=0.2, verbose=1)
    print('------------ Training finished ------------')
    # Evaluate the model on the test set
    loss, accuracy = model.evaluate(X_test, Y_test)
    print("Test accuracy = {:.4f}".format(accuracy))
    if show_loss:
        plot_loss(history)
    if show_confusion_matrix:
        plot_confusion_matrix(model=model,X_test=X_test,Y_test_original=Y_test_original) 
Set some parameters:
top_words=6000
max_words=20
batch_size=32
epochs=4
show_confusion_matrix=True
show_loss=True
mode='MLP'   
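Only 4 training epochs are used. That is few, but the dataset is small and simple and every sentence is short, so the single models above overfit easily. Four epochs are enough and also save time.
Now train and evaluate the models one by one: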
train_fuc(mode='MLP',batch_size=batch_size,epochs=epochs) 
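The output shows the training loss and accuracy for each epoch together with the validation loss and accuracy, plots both curves, reports the test accuracy, draws the confusion matrix, and prints the metrics derived from it along with Cohen's kappa. The MLP reaches a test accuracy of 0.8795.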
1DCNN 
# The Embedding layer takes the 2-D integer matrix (samples, max_words) and produces the 3-D tensor itself, so X is passed in unchanged (no reshape needed)
train_fuc(mode='CNN',batch_size=batch_size,epochs=epochs) 
The result is similar, with a test accuracy of 0.8882.
model='RNN' 
train_fuc(mode=model,epochs=epochs) 
The results are similar, so not all of the output is shown. Test accuracy is 0.8912.
train_fuc(mode='LSTM',epochs=epochs) 
Similar again, so the output is omitted. Test accuracy is 0.8966 (the highest so far).
train_fuc(mode='GRU',epochs=epochs) 
Test accuracy is 0.8912.
CNN+LSTM 
train_fuc(mode='CNN+LSTM',epochs=epochs) 
Test accuracy is 0.8916.
BiLSTM  
train_fuc(mode='BiLSTM',epochs=epochs) 
Test accuracy is 0.8816.
TextCNN 
train_fuc(mode='TextCNN',epochs=30) 
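The number of epochs is increased here because the models from this point on are more complex, overfit less easily, and need more epochs to train.
Test accuracy is 0.8474.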
 Attention 
train_fuc(mode='Attention',epochs=100) 
Test accuracy is 0.8207.
BiLSTM+Attention 
train_fuc(mode='BiLSTM+Attention',epochs=30) 
Test accuracy is 0.8236.
BiGRU+Attention 
train_fuc(mode='BiGRU+Attention',epochs=100) 
Test accuracy is 0.8607.
  Attention*3 
train_fuc(mode='Attention*3',epochs=50) 
Test accuracy is 0.8057.
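Clearly, the models with attention are less prone to overfitting. A single recurrent network overfits after only four epochs, whereas the attention models need more epochs: their validation accuracy keeps rising and their loss keeps falling.
Although their overall test accuracy ends up below that of the earlier single networks, my guess is that this comes from too few training epochs and too little data. Another likely contributor is that the Functional API models freeze their embedding layer (trainable=False) without loading pretrained vectors, while the Sequential models train theirs.
The takeaway-review dataset consists of very short texts, is fairly simple, and the sample size is not large.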
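Moreover, compared with the Transformer, the attention used here has no residual connections, no layer normalization or similar techniques, no encoder-decoder structure, and does not stack many layers (the Transformer has 18 attention layers).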
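To make the contrast concrete, here is a heavily simplified NumPy sketch of one Transformer-style block: single-head scaled dot-product self-attention followed by a residual connection and layer normalization. It omits multiple heads, the feed-forward sublayer, and the learned scale and shift of layer normalization, and all weights and inputs are random stand-ins chosen for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention_block(x, Wq, Wk, Wv):
    """x: (timesteps, d). Single-head self-attention + residual + layer norm."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # every position scores every position
    weights = softmax(scores, axis=-1)        # each row sums to 1 over positions
    return layer_norm(x + weights @ v), weights   # residual connection, then normalization

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(20, d))                           # made-up sequence of 20 word vectors
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out, weights = self_attention_block(x, Wq, Wk, Wv)
print(out.shape, weights.shape)  # (20, 16) (20, 20)
```

Unlike the pooling layer in this post, the block returns a full sequence of the same shape as its input, which is what allows many such blocks to be stacked.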
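In the future, attention could be tested and trained on larger and more complex datasets, with bigger and deeper networks and more hyperparameter tuning, provided of course that more computing resources are available (time to buy a better computer).....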
  