This post implements a model that turns text into emojis. Because it relies on word embeddings, it can accurately map words that never appeared in the training data onto the appropriate emoji, even with very little training data. We start from a simple model (Emojifier-V1) to introduce how word embeddings are used, and then bring in an LSTM to build a more capable emojifier that can exploit contextual information.
1 | import numpy as np |
The baseline model: Emojifier-V1
The EMOJISET dataset
First, build a simple baseline model using a small dataset:
- X contains 127 sentences (strings)
- Y contains integer labels from 0 to 4, each corresponding to one emoji

Load the dataset below; it is split into a training set (127 examples) and a test set (56 examples).
1 | X_train, Y_train = read_csv('data/emoji_data/train_emoji.csv') |
1 | maxLen = len(max(X_train, key=len).split()) |
1 | index = 1 |
I am proud of your achievements 😄
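A minimal sketch of the loading and inspection steps above; read_csv and label_to_emoji are assumed to be helpers shipped with the course code, and the test-split path is hypothetical:

```python
import numpy as np
from emo_utils import read_csv, label_to_emoji  # assumed course helper module

# Load sentences X and integer labels Y (0-4) from CSV.
X_train, Y_train = read_csv('data/emoji_data/train_emoji.csv')
# X_test, Y_test = read_csv('data/emoji_data/test_emoji.csv')  # hypothetical test-split path

# Number of words in the longest training sentence (used later for padding).
maxLen = len(max(X_train, key=len).split())

# Print one training example together with its emoji label.
index = 1
print(X_train[index], label_to_emoji(Y_train[index]))
```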
Overview of Emojifier-V1
Next, implement the baseline model, Emojifier-V1:

The model's input is a sentence (i.e., a string such as "I love you"). The output is a probability vector of shape (1, 5), which is fed into an argmax layer to pick the most likely emoji.
1 | Y_oh_train = convert_to_one_hot(Y_train, C = 5) |
1 | index = 50 |
0 is converted into one hot [ 1. 0. 0. 0. 0.]
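A minimal sketch of what convert_to_one_hot() might look like, assuming Y is a 1-D NumPy array of integer labels:

```python
import numpy as np

def convert_to_one_hot(Y, C):
    """Turn integer labels into one-hot rows with C classes."""
    # np.eye(C) is the C x C identity matrix; indexing it with the labels
    # selects one row per example, giving an array of shape (m, C).
    return np.eye(C)[Y.reshape(-1)]

# Label 0 with C = 5 classes becomes [1. 0. 0. 0. 0.]
print(convert_to_one_hot(np.array([0]), C=5)[0])
```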
Implementing Emojifier-V1
As shown in Figure 2, the first step is to convert the input sentence into word vectors and then average them. Here we use the pre-trained 50-dimensional GloVe embeddings to obtain the word vectors.
1 | word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('model_data/glove.6B.50d.txt') |
After loading you get:
- word_to_index: a dictionary mapping each word to its index in the vocabulary (400,001 words, valid indices 0 to 400,000)
- index_to_word: a dictionary mapping indices back to words
- word_to_vec_map: a dictionary mapping each word to its GloVe vector
Run the cell below to check that the mappings work.
1 | word = "cucumber" |
the index of cucumber in the vocabulary is 113317
the 289846th word in the vocabulary is potatos
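For reference, a plausible sketch of read_glove_vecs(), assuming the standard GloVe text format (each line is a word followed by its 50 values); the indexing convention of the actual helper may differ:

```python
import numpy as np

def read_glove_vecs(glove_file):
    """Parse a GloVe text file into the three dictionaries described above."""
    words, word_to_vec_map = set(), {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            words.add(parts[0])
            word_to_vec_map[parts[0]] = np.array(parts[1:], dtype=np.float64)

    # Assign an index to every word of the (sorted) vocabulary.
    word_to_index, index_to_word = {}, {}
    for i, w in enumerate(sorted(words), start=1):
        word_to_index[w] = i
        index_to_word[i] = w
    return word_to_index, index_to_word, word_to_vec_map
```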
Next, implement sentence_to_avg(), which involves two steps:
1. Convert the sentence to lowercase and split it into a list of words (see X.lower() and X.split()).
2. For each word in the sentence, look up its GloVe embedding, then average the vectors.
1 | # GRADED FUNCTION: sentence_to_avg |
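A minimal sketch of the averaging function described above, assuming every word of the sentence appears in word_to_vec_map (50-dimensional GloVe vectors):

```python
import numpy as np

def sentence_to_avg(sentence, word_to_vec_map):
    """Average the GloVe vectors of all words in a sentence."""
    # Step 1: lowercase the sentence and split it into words.
    words = sentence.lower().split()

    # Step 2: sum the embedding of every word, then divide by the word count.
    avg = np.zeros((50,))
    for w in words:
        avg += word_to_vec_map[w]
    return avg / len(words)
```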
1 | avg = sentence_to_avg("Morrocan couscous is my favorite dish", word_to_vec_map) |
avg = [-0.008005 0.56370833 -0.50427333 0.258865 0.55131103 0.03104983
-0.21013718 0.16893933 -0.09590267 0.141784 -0.15708967 0.18525867
0.6495785 0.38371117 0.21102167 0.11301667 0.02613967 0.26037767
0.05820667 -0.01578167 -0.12078833 -0.02471267 0.4128455 0.5152061
0.38756167 -0.898661 -0.535145 0.33501167 0.68806933 -0.2156265
1.797155 0.10476933 -0.36775333 0.750785 0.10282583 0.348925
-0.27262833 0.66768 -0.10706167 -0.283635 0.59580117 0.28747333
-0.3366635 0.23393817 0.34349183 0.178405 0.1166155 -0.076433
0.1445417 0.09808667]
After computing the sentence's average word embedding, run the forward pass, compute the loss, and then backpropagate to update the softmax layer's parameters. The equations to implement are:
\[ z^{(i)} = W \cdot \mathrm{avg}^{(i)} + b \]
\[ a^{(i)} = \mathrm{softmax}(z^{(i)}) \]
\[ \mathcal{L}^{(i)} = - \sum_{k = 0}^{n_y - 1} Yoh^{(i)}_k \, \log(a^{(i)}_k) \]
1 | # GRADED FUNCTION: model |
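A minimal sketch of the training loop implied by the equations above; the graded model() additionally returns predictions on the training set via a predict helper, which is omitted here:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def model(X, Y, word_to_vec_map, learning_rate=0.01, num_iterations=400):
    """Train the softmax layer (W, b) on averaged sentence embeddings."""
    m, n_y, n_h = Y.shape[0], 5, 50               # examples, classes, GloVe dim
    W = np.random.randn(n_y, n_h) / np.sqrt(n_h)  # small random init
    b = np.zeros((n_y,))
    Y_oh = convert_to_one_hot(Y, C=n_y)           # one-hot targets, shape (m, 5)

    for t in range(num_iterations):
        for i in range(m):                        # one sentence at a time (SGD)
            avg = sentence_to_avg(X[i], word_to_vec_map)
            z = np.dot(W, avg) + b                # z = W . avg + b
            a = softmax(z)                        # a = softmax(z)
            cost = -np.sum(Y_oh[i] * np.log(a))   # cross-entropy loss

            # Gradients of the loss with respect to W and b.
            dz = a - Y_oh[i]
            W -= learning_rate * np.outer(dz, avg)
            b -= learning_rate * dz
        if t % 100 == 0:
            print("Epoch: %d --- cost = %f" % (t, cost))
    return W, b
```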
1 | print(X_train.shape) |
(132,)
(132,)
(132, 5)
never talk to me again
<class 'numpy.ndarray'>
(20,)
(20,)
(132, 5)
<class 'numpy.ndarray'>
Run the cell below to train the model and update the softmax layer's parameters.
1 | pred, W, b = model(X_train, Y_train, word_to_vec_map) |
Epoch: 0 --- cost = 1.95204988128
Accuracy: 0.348484848485
Epoch: 100 --- cost = 0.0797181872601
Accuracy: 0.931818181818
Epoch: 200 --- cost = 0.0445636924368
Accuracy: 0.954545454545
Epoch: 300 --- cost = 0.0343226737879
Accuracy: 0.969696969697
[[ 3.]
[ 2.]
[ 3.]
[ 0.]
[ 4.]
[ 0.]
[ 3.]
[ 2.]
[ 3.]
[ 1.]
[ 3.]
[ 3.]
[ 1.]
[ 3.]
[ 2.]
[ 3.]
[ 2.]
[ 3.]
[ 1.]
[ 2.]
[ 3.]
[ 0.]
[ 2.]
[ 2.]
[ 2.]
[ 1.]
[ 4.]
[ 3.]
[ 3.]
[ 4.]
[ 0.]
[ 3.]
[ 4.]
[ 2.]
[ 0.]
[ 3.]
[ 2.]
[ 2.]
[ 3.]
[ 4.]
[ 2.]
[ 2.]
[ 0.]
[ 2.]
[ 3.]
[ 0.]
[ 3.]
[ 2.]
[ 4.]
[ 3.]
[ 0.]
[ 3.]
[ 3.]
[ 3.]
[ 4.]
[ 2.]
[ 1.]
[ 1.]
[ 1.]
[ 2.]
[ 3.]
[ 1.]
[ 0.]
[ 0.]
[ 0.]
[ 3.]
[ 4.]
[ 4.]
[ 2.]
[ 2.]
[ 1.]
[ 2.]
[ 0.]
[ 3.]
[ 2.]
[ 2.]
[ 0.]
[ 3.]
[ 3.]
[ 1.]
[ 2.]
[ 1.]
[ 2.]
[ 2.]
[ 4.]
[ 3.]
[ 3.]
[ 2.]
[ 4.]
[ 0.]
[ 0.]
[ 3.]
[ 3.]
[ 3.]
[ 3.]
[ 2.]
[ 0.]
[ 1.]
[ 2.]
[ 3.]
[ 0.]
[ 2.]
[ 2.]
[ 2.]
[ 3.]
[ 2.]
[ 2.]
[ 2.]
[ 4.]
[ 1.]
[ 1.]
[ 3.]
[ 3.]
[ 4.]
[ 1.]
[ 2.]
[ 1.]
[ 1.]
[ 3.]
[ 1.]
[ 0.]
[ 4.]
[ 0.]
[ 3.]
[ 3.]
[ 4.]
[ 4.]
[ 1.]
[ 4.]
[ 3.]
[ 0.]
[ 2.]]
Now check performance on the training and test sets.
1 | print("Training set:") |
Training set:
Accuracy: 0.977272727273
Test set:
Accuracy: 0.857142857143
Random guessing would give only 20% accuracy, yet with just 127 training examples the model already performs quite well. In the training set the algorithm has seen sentences like "I love you" labeled ❤️, but the word "adore" never appears in the training set. Let's see what it does with "I adore you.".
1 | X_my_sentences = np.array(["i adore you", "i love you", "funny lol", "lets play with a ball", "food is ready", "not feeling happy"]) |
Accuracy: 0.833333333333
i adore you ❤️
i love you ❤️
funny lol 😄
lets play with a ball ⚾
food is ready 🍴
not feeling happy 😄
The result is remarkable: because "adore" and "love" have similar word embeddings, the algorithm generalizes well even to words that never appeared in the training set.
However, the algorithm fails on "not feeling happy" because it ignores word order.
Printing the confusion matrix helps show which classes the model finds hardest. A confusion matrix counts how often examples of each true class are assigned to every predicted class.
1 | print(Y_test.shape) |
(56,)
❤️ ⚾ 😄 😞 🍴
| Actual \ Predicted | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | All |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 6 | 0 | 0 | 1 | 0 | 7 |
| 1 | 0 | 8 | 0 | 0 | 0 | 8 |
| 2 | 2 | 0 | 16 | 0 | 0 | 18 |
| 3 | 1 | 1 | 2 | 12 | 0 | 16 |
| 4 | 0 | 0 | 1 | 0 | 6 | 7 |
| All | 9 | 9 | 19 | 13 | 6 | 56 |
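A confusion matrix like the one above can be produced with pandas, assuming pred_test holds the model's predicted labels for the test set:

```python
import pandas as pd

# Cross-tabulate true labels against predictions; margins=True adds the "All" totals.
print(pd.crosstab(Y_test, pred_test.reshape(-1),
                  rownames=['Actual'], colnames=['Predicted'], margins=True))
```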

(Figure: confusion matrix plot)
From the examples above:
- Even with only 127 training examples, we can build a reasonably good emoji model, mainly thanks to the generalization power of word vectors.
- Emojifier-V1 performs poorly on sentences such as "This movie is not good and not enjoyable" because it cannot capture combinations of words: it simply averages all the word embeddings and pays no attention to word order. A better-performing model is introduced next.
Emojifier-V2: using an LSTM layer in Keras
Next, build an LSTM model that takes a sequence of words as input and can therefore take word order into account. Emojifier-V2 also uses pre-trained word embeddings, but feeds them into an LSTM layer to predict the most likely emoji.
1 | import numpy as np |
Using TensorFlow backend.
Model overview
Below is a schematic of the Emojifier-V2 model:
Keras and mini-batching
In most deep learning frameworks, all sequences in the same mini-batch must have the same length so that the computation can be vectorized. If one sentence has 3 words and another has 4, they need different amounts of computation (3 LSTM steps versus 4), so they cannot be processed together.
The most common solution is zero-padding: choose a maximum sequence length and pad every sentence to that length. For example, with a maximum length of 20, every sentence is padded with zero vectors so that each input has length 20; a sentence like "i love you" becomes \((e_{i}, e_{love}, e_{you}, \vec{0}, \vec{0}, \ldots, \vec{0})\), and any sentence longer than 20 words is truncated.
The embedding layer
In Keras, the embedding matrix is represented as a layer that maps positive integers (word indices) to fixed-size vectors (the word embeddings). The matrix is usually pre-trained. This section shows how to create an Embedding() layer in Keras and initialize it with the 50-dimensional GloVe vectors loaded earlier. Because our training set is small, we keep the word embeddings frozen rather than updating them; the code below nevertheless shows how a Keras embedding layer can be either trained or frozen.
The Embedding() layer takes as input an integer matrix of shape (batch size, max input length), representing the sentences converted to lists of word indices, as shown in the figure below:

In this example max_len = 5, and the final output has shape (2, max_len, 50).
1 | # GRADED FUNCTION: sentences_to_indices |
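A minimal sketch of sentences_to_indices(): because the index array is initialized with zeros, the zero-padding described above comes for free.

```python
import numpy as np

def sentences_to_indices(X, word_to_index, max_len):
    """Convert an array of sentences into an array of word indices, padded with zeros."""
    m = X.shape[0]
    X_indices = np.zeros((m, max_len))            # unused positions stay 0 (padding)
    for i in range(m):
        words = X[i].lower().split()
        for j, w in enumerate(words[:max_len]):   # truncate sentences longer than max_len
            X_indices[i, j] = word_to_index[w]
    return X_indices
```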
Test sentences_to_indices() on a few sentences:
1 | X1 = np.array(["funny lol", "lets play baseball", "food is ready for you"]) |
X1 = ['funny lol' 'lets play baseball' 'food is ready for you']
X1_indices = [[ 155345. 225122. 0. 0. 0.]
[ 220930. 286375. 69714. 0. 0.]
[ 151204. 192973. 302254. 151349. 394475.]]
Next, build the Embedding() layer in Keras using the pre-trained word vectors. Once this layer is built, the output of sentences_to_indices() is fed into it, and the layer returns the word embeddings for the sentence.
Implementing pretrained_embedding_layer() involves the following steps:
- Initialize the embedding matrix with zeros of the correct shape
- Fill the embedding matrix with the word vectors extracted from word_to_vec_map
- Define the Keras embedding layer with Embedding(), making sure the layer is not trainable; if trainable = True were set, the optimizer would modify the word embeddings
- Set the layer's weights to the embedding matrix
1 | # GRADED FUNCTION: pretrained_embedding_layer |
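A sketch of pretrained_embedding_layer() following the steps above, using the Keras 2 API of this notebook. The vocabulary size is len(word_to_index) + 1, leaving room for index 0 (used for padding), which is also consistent with the 20,000,050 embedding parameters in the model summary shown later.

```python
import numpy as np
from keras.layers import Embedding

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """Create a non-trainable Keras Embedding layer loaded with GloVe vectors."""
    vocab_len = len(word_to_index) + 1                # +1: index 0 is reserved for padding
    emb_dim = word_to_vec_map["cucumber"].shape[0]    # 50 for glove.6B.50d

    # Fill the embedding matrix row by row with the GloVe vectors.
    emb_matrix = np.zeros((vocab_len, emb_dim))
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Non-trainable layer: the optimizer will not modify the embeddings.
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)
    embedding_layer.build((None,))                    # build before setting the weights
    embedding_layer.set_weights([emb_matrix])
    return embedding_layer
```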
1 | embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index) |
weights[0][1][3] = -0.3403
Building the Emojifier-V2 model
Use the embedding layer built above and feed its output into an LSTM network.

The model's input is a batch of sentences represented as index sequences, of shape (m, max_len), and its output is a softmax probability vector of shape (m, C = 5). The Keras functions needed are Input(shape = ..., dtype = '...'), LSTM(), Dropout(), Dense(), and Activation().
1 | # GRADED FUNCTION: Emojify_V2 |
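A sketch of an Emojify_V2 architecture consistent with the model summary printed below (two LSTM(128) layers, the second without return_sequences, followed by Dense(5) and a softmax). The 0.5 dropout rate is an assumption, since dropout layers contribute no parameters to the summary.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dropout, Dense, Activation

def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    """LSTM emojifier built on top of the frozen GloVe embedding layer."""
    sentence_indices = Input(shape=input_shape, dtype='int32')   # (max_len,) word indices

    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    embeddings = embedding_layer(sentence_indices)               # (max_len, 50) embeddings

    X = LSTM(128, return_sequences=True)(embeddings)   # full sequence of hidden states
    X = Dropout(0.5)(X)                                 # dropout rate assumed
    X = LSTM(128, return_sequences=False)(X)            # only the last hidden state
    X = Dropout(0.5)(X)
    X = Dense(5)(X)                                      # 5 emoji classes
    X = Activation('softmax')(X)

    return Model(inputs=sentence_indices, outputs=X)
```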
1 | model = Emojify_V2((maxLen,), word_to_vec_map, word_to_index) |
WARNING:tensorflow:From /home/seisinv/anaconda3/envs/fwi_ai/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:1190: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 10) 0
_________________________________________________________________
embedding_2 (Embedding) (None, 10, 50) 20000050
_________________________________________________________________
lstm_1 (LSTM) (None, 10, 128) 91648
_________________________________________________________________
dropout_1 (Dropout) (None, 10, 128) 0
_________________________________________________________________
lstm_2 (LSTM) (None, 128) 131584
_________________________________________________________________
dropout_2 (Dropout) (None, 128) 0
_________________________________________________________________
dense_1 (Dense) (None, 5) 645
_________________________________________________________________
activation_1 (Activation) (None, 5) 0
=================================================================
Total params: 20,223,927
Trainable params: 20,223,927
Non-trainable params: 0
_________________________________________________________________
1 | model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) |
WARNING:tensorflow:From /home/seisinv/anaconda3/envs/fwi_ai/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:1297: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
1 | X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen) |
1 | model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 32, shuffle=True) |
Epoch 1/50
132/132 [==============================] - 5s - loss: 1.6086 - acc: 0.1667
Epoch 2/50
132/132 [==============================] - 1s - loss: 1.5876 - acc: 0.3333
Epoch 3/50
132/132 [==============================] - 1s - loss: 1.5725 - acc: 0.2652
Epoch 4/50
132/132 [==============================] - 1s - loss: 1.5532 - acc: 0.3485
Epoch 5/50
132/132 [==============================] - 1s - loss: 1.5397 - acc: 0.3030
Epoch 6/50
132/132 [==============================] - 1s - loss: 1.5155 - acc: 0.3712
Epoch 7/50
132/132 [==============================] - 1s - loss: 1.5170 - acc: 0.3409
Epoch 8/50
132/132 [==============================] - 1s - loss: 1.4453 - acc: 0.4848
Epoch 9/50
132/132 [==============================] - 1s - loss: 1.4056 - acc: 0.5152
Epoch 10/50
132/132 [==============================] - 1s - loss: 1.3391 - acc: 0.6515
Epoch 11/50
132/132 [==============================] - 1s - loss: 1.2934 - acc: 0.6439
Epoch 12/50
132/132 [==============================] - 1s - loss: 1.2247 - acc: 0.7273
Epoch 13/50
132/132 [==============================] - 1s - loss: 1.1986 - acc: 0.7727
Epoch 14/50
132/132 [==============================] - 1s - loss: 1.1831 - acc: 0.7500
Epoch 15/50
132/132 [==============================] - 1s - loss: 1.1318 - acc: 0.8258
Epoch 16/50
132/132 [==============================] - 1s - loss: 1.1317 - acc: 0.7955
Epoch 17/50
132/132 [==============================] - 1s - loss: 1.0946 - acc: 0.8182
Epoch 18/50
132/132 [==============================] - 1s - loss: 1.0895 - acc: 0.8258
Epoch 19/50
132/132 [==============================] - 1s - loss: 1.0386 - acc: 0.8864
Epoch 20/50
132/132 [==============================] - 1s - loss: 1.0398 - acc: 0.8561
Epoch 21/50
132/132 [==============================] - 2s - loss: 1.0059 - acc: 0.9091
Epoch 22/50
132/132 [==============================] - 2s - loss: 1.0045 - acc: 0.9015
Epoch 23/50
132/132 [==============================] - 2s - loss: 0.9920 - acc: 0.9394
Epoch 24/50
132/132 [==============================] - 2s - loss: 0.9754 - acc: 0.9394
Epoch 25/50
132/132 [==============================] - 2s - loss: 0.9598 - acc: 0.9545
Epoch 26/50
132/132 [==============================] - 2s - loss: 0.9453 - acc: 0.9697
Epoch 27/50
132/132 [==============================] - 2s - loss: 0.9460 - acc: 0.9621
Epoch 28/50
132/132 [==============================] - 2s - loss: 0.9378 - acc: 0.9697
Epoch 29/50
132/132 [==============================] - 2s - loss: 0.9406 - acc: 0.9697
Epoch 30/50
132/132 [==============================] - 2s - loss: 0.9395 - acc: 0.9697
Epoch 31/50
132/132 [==============================] - 2s - loss: 0.9388 - acc: 0.9697
Epoch 32/50
132/132 [==============================] - 2s - loss: 0.9382 - acc: 0.9621
Epoch 33/50
132/132 [==============================] - 2s - loss: 0.9316 - acc: 0.9773
Epoch 34/50
132/132 [==============================] - 2s - loss: 0.9335 - acc: 0.9697
Epoch 35/50
132/132 [==============================] - 2s - loss: 0.9312 - acc: 0.9773
Epoch 36/50
132/132 [==============================] - 2s - loss: 0.9457 - acc: 0.9621
Epoch 37/50
132/132 [==============================] - 2s - loss: 0.9287 - acc: 0.9773
Epoch 38/50
132/132 [==============================] - 2s - loss: 0.9353 - acc: 0.9697
Epoch 39/50
132/132 [==============================] - 2s - loss: 0.9328 - acc: 0.9697
Epoch 40/50
132/132 [==============================] - 2s - loss: 0.9281 - acc: 0.9773
Epoch 41/50
132/132 [==============================] - 2s - loss: 0.9314 - acc: 0.9773
Epoch 42/50
132/132 [==============================] - 2s - loss: 0.9325 - acc: 0.9697
Epoch 43/50
132/132 [==============================] - 1s - loss: 0.9356 - acc: 0.9697
Epoch 44/50
132/132 [==============================] - 1s - loss: 0.9281 - acc: 0.9773
Epoch 45/50
132/132 [==============================] - 1s - loss: 0.9285 - acc: 0.9773
Epoch 46/50
132/132 [==============================] - 1s - loss: 0.9280 - acc: 0.9773
Epoch 47/50
132/132 [==============================] - 1s - loss: 0.9271 - acc: 0.9773
Epoch 48/50
132/132 [==============================] - 1s - loss: 0.9292 - acc: 0.9773
Epoch 49/50
132/132 [==============================] - 1s - loss: 0.9334 - acc: 0.9697
Epoch 50/50
132/132 [==============================] - 1s - loss: 0.9632 - acc: 0.9242
<keras.callbacks.History at 0x7faab0bbad30>
Evaluating on the test set
1 | X_test_indices = sentences_to_indices(X_test, word_to_index, max_len = maxLen) |
32/56 [================>.............] - ETA: 0s
Test accuracy = 0.857142848628
1 | # This code allows you to see the mislabelled examples |
Expected emoji:😄 prediction: he got a very nice raise 😞
Expected emoji:😄 prediction: she got me a nice present 😞
Expected emoji:😄 prediction: Stop making this joke ha ha ha 😞
Expected emoji:😄 prediction: you brighten my day ❤️
Expected emoji:😄 prediction: will you be my valentine ❤️
Expected emoji:🍴 prediction: I am hungry😞
Expected emoji:😄 prediction: What you did was awesome 😞
Expected emoji:😞 prediction: go away ⚾
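A sketch of how such a mislabelled-example listing can be produced, comparing the argmax of the predicted probabilities with the true labels (label_to_emoji is again the assumed course helper):

```python
import numpy as np

X_test_indices = sentences_to_indices(X_test, word_to_index, maxLen)
pred = model.predict(X_test_indices)        # (56, 5) softmax probabilities

for i in range(len(X_test)):
    num = np.argmax(pred[i])                # predicted class 0-4
    if num != Y_test[i]:                    # print only the mislabelled examples
        print('Expected emoji:' + label_to_emoji(Y_test[i]) +
              ' prediction: ' + X_test[i] + ' ' + label_to_emoji(num))
```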
1 | # Change the sentence below to see your prediction. Make sure all the words are in the Glove embeddings. |
not feeling happy 😞
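A sketch of predicting the emoji for a custom sentence with the trained model (every word of the sentence must be in the GloVe vocabulary):

```python
import numpy as np

x_test = np.array(['not feeling happy'])
x_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
print(x_test[0] + ' ' + label_to_emoji(np.argmax(model.predict(x_test_indices))))
```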
As the example above shows, the LSTM model partially fixes the problems of the first version, but it still struggles with sentences like "not happy", because the training set is small and contains few negative examples.
Summary
- In NLP tasks where the training set is small, word embeddings can dramatically improve performance. They allow your model to work well even on examples containing words that never appeared in the training set.
- When training sequence models in Keras, keep the following in mind:
-- To use mini-batches, pad the sequences with zeros so that all examples in a mini-batch have the same length.
-- The Embedding() layer can be initialized with pre-trained vectors and optionally fine-tuned further on your dataset; however, if the labeled training set is small, retraining the embeddings is usually not worthwhile.
-- The LSTM() layer has a return_sequences parameter that controls whether it returns every hidden state or only the last one.
-- Dropout() can be applied right after LSTM layers to regularize the network.
References
- Andrew Ng, Deep Learning Specialization on Coursera
- Dialogue bots (对话机器人)