Character-Level Language Models with Recurrent Neural Networks

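This article shows how to build character-level language models with recurrent neural networks (RNNs). Two text-generation models are built, one in plain Python and one in Keras: the first generates dinosaur names, the second learns to write poetry in the style of Shakespeare. The first model uses a basic RNN cell and the second uses LSTM cells. Together, the two applications illustrate the potential of recurrent neural networks in natural language processing.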

import numpy as np
from utils import *
import random

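Problem Statement

Dataset and Preprocessing

Run the cell below to read the dataset of dinosaur names, build the list of unique characters (such as a-z), and compute the size of the dataset and of the vocabulary.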

with open('dinos.txt', 'r') as f:
    data = f.read()
data = data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))

There are 19909 total characters and 27 unique characters in your data.

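The vocabulary consists of a-z (26 characters) plus the newline character "\n", which plays a role similar to an EOS (end-of-sentence) token, except that here it marks the end of a dinosaur name. The cell below builds a Python dictionary (a hash table) that maps each character to an index from 0 to 26, and a second dictionary that maps each index back to its character. These two dictionaries make it possible to look up which character corresponds to a given index in the probability distribution output by the softmax layer.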

char_to_ix = { ch:i for i,ch in enumerate(sorted(chars)) }
ix_to_char = { i:ch for i,ch in enumerate(sorted(chars)) }
print(ix_to_char)

{0: '\n', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z'}
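
As a quick illustration of how the two dictionaries are used, the sketch below round-trips a name through its indices. It rebuilds the vocabulary by hand (assuming the 27-character set printed above) so that it runs without dinos.txt. The helper names encode and decode are hypothetical and do not come from utils.

```python
import string

# Assumed vocabulary: newline plus a-z, matching the 27 characters printed above.
chars = sorted(["\n"] + list(string.ascii_lowercase))
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

def encode(name):
    """Map a string to a list of vocabulary indices (hypothetical helper)."""
    return [char_to_ix[ch] for ch in name]

def decode(indices):
    """Map a list of vocabulary indices back to a string (hypothetical helper)."""
    return "".join(ix_to_char[i] for i in indices)

indices = encode("trex\n")
print(indices)
print(repr(decode(indices)))
```

Because "\n" sorts before the letters, it always receives index 0, which is why the sampling code later can detect the end of a name by looking up char_to_ix['\n'].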

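Model Overview

The model has the following structure:

Initialize the parameters
Run the optimization loop
- Forward propagation to compute the loss
- Backward propagation to compute the gradients of the loss
- Clip the gradients to avoid exploding gradients
- Update the parameters with gradient descent
Return the learned parameters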

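Figure 1: Recurrent neural network

At each time step, the RNN predicts the next character from the preceding characters. The input \(X = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})\) is a sequence of characters from the training set, and \(Y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})\) is the same sequence shifted by one position, so that \(y^{\langle t \rangle} = x^{\langle t+1 \rangle}\) at every time step \(t\).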
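
To make the relation \(y^{\langle t \rangle} = x^{\langle t+1 \rangle}\) concrete, the following sketch builds the input and target sequences for a single name. The name "rex" is only an illustrative example. The leading None stands for the zero-vector input at the first time step and the trailing newline marks the end of the name, mirroring the training code later in this article.

```python
name = "rex"  # illustrative example, not taken from the dataset

# Inputs: a placeholder for the zero vector, then the characters of the name.
x_seq = [None] + list(name)
# Targets: the same characters shifted left by one, ending with a newline.
y_seq = list(name) + ["\n"]

for t, (x_t, y_t) in enumerate(zip(x_seq, y_seq), start=1):
    print(t, repr(x_t), "->", repr(y_t))
```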

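Building Blocks of the Model

This section builds two key components of the full model:
- Gradient clipping: to avoid exploding gradients
- Sampling: a technique for generating characters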

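Clipping Gradients in the Optimization Loop

This section implements the clip function that is called inside the optimization loop. The loop consists of forward propagation, cost computation, backward propagation, and a parameter update. The gradients are clipped just before the parameter update so that they cannot explode.

A simple element-wise scheme is used here: every element of each gradient array is clipped to a given range [-N, N].

Figure 2: Gradient descent with and without gradient clipping when the network runs into exploding gradients.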
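
A small sketch of element-wise clipping with np.clip is shown below. For contrast it also shows clipping by global norm, which is a different technique that this article does not use. The helper clip_by_norm is hypothetical and is included only to highlight that element-wise clipping can change the direction of the gradient, while norm clipping only rescales it.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm (hypothetical helper)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([30.0, -4.0, 0.5])

elementwise = np.clip(g, -10, 10)   # each entry forced into [-10, 10]
by_norm = clip_by_norm(g, 10)       # whole vector rescaled, direction kept

print(elementwise)
print(by_norm, np.linalg.norm(by_norm))
```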

### GRADED FUNCTION: clip

def clip(gradients, maxValue):
    '''
    Clips the gradients' values between minimum and maximum.
    
    Arguments:
    gradients -- a dictionary containing the gradients "dWaa", "dWax", "dWya", "db", "dby"
    maxValue -- everything above this number is set to this number, and everything less than -maxValue is set to -maxValue
    
    Returns: 
    gradients -- a dictionary with the clipped gradients.
    '''
    
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']
   
    ### START CODE HERE ###
    # clip to mitigate exploding gradients, loop over [dWax, dWaa, dWya, db, dby]. (≈2 lines)
    for gradient in [dWax, dWaa, dWya, db, dby]:
        gradient.clip(-maxValue, maxValue, out=gradient)
    ### END CODE HERE ###
    
    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    
    return gradients

np.random.seed(3)
dWax = np.random.randn(5,3)*10
dWaa = np.random.randn(5,5)*10
dWya = np.random.randn(2,5)*10
db = np.random.randn(5,1)*10
dby = np.random.randn(2,1)*10
gradients = {"dWax": dWax, "dWaa": dWaa, "dWya": dWya, "db": db, "dby": dby}
gradients = clip(gradients, 10)
print("gradients[\"dWaa\"][1][2] =", gradients["dWaa"][1][2])
print("gradients[\"dWax\"][3][1] =", gradients["dWax"][3][1])
print("gradients[\"dWya\"][1][2] =", gradients["dWya"][1][2])
print("gradients[\"db\"][4] =", gradients["db"][4])
print("gradients[\"dby\"][1] =", gradients["dby"][1])

gradients["dWaa"][1][2] = 10.0
gradients["dWax"][3][1] = -10.0
gradients["dWya"][1][2] = 0.29713815361
gradients["db"][4] = [ 10.]
gradients["dby"][1] = [ 8.45833407]

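Sampling

Suppose the model has already been trained and you want to use it to generate new text. The generation process is illustrated in the figure below:

Figure 3: The model is assumed to be already trained. The input at the first time step is \(x^{\langle 1\rangle} = \vec{0}\), and the network then samples one character at a time.

Procedure:

Step 1: Feed the network a first "dummy" input \(x^{\langle 1 \rangle} = \vec{0}\) (the zero vector). This is the default input before any character has been generated. Also set \(a^{\langle 0 \rangle} = \vec{0}\).
Step 2: Run one step of forward propagation to obtain \(a^{\langle 1 \rangle}\) and \(\hat{y}^{\langle 1 \rangle}\). The equations are: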

\[ a^{\langle t+1 \rangle} = \tanh(W_{ax} x^{\langle t+1 \rangle } + W_{aa} a^{\langle t \rangle } + b)\tag{1}\]

\[ z^{\langle t + 1 \rangle } = W_{ya} a^{\langle t + 1 \rangle } + b_y \tag{2}\]

\[ \hat{y}^{\langle t+1 \rangle } = softmax(z^{\langle t + 1 \rangle })\tag{3}\]

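Note that \(\hat{y}^{\langle t+1 \rangle }\) is a (softmax) probability vector: its entries lie between 0 and 1 and sum to 1. \(\hat{y}^{\langle t+1 \rangle}_i\) is the probability that the character with index "i" is the next character.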
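
The three equations can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the hidden size of 4, the random weights, and the zero biases are chosen only for the example, and softmax is defined locally because the version imported from utils is not shown in this article.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a column vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum(axis=0)

rng = np.random.default_rng(0)
n_a, vocab_size = 4, 27   # small hidden size chosen only for the example

Wax = rng.standard_normal((n_a, vocab_size))
Waa = rng.standard_normal((n_a, n_a))
Wya = rng.standard_normal((vocab_size, n_a))
b = np.zeros((n_a, 1))
by = np.zeros((vocab_size, 1))

x = np.zeros((vocab_size, 1))   # first input: the zero vector
a_prev = np.zeros((n_a, 1))     # initial hidden state: the zero vector

a = np.tanh(Wax @ x + Waa @ a_prev + b)   # equation (1)
z = Wya @ a + by                          # equation (2)
y_hat = softmax(z)                        # equation (3)

print(y_hat.shape, float(y_hat.sum()))
```

With zero inputs, a zero initial state, and zero biases, the first hidden state is zero and the first prediction is uniform over the 27 characters. A trained model breaks this symmetry through its learned biases.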

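Step 3: Sample: pick the index of the next character according to the probability distribution defined by \(\hat{y}^{\langle t+1 \rangle }\). For example, if \(\hat{y}^{\langle t+1 \rangle }_i = 0.16\), index "i" is picked with probability 16%. This can be implemented with np.random.choice.

Here is an example of how to use np.random.choice():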

np.random.seed(0)
p = np.array([0.1, 0.0, 0.7, 0.2])
index = np.random.choice([0, 1, 2, 3], p = p.ravel())

This means the index is picked according to the distribution \(P(index = 0) = 0.1, P(index = 1) = 0.0, P(index = 2) = 0.7, P(index = 3) = 0.2\).
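
One way to convince yourself that np.random.choice follows the given distribution is to draw many samples and compare the empirical frequencies with p. The sample count of 10000 is an arbitrary choice for this sketch.

```python
import numpy as np

np.random.seed(0)
p = np.array([0.1, 0.0, 0.7, 0.2])

# Draw many indices at once and compare empirical frequencies with p.
draws = np.random.choice([0, 1, 2, 3], size=10000, p=p)
freq = np.bincount(draws, minlength=4) / draws.size
print(freq)
```

Index 1 has probability 0 and is never drawn, while the other frequencies should land close to 0.1, 0.7 and 0.2.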

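Step 4: The last step in sample() is to overwrite the variable x, which currently holds \(x^{\langle t \rangle }\), with \(x^{\langle t + 1 \rangle }\). \(x^{\langle t + 1 \rangle }\) is the one-hot encoding of the character that was just sampled. Then forward-propagate \(x^{\langle t + 1 \rangle }\) again as in Step 2, and repeat until a "\n" character is sampled, which indicates the end of the dinosaur name.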

# GRADED FUNCTION: sample

def sample(parameters, char_to_ix, seed):
    """
    Sample a sequence of characters according to a sequence of probability distributions output of the RNN

    Arguments:
    parameters -- python dictionary containing the parameters Waa, Wax, Wya, by, and b. 
    char_to_ix -- python dictionary mapping each character to an index.
    seed -- used for grading purposes. Do not worry about it.

    Returns:
    indices -- a list of length n containing the indices of the sampled characters.
    """
    
    # Retrieve parameters and relevant shapes from "parameters" dictionary
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    vocab_size = by.shape[0]
    n_a = Waa.shape[1]
    
    ### START CODE HERE ###
    # Step 1: Create the one-hot vector x for the first character (initializing the sequence generation). (≈1 line)
    x = np.zeros((vocab_size, 1))
    # Step 1': Initialize a_prev as zeros (≈1 line)
    a_prev = np.zeros((n_a, 1))
    
    # Create an empty list of indices, this is the list which will contain the list of indices of the characters to generate (≈1 line)
    indices = []
    
    # Idx is a flag to detect a newline character, we initialize it to -1
    idx = -1 
    
    # Loop over time-steps t. At each time-step, sample a character from a probability distribution and append 
    # its index to "indices". We'll stop if we reach 50 characters (which should be very unlikely with a well 
    # trained model), which helps debugging and prevents entering an infinite loop. 
    counter = 0
    newline_character = char_to_ix['\n']
    
    while (idx != newline_character and counter != 50):
        
        # Step 2: Forward propagate x using the equations (1), (2) and (3)
        a = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)
        z = np.dot(Wya, a) + by
        y = softmax(z)
        
        # for grading purposes
        np.random.seed(counter+seed) 
        
        # Step 3: Sample the index of a character within the vocabulary from the probability distribution y
        idx = np.random.choice(list(range(vocab_size)), p = y.ravel())

        # Append the index to "indices"
        indices.append(idx)
        
        # Step 4: Overwrite the input character as the one corresponding to the sampled index.
        x = np.zeros((vocab_size, 1))
        x[idx] = 1
        
        # Update "a_prev" to be "a"
        a_prev = a
        
        # for grading purposes
        seed += 1
        counter +=1
        
    ### END CODE HERE ###

    if (counter == 50):
        indices.append(char_to_ix['\n'])
    
    return indices

np.random.seed(2)
_, n_a = 20, 100
Wax, Waa, Wya = np.random.randn(n_a, vocab_size), np.random.randn(n_a, n_a), np.random.randn(vocab_size, n_a)
b, by = np.random.randn(n_a, 1), np.random.randn(vocab_size, 1)
parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b, "by": by}


indices = sample(parameters, char_to_ix, 0)
print("Sampling:")
print("list of sampled indices:", indices)
print("list of sampled characters:", [ix_to_char[i] for i in indices])

Sampling:
list of sampled indices: [12, 17, 24, 14, 13, 9, 10, 22, 24, 6, 13, 11, 12, 6, 21, 15, 21, 14, 3, 2, 1, 21, 18, 24, 7, 25, 6, 25, 18, 10, 16, 2, 3, 8, 15, 12, 11, 7, 1, 12, 10, 2, 7, 7, 11, 5, 6, 12, 7, 11, 0]
list of sampled characters: ['l', 'q', 'x', 'n', 'm', 'i', 'j', 'v', 'x', 'f', 'm', 'k', 'l', 'f', 'u', 'o', 'u', 'n', 'c', 'b', 'a', 'u', 'r', 'x', 'g', 'y', 'f', 'y', 'r', 'j', 'p', 'b', 'c', 'h', 'o', 'l', 'k', 'g', 'a', 'l', 'j', 'b', 'g', 'g', 'k', 'e', 'f', 'l', 'g', 'k', '\n']

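Building the Language Model

Gradient Descent

This section implements one step of stochastic gradient descent, including gradient clipping. Each step uses a single training example, which is what makes it stochastic gradient descent. The usual optimization loop for an RNN is:
- Forward propagate through the RNN to compute the loss
- Backward propagate through time to compute the gradients of the loss with respect to the parameters
- Clip the gradients if necessary
- Update the parameters using gradient descent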
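
The loss in the first step is the cross-entropy summed over the time steps of one example. The sketch below computes it from probability vectors that are assumed to be already available. The function name sequence_loss is hypothetical, and this is not the rnn_forward implementation from utils, which is not shown in this article.

```python
import numpy as np

def sequence_loss(y_hats, Y):
    """Sum of -log(probability assigned to the correct next character).

    y_hats -- list of (vocab_size, 1) probability vectors, one per time step
    Y      -- list of target indices of the same length
    """
    return -sum(np.log(y_hat[y, 0]) for y_hat, y in zip(y_hats, Y))

vocab_size = 27
uniform = np.full((vocab_size, 1), 1.0 / vocab_size)

# A model that predicts uniformly pays log(27) per character.
loss = sequence_loss([uniform] * 3, [18, 5, 24])
print(loss, 3 * np.log(27))
```

The uniform case is a useful baseline: a trained model should reach a lower per-character loss than log(27), which is about 3.3.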

The following functions are provided:

def rnn_forward(X, Y, a_prev, parameters):
    """ Performs the forward propagation through the RNN and computes the cross-entropy loss.
    It returns the loss' value as well as a "cache" storing values to be used in the backpropagation."""
    ....
    return loss, cache
    
def rnn_backward(X, Y, parameters, cache):
    """ Performs the backward propagation through time to compute the gradients of the loss with respect
    to the parameters. It returns also all the hidden states."""
    ...
    return gradients, a

def update_parameters(parameters, gradients, learning_rate):
    """ Updates parameters using the Gradient Descent Update Rule."""
    ...
    return parameters
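
The body of update_parameters is elided above. As a rough idea of what a plain gradient descent update could look like, here is a minimal sketch. The function name update_parameters_sketch is hypothetical, and the assumption that every gradient key is the parameter key prefixed with "d" is taken from the dictionaries used elsewhere in this article, not from the utils source.

```python
import numpy as np

def update_parameters_sketch(parameters, gradients, learning_rate):
    """Plain gradient descent: theta <- theta - learning_rate * d_theta.

    Assumes gradient keys are the parameter keys prefixed with "d".
    The dictionary passed in is modified, so callers holding a reference
    to it see the updated values.
    """
    for name in parameters:
        parameters[name] = parameters[name] - learning_rate * gradients["d" + name]
    return parameters

params = {"Waa": np.ones((2, 2)), "b": np.zeros((2, 1))}
grads = {"dWaa": np.full((2, 2), 0.5), "db": np.ones((2, 1))}
params = update_parameters_sketch(params, grads, learning_rate=0.1)
print(params["Waa"][0, 0], params["b"][0, 0])
```

Modifying the dictionary that is passed in matters here because optimize, shown next, does not return the parameters and relies on the caller's dictionary being updated.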

# GRADED FUNCTION: optimize

def optimize(X, Y, a_prev, parameters, learning_rate = 0.01):
    """
    Execute one step of the optimization to train the model.
    
    Arguments:
    X -- list of integers, where each integer is a number that maps to a character in the vocabulary.
    Y -- list of integers, exactly the same as X but shifted one index to the left.
    a_prev -- previous hidden state.
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        b --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    learning_rate -- learning rate for the model.
    
    Returns:
    loss -- value of the loss function (cross-entropy)
    gradients -- python dictionary containing:
                        dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
                        dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
                        dWya -- Gradients of hidden-to-output weights, of shape (n_y, n_a)
                        db -- Gradients of bias vector, of shape (n_a, 1)
                        dby -- Gradients of output bias vector, of shape (n_y, 1)
    a[len(X)-1] -- the last hidden state, of shape (n_a, 1)
    """
    
    ### START CODE HERE ###
    
    # Forward propagate through time (≈1 line)
    loss, cache = rnn_forward(X, Y, a_prev, parameters)
    
    # Backpropagate through time (≈1 line)
    gradients, a = rnn_backward(X, Y, parameters, cache)
    
    # Clip your gradients between -5 (min) and 5 (max) (≈1 line)
    gradients = clip(gradients, 5)
    
    # Update parameters (≈1 line)
    parameters = update_parameters(parameters, gradients, learning_rate)
    
    ### END CODE HERE ###
    
    return loss, gradients, a[len(X)-1]

np.random.seed(1)
vocab_size, n_a = 27, 100
a_prev = np.random.randn(n_a, 1)
Wax, Waa, Wya = np.random.randn(n_a, vocab_size), np.random.randn(n_a, n_a), np.random.randn(vocab_size, n_a)
b, by = np.random.randn(n_a, 1), np.random.randn(vocab_size, 1)
parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b, "by": by}
X = [12,3,5,11,22,3]
Y = [4,14,11,22,25, 26]

loss, gradients, a_last = optimize(X, Y, a_prev, parameters, learning_rate = 0.01)
print("Loss =", loss)
print("gradients[\"dWaa\"][1][2] =", gradients["dWaa"][1][2])
print("np.argmax(gradients[\"dWax\"]) =", np.argmax(gradients["dWax"]))
print("gradients[\"dWya\"][1][2] =", gradients["dWya"][1][2])
print("gradients[\"db\"][4] =", gradients["db"][4])
print("gradients[\"dby\"][1] =", gradients["dby"][1])
print("a_last[4] =", a_last[4])

Loss = 126.503975722
gradients["dWaa"][1][2] = 0.194709315347
np.argmax(gradients["dWax"]) = 93
gradients["dWya"][1][2] = -0.007773876032
gradients["db"][4] = [-0.06809825]
gradients["dby"][1] = [ 0.01538192]
a_last[4] = [-1.]

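Training the Model

Given the dataset of dinosaur names, each line of the dataset is used as one training example. Every 2000 iterations of stochastic gradient descent, the code below samples 7 names to show how the model is doing. Remember to shuffle the dataset before training so that stochastic gradient descent visits the examples in random order.

Each example is a dinosaur name (a string). To build an example (X, Y), use the following code: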

index = j % len(examples)
X = [None] + [char_to_ix[ch] for ch in examples[index]] 
Y = X[1:] + [char_to_ix["\n"]]

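Note: index = j % len(examples), where j = 0, ..., num_iterations - 1, guarantees that examples[index] is always a valid access (index is smaller than len(examples)). The first entry of X is None, which rnn_forward() interprets as \(x^{\langle 0 \rangle} = \vec{0}\). Y is then the same as X shifted one character to the left, with the character "\n" appended to mark the end of the dinosaur name.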
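
rnn_forward is provided by utils and is not shown here. The sketch below only illustrates one plausible way the index list X could be turned into one-hot column vectors, with None mapped to the zero vector. The function name one_hot_inputs is hypothetical.

```python
import numpy as np

def one_hot_inputs(X, vocab_size=27):
    """Return a list of (vocab_size, 1) column vectors, one per time step.

    A None entry becomes the all-zero vector (hypothetical helper; the real
    rnn_forward in utils may be organized differently).
    """
    vectors = []
    for idx in X:
        x_t = np.zeros((vocab_size, 1))
        if idx is not None:
            x_t[idx] = 1
        vectors.append(x_t)
    return vectors

X = [None, 18, 5, 24]   # None, 'r', 'e', 'x'
Y = X[1:] + [0]         # 'r', 'e', 'x', '\n'
xs = one_hot_inputs(X)
print([int(v.sum()) for v in xs])
```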

# GRADED FUNCTION: model

def model(data, ix_to_char, char_to_ix, num_iterations = 35000, n_a = 50, dino_names = 7, vocab_size = 27):
    """
    Trains the model and generates dinosaur names. 
    
    Arguments:
    data -- text corpus
    ix_to_char -- dictionary that maps the index to a character
    char_to_ix -- dictionary that maps a character to an index
    num_iterations -- number of iterations to train the model for
    n_a -- number of units of the RNN cell
    dino_names -- number of dinosaur names you want to sample at each iteration. 
    vocab_size -- number of unique characters found in the text, size of the vocabulary
    
    Returns:
    parameters -- learned parameters
    """
    
    # Retrieve n_x and n_y from vocab_size
    n_x, n_y = vocab_size, vocab_size
    
    # Initialize parameters
    parameters = initialize_parameters(n_a, n_x, n_y)
    
    # Initialize loss (this is required because we want to smooth our loss, don't worry about it)
    loss = get_initial_loss(vocab_size, dino_names)
    
    # Build list of all dinosaur names (training examples).
    with open("dinos.txt") as f:
        examples = f.readlines()
    examples = [x.lower().strip() for x in examples]
    
    # Shuffle list of all dinosaur names
    np.random.seed(0)
    np.random.shuffle(examples)
    
    # Initialize the hidden state of the RNN
    a_prev = np.zeros((n_a, 1))
    
    # Optimization loop
    for j in range(num_iterations):
        
        ### START CODE HERE ###
        
        # Use the hint above to define one training example (X,Y) (≈ 2 lines)
        index = j % len(examples)
        X = [None] + [char_to_ix[ch] for ch in examples[index]]
        Y = X[1:] + [char_to_ix["\n"]]
        
        # Perform one optimization step: Forward-prop -> Backward-prop -> Clip -> Update parameters
        # Choose a learning rate of 0.01
        curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, learning_rate = 0.01)
        
        ### END CODE HERE ###
        
        # Use a latency trick to keep the loss smooth. It happens here to accelerate the training.
        loss = smooth(loss, curr_loss)

        # Every 2000 Iteration, generate "n" characters thanks to sample() to check if the model is learning properly
        if j % 2000 == 0:
            
            print('Iteration: %d, Loss: %f' % (j, loss) + '\n')
            
            # The number of dinosaur names to print
            seed = 0
            for name in range(dino_names):
                
                # Sample indices and print them
                sampled_indices = sample(parameters, char_to_ix, seed)
                print_sample(sampled_indices, ix_to_char)
                
                seed += 1  # To get the same result for grading purposes, increment the seed by one.
      
            print('\n')
        
    return parameters

parameters = model(data, ix_to_char, char_to_ix)

Iteration: 0, Loss: 23.087336

Nkzxwtdmfqoeyhsqwasjkjvu
Kneb
Kzxwtdmfqoeyhsqwasjkjvu
Neb
Zxwtdmfqoeyhsqwasjkjvu
Eb
Xwtdmfqoeyhsqwasjkjvu


Iteration: 2000, Loss: 27.884160

Liusskeomnolxeros
Hmdaairus
Hytroligoraurus
Lecalosapaus
Xusicikoraurus
Abalpsamantisaurus
Tpraneronxeros


Iteration: 4000, Loss: 25.901815

Mivrosaurus
Inee
Ivtroplisaurus
Mbaaisaurus
Wusichisaurus
Cabaselachus
Toraperlethosdarenitochusthiamamumamaon


Iteration: 6000, Loss: 24.608779

Onwusceomosaurus
Lieeaerosaurus
Lxussaurus
Oma
Xusteonosaurus
Eeahosaurus
Toreonosaurus


Iteration: 8000, Loss: 24.070350

Onxusichepriuon
Kilabersaurus
Lutrodon
Omaaerosaurus
Xutrcheps
Edaksoje
Trodiktonus


Iteration: 10000, Loss: 23.844446

Onyusaurus
Klecalosaurus
Lustodon
Ola
Xusodonia
Eeaeosaurus
Troceosaurus


Iteration: 12000, Loss: 23.291971

Onyxosaurus
Kica
Lustrepiosaurus
Olaagrraiansaurus
Yuspangosaurus
Eealosaurus
Trognesaurus


Iteration: 14000, Loss: 23.382339

Meutromodromurus
Inda
Iutroinatorsaurus
Maca
Yusteratoptititan
Ca
Troclosaurus


Iteration: 16000, Loss: 23.310540

Meuspsapcosaurus
Inda
Iuspsarciiasauruimphis
Macabosaurus
Yusociman
Caagosaurus
Trrasaurus


Iteration: 18000, Loss: 22.846780

Phytrohekosaurus
Mggaaeschachynthalus
Mxsstarasomus
Pegahosaurus
Yusidon
Ehantor
Troma


Iteration: 20000, Loss: 22.921921

Meutroinepheusaurus
Lola
Lytrogonosaurus
Macalosaurus
Ytrpangricrlosaurus
Elagosaurus
Trochiqkaurus


Iteration: 22000, Loss: 22.758659

Meutrodon
Lola
Mustodon
Necagpsancosaurus
Yuspengromus
Eiadrus
Trochepomushus


Iteration: 24000, Loss: 22.671906

Mitrseitan
Jogaacosaurus
Kurroelathateraterachus
Mecalosaurus
Yusicheosaurus
Eiaeosaurus
Trodon


Iteration: 26000, Loss: 22.628685

Niusaurus
Liceaitona
Lytrodon
Necagrona
Ytrodon
Ejaertalenthomenxtheychosaurus
Trochinictititanhimus


Iteration: 28000, Loss: 22.635401

Plytosaurus
Lmacaisaurus
Mustocephictesaurus
Pacaksela
Xtspanesaurus
Eiaestedantes
Trocenitaudosantenithamus


Iteration: 30000, Loss: 22.627572

Piutysaurus
Micaahus
Mustodongoptes
Pacagsicelosaurus
Wustapisaurus
Eg
Trochesaurus


Iteration: 32000, Loss: 22.250284

Mautosaurus
Kraballona
Lyusianasaurus
Macallona
Xustanarhasauruiraplos
Efaiosaurus
Trodondonsaurukusaurukusaurus


Iteration: 34000, Loss: 22.477635

Nivosaurus
Libaadosaurus
Lutosaurus
Nebahosaurus
Wrosaurus
Eiaeosaurus
Spidonosaurus

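Writing Like Shakespeare

Just as dinosaur names can be generated from a dataset of dinosaur names, a collection of Shakespeare's poems can be used to learn to write poetry in his style. With LSTM cells, the model can learn dependencies that span many characters in the text. Such long-term dependencies matter less for dinosaur names, because the names are short.

Image: portrait of Shakespeare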

from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Model, load_model, Sequential
from keras.layers import Dense, Activation, Dropout, Input, Masking
from keras.layers import LSTM
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences
from shakespeare_utils import *
import sys
import io

Using TensorFlow backend.


Loading text data...
Creating training set...
number of training examples: 31412
Vectorizing training set...
Loading model...
WARNING:tensorflow:From /home/seisinv/anaconda3/envs/fwi_ai/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:1190: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /home/seisinv/anaconda3/envs/fwi_ai/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py:1297: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead

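The cell above loads a model that has already been trained for about 1000 epochs on a collection of Shakespeare's poems called "The Sonnets". It can be trained for one more epoch on top of that. Once training finishes, enter a line of text (fewer than 40 characters) and the model will continue it into a poem.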
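
The 40-character limit suggests that the model reads fixed-length windows of text. shakespeare_utils is not shown, so the following is only a sketch of how (window, next character) training pairs are commonly built for this kind of model. The function name build_windows, the window length Tx = 40, and the stride of 3 are assumptions.

```python
def build_windows(text, Tx=40, stride=3):
    """Cut text into overlapping windows of Tx characters, each paired with
    the character that follows it (hypothetical helper)."""
    X, Y = [], []
    for i in range(0, len(text) - Tx, stride):
        X.append(text[i:i + Tx])
        Y.append(text[i + Tx])
    return X, Y

text = "shall i compare thee to a summer's day? thou art more lovely"
X, Y = build_windows(text)
print(len(X), repr(X[0]), repr(Y[0]))
```

Each window would then be one-hot encoded character by character before being fed to the LSTM, and the target is the one-hot encoding of the following character.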

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y, batch_size=128, epochs=1, callbacks=[print_callback])

Epoch 1/1
31412/31412 [==============================] - 144s - loss: 2.7234   





<keras.callbacks.History at 0x7f8a219f9fd0>

# Run this cell to try with different inputs without having to re-train the model
generate_output()

Write the beginning of your poem, the Shakespeare machine will complete it. Your input is: k


Here is your poem: 

k.
and haply woomers from that thinger dose,
which i ameng thes world the heart expors,
have breat lesst wist'seng with trilh, that now before,
a swill mince theraw ant thy dasuin torde.
 


lay nor repers manthl in haby thou sight,
for that befair portoonss that the beind,
love hall makile strenb which doth ever so,
nor galtend with mind and sho send the belest,
lerentew yith thes incey alonc whic

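This model differs from the RNN used for dinosaur names in the following ways:
- It uses LSTM cells instead of basic RNN cells, to capture longer-range dependencies
- It is deeper: two stacked LSTM layers
- It is built with Keras instead of plain Python, which simplifies model construction

See also the Keras team's text-generation example on GitHub: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py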
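
The Keras example linked above samples the next character with a temperature parameter, which controls how adventurous the generated text is. A minimal NumPy sketch of that idea follows. The function name sample_with_temperature and its exact form are assumptions for illustration and are not copied from the Keras code or from shakespeare_utils.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Re-weight a probability vector by a temperature, then sample an index.

    temperature < 1 sharpens the distribution, temperature > 1 flattens it.
    Returns the sampled index and the re-weighted distribution.
    """
    if rng is None:
        rng = np.random.default_rng()
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return rng.choice(len(weights), p=weights), weights

probs = [0.1, 0.2, 0.7]
_, cold = sample_with_temperature(probs, temperature=0.5, rng=np.random.default_rng(0))
_, hot = sample_with_temperature(probs, temperature=2.0, rng=np.random.default_rng(0))
print(cold.round(3), hot.round(3))
```

At a temperature of 1.0 this reduces to sampling from the model's own distribution, which is what the sample function for dinosaur names does.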

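Conclusion

This article walked through one application of RNNs: text generation. A language model is trained on a text dataset, and new text is then generated with a sampling function. Depending on how long the dependencies in the data are, one can choose between an LSTM model, which handles long-range dependencies, and a basic RNN.

References

Andrew Ng, Deep Learning course, Coursera
The idea for this exercise comes from Andrej Karpathy's implementation: https://gist.github.com/karpathy/d4dee566867f8291f086. For more background, see Karpathy's blog.
The poetry model follows the Keras team's implementation; see: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py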