Language Translation Using Deep Learning

Introduction

In this article, we develop a deep learning model that automatically translates German to English, step by step, in Python with Keras.

Machine translation is a challenging task that traditionally involves large statistical models developed using highly sophisticated linguistic knowledge.

In this article, you will learn how to develop a machine translation system for translating German into English.

Machine translation is the task of converting text from one language into another. Traditional systems rely on statistical models; once such a model has been built, it produces translations quickly, which is the strength of machine learning and statistical modeling. Here, we use deep neural networks for the problem instead, and you will discover how to develop a neural machine translation model.

Pre-Processing the Text Data

Pre-processing the text is an important step in natural language processing before any modeling.

A look at the raw data shows what needs cleaning:

  • There is punctuation to remove.
  • The text contains uppercase and lowercase.
  • The German text contains special characters.
  • The file is ordered by sentence length, with very long sentences toward the end of the file.

In this article, the basic steps of data preparation are divided into two sections: Clean Text and Split Text.

Clean Text

Load the data in a way that preserves the Unicode German characters. The load_doc() function below loads the file as a blob of text.

# load doc into memory
def load_doc(filename):
    # open the file as read only; the UTF-8 encoding preserves the German characters
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all text; the file is closed when the block exits
        text = file.read()
    return text

Every line contains a single pair of phrases, first English and then German, separated by a tab character.

We must split the loaded text by line and then by phrase. The to_pairs() function below does this.

# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs
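
As a quick check of the splitting logic, the sketch below runs the same two splits on a small in-memory string instead of the real file. The sample text is invented for illustration and only mimics the tab-separated layout described above.

```python
# Toy stand-in for the loaded file: one English/German pair per line,
# separated by a tab (the sample content is invented for illustration).
sample_doc = "Hi.\tHallo!\nRun!\tLauf!\nHelp!\tHilfe!\n"


def to_pairs(doc):
    # split into lines, then split each line on the tab character
    lines = doc.strip().split('\n')
    return [line.split('\t') for line in lines]


pairs = to_pairs(sample_doc)
for english, german in pairs:
    print('%s => %s' % (english, german))
```

Each element of pairs is a two-item list, English first and German second, which is the shape the cleaning step expects.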

We are now ready to clean each sentence. The specific cleaning operations we will perform are as follows:

  • Remove all non-printable characters.
  • Remove all punctuation characters.
  • Normalize all Unicode characters to ASCII (e.g. Latin characters).
  • Normalize the case to lowercase.
  • Remove any remaining tokens that are not alphabetic.

We will perform these operations on each phrase for each pair in the loaded dataset.

The clean_pairs() function below implements these operations.

import re
import string
from unicodedata import normalize
from numpy import array

# clean a list of lines
def clean_pairs(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return array(cleaned)
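
To see what these cleaning steps do to a single phrase, here is a minimal sketch that applies the same operations using only the standard library. The clean_phrase() helper is hypothetical (it is just the inner loop of clean_pairs() pulled out for illustration), and the two input phrases are chosen as plausible raw forms of entries in the dataset.

```python
import re
import string
from unicodedata import normalize


def clean_phrase(phrase):
    # hypothetical helper: the per-phrase body of clean_pairs(), stdlib only
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    table = str.maketrans('', '', string.punctuation)
    # decompose accented characters, then drop anything that is not ASCII
    phrase = normalize('NFD', phrase).encode('ascii', 'ignore').decode('UTF-8')
    words = [word.lower() for word in phrase.split()]
    words = [word.translate(table) for word in words]
    words = [re_print.sub('', word) for word in words]
    # keep only purely alphabetic tokens
    words = [word for word in words if word.isalpha()]
    return ' '.join(words)


print(clean_phrase('Grüß Gott!'))
print(clean_phrase('Zu Hülf!'))
```

The results are "gru gott" and "zu hulf", the same forms that appear in the article's spot-check output. Note that the ASCII step does not transliterate: the umlaut mark is stripped from ü and the ß is dropped entirely.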

Finally, now that the data has been cleaned, we can save the list of phrase pairs to a file with the pickle API, using the save_clean_data() function. The saved file is then ready for use.

from pickle import dump

# save a list of clean sentences to file
def save_clean_data(sentences, filename):
    with open(filename, 'wb') as file:
        dump(sentences, file)
    print('Saved: %s' % filename)

# load dataset
filename = 'deu.txt'
doc = load_doc(filename)
# split into english-german pairs
pairs = to_pairs(doc)
# clean sentences (stored as cleaned so the clean_pairs() function is not shadowed)
cleaned = clean_pairs(pairs)
# save clean pairs to file
save_clean_data(cleaned, 'english-german.pkl')
# spot check
for i in range(100):
    print('[%s] => [%s]' % (cleaned[i, 0], cleaned[i, 1]))

Output (first lines):
[hi] => [hallo]
[hi] => [gru gott]
[run] => [lauf]
[wow] => [potzdonner]
[wow] => [donnerwetter]
[fire] => [feuer]
[help] => [hilfe]
[help] => [zu hulf]
[stop] => [stopp]
[wait] => [warte]

Split Text

The clean data contains a little over 150,000 phrase pairs and some of the pairs toward the end of the file are very long.

This is a good number of examples for developing a small translation model. The complexity of the model increases with the number of examples, length of phrases, and size of the vocabulary. More examples are good for creating a large model.

Although we have a good dataset for modeling translation, we will simplify the problem slightly to dramatically reduce the size of the model required, and in turn the training time required to fit the model.

We will simplify the problem by reducing the dataset to the first 10,000 examples in the file; these will be the shortest phrases in the dataset.

Further, we will then take the first 9,000 of those as examples for training and the remaining 1,000 examples to test the fit model.

The complete example below loads the clean data, splits it, and saves the split portions to new files.

from pickle import load
from pickle import dump
from numpy.random import shuffle

# load a clean dataset
def load_clean_sentences(filename):
    with open(filename, 'rb') as file:
        return load(file)

# save a list of clean sentences to file
def save_clean_data(sentences, filename):
    with open(filename, 'wb') as file:
        dump(sentences, file)
    print('Saved: %s' % filename)

# load dataset (the file written by the cleaning step)
raw_dataset = load_clean_sentences('english-german.pkl')

# reduce dataset size
n_sentences = 10000
dataset = raw_dataset[:n_sentences, :]
# random shuffle
shuffle(dataset)
# split into train/test
train, test = dataset[:9000], dataset[9000:]
# save
save_clean_data(dataset, 'english-german-both.pkl')
save_clean_data(train, 'english-german-train.pkl')
save_clean_data(test, 'english-german-test.pkl')

Running the example creates three new files:

  1. english-german-both.pkl contains all of the train and test examples. These are used to define the parameters of the problem, such as the vocabulary.
  2. english-german-train.pkl contains the training dataset.
  3. english-german-test.pkl contains the test dataset.
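
The reduce, shuffle, and split steps can also be tried on a toy list without NumPy or any saved files. The sketch below is an illustration under assumptions: reduce_and_split() is a hypothetical helper, the data is invented, and the fixed seed is an addition (the article's code shuffles without one) so that the result is repeatable.

```python
import random


def reduce_and_split(pairs, n_total, n_train, seed=1):
    # keep only the first n_total pairs (the shortest phrases in the file)
    subset = list(pairs[:n_total])
    # shuffle the copy so train and test both contain a mix of phrase lengths
    random.Random(seed).shuffle(subset)
    return subset[:n_train], subset[n_train:]


# invented toy data: 12 pairs, of which we keep 10 and train on 9
toy_pairs = [['en %d' % i, 'de %d' % i] for i in range(12)]
train, test = reduce_and_split(toy_pairs, n_total=10, n_train=9)
print(len(train), len(test))
```

Shuffling before the split matters here because the file is ordered by sentence length; without it, the test set would consist only of the longest of the selected phrases.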

Train the Language Translation Model

This step involves loading and preparing the clean text data for modeling, then defining and training the model on the prepared data.

The load_clean_sentences() function can be used to load the train, test, and both datasets.

from pickle import load

# load a clean dataset
def load_clean_sentences(filename):
    with open(filename, 'rb') as file:
        return load(file)

# load datasets (named train and test to match the code that follows)
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')

We can use the Keras Tokenizer class to map words to integers. We will use a separate tokenizer for the English sequences and the German sequences. The create_tokenizer() function below trains a tokenizer on a list of phrases.

# Keras 2.x import path (assumed version; newer releases may differ)
from keras.preprocessing.text import Tokenizer

# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# max sentence length
def max_length(lines):
    return max(len(line.split()) for line in lines)

The max_length() function above finds the length of the longest sequence in a list of phrases.

We can call these functions with the combined dataset to prepare tokenizers, vocabulary sizes, and maximum lengths for both the English and German phrases.

# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
print('English Vocabulary Size: %d' % eng_vocab_size)
print('English Max Length: %d' % (eng_length))
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
print('German Vocabulary Size: %d' % ger_vocab_size)
print('German Max Length: %d' % (ger_length))
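
To make the word-to-integer mapping and the "+ 1" in the vocabulary size concrete, here is a simplified stand-in written in plain Python. It is a sketch of the idea, not the Keras implementation: build_word_index() is a hypothetical helper, the phrases are taken from the article's example outputs, and the real Tokenizer also filters punctuation and lowercases by default, which this sketch skips because the data is already clean.

```python
from collections import Counter


def build_word_index(lines):
    # count how often each word occurs across all phrases
    counts = Counter(word for line in lines for word in line.split())
    # the most frequent word gets index 1; index 0 is left free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


phrases = ['tom felt lonely', 'tom raised me', 'i missed you']
word_index = build_word_index(phrases)
vocab_size = len(word_index) + 1   # +1 accounts for the reserved index 0
max_len = max(len(line.split()) for line in phrases)
print(word_index['tom'], vocab_size, max_len)
```

There are eight distinct words, so the vocabulary size is nine once the reserved padding index is counted.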

We are now ready to prepare the training data.

Each input and output sequence must be encoded to integers and padded to the maximum phrase length, because we will use a word embedding for the input sequences and one-hot encode the output sequences. The encode_sequences() function below performs these operations and returns the result.

# Keras 2.x import path (assumed version; newer releases may differ)
from keras.preprocessing.sequence import pad_sequences

# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
    # integer encode sequences
    X = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    X = pad_sequences(X, maxlen=length, padding='post')
    return X

The output sequence needs to be one-hot encoded. This is because the model will predict the probability of each word in the vocabulary as output.

The function encode_output() below will one-hot encode English output sequences.

from numpy import array
from keras.utils import to_categorical

# one hot encode target sequence
def encode_output(sequences, vocab_size):
    ylist = list()
    for sequence in sequences:
        encoded = to_categorical(sequence, num_classes=vocab_size)
        ylist.append(encoded)
    y = array(ylist)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y

Now we can use these two functions to prepare both the train and test datasets for training the model.

# prepare training data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
trainY = encode_output(trainY, eng_vocab_size)
# prepare validation data
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
testY = encode_output(testY, eng_vocab_size)
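
The two encoding steps can be illustrated on tiny inputs without Keras. The sketch below is a plain-Python approximation, assuming no sequence is longer than the target length; pad_post() and one_hot() are hypothetical helpers that mimic post-padding and one-hot encoding rather than the library functions themselves.

```python
def pad_post(sequences, length):
    # append zeros so every sequence has exactly `length` entries
    # (assumes no sequence is longer than `length`)
    return [seq + [0] * (length - len(seq)) for seq in sequences]


def one_hot(sequence, vocab_size):
    # one row per position, with a single 1 at the word's index
    return [[1 if i == index else 0 for i in range(vocab_size)]
            for index in sequence]


padded = pad_post([[3, 1], [2]], length=3)
print(padded)
print(one_hot(padded[1], vocab_size=4))
```

After one-hot encoding, each phrase becomes a grid of shape (phrase length, vocabulary size), which is why the target array for the full training set grows quickly with the vocabulary.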

We are now ready to define the model.

We will use an encoder-decoder LSTM model on this problem. In this model, the input sequence is encoded by a front-end model called the encoder then decoded word by word by a backend model called the decoder.

The define_model() function below defines the model and takes a number of arguments used to configure the model, such as the size of the input and output vocabularies, the maximum length of input and output phrases, and the number of memory units used to configure the model.

The model is trained using the efficient Adam approach to stochastic gradient descent and minimizes the categorical cross-entropy loss function, because we have framed the prediction problem as multi-class classification.

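# Keras 2.x import paths (assumed version; newer releases may differ)
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding, RepeatVector, TimeDistributed
from keras.utils import plot_model

# define NMT model
def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    # encoder: embed the source words and summarize them in one vector
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    # repeat the summary vector once per output time step
    model.add(RepeatVector(tar_timesteps))
    # decoder: produce a word distribution at every output time step
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    return model

# define model
model = define_model(ger_vocab_size, eng_vocab_size, ger_length, eng_length, 256)
model.compile(optimizer='adam', loss='categorical_crossentropy')
# summarize defined model (summary() prints directly and returns None)
model.summary()
# plot_model needs pydot and Graphviz installed
plot_model(model, to_file='model.png', show_shapes=True)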

Finally, we can train the model for 30 epochs with a batch size of 64 examples.

We use checkpointing to ensure that each time the model skill on the test set improves, the model is saved to file.

from keras.callbacks import ModelCheckpoint

# fit model
filename = 'model.h5'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
model.fit(trainX, trainY, epochs=30, batch_size=64, validation_data=(testX, testY), callbacks=[checkpoint], verbose=2)

Running the example first prints a summary of the parameters of the dataset, such as vocabulary size and maximum phrase lengths:

English Vocabulary Size: 2404
English Max Length: 5
German Vocabulary Size: 3856
German Max Length: 10

Model summary:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 10, 256)           987136
lstm_1 (LSTM)                (None, 256)               525312
repeat_vector_1 (RepeatVecto (None, 5, 256)            0
lstm_2 (LSTM)                (None, 5, 256)            525312
time_distributed_1 (TimeDist (None, 5, 2404)           617828
=================================================================
Total params: 2,655,588
Trainable params: 2,655,588
Non-trainable params: 0
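
The parameter counts in the summary can be checked by hand from the vocabulary sizes and the 256 memory units. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are made up for illustration.

```python
def embedding_params(vocab_size, n_units):
    # one n_units-long vector per vocabulary entry
    return vocab_size * n_units


def lstm_params(input_dim, n_units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + n_units) * n_units + n_units)


def dense_params(input_dim, output_dim):
    # weight matrix plus one bias per output
    return input_dim * output_dim + output_dim


ger_vocab_size, eng_vocab_size, n_units = 3856, 2404, 256
counts = [
    embedding_params(ger_vocab_size, n_units),   # embedding_1
    lstm_params(n_units, n_units),               # lstm_1
    0,                                           # repeat_vector_1 has no weights
    lstm_params(n_units, n_units),               # lstm_2
    dense_params(n_units, eng_vocab_size),       # time_distributed_1
]
print(counts, sum(counts))
```

The per-layer values and the total of 2,655,588 agree with the summary, which is a useful sanity check that the vocabulary sizes passed to define_model() are the intended ones.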

Next, the model is trained.

Each epoch takes about 30 seconds on modern CPU hardware; no GPU is required.

During the run, the model will be saved to the file model.h5, ready for inference in the next step.

The output for the last few epochs looks like this:

Epoch 26/30
Epoch 00025: val_loss improved from 2.20048 to 2.19976, saving model to model.h5
17s - loss: 0.7114 - val_loss: 2.1998
Epoch 27/30
Epoch 00026: val_loss improved from 2.19976 to 2.18255, saving model to model.h5
17s - loss: 0.6532 - val_loss: 2.1826
Epoch 28/30
Epoch 00027: val_loss did not improve
17s - loss: 0.5970 - val_loss: 2.1970
Epoch 29/30
Epoch 00028: val_loss improved from 2.18255 to 2.17872, saving model to model.h5
17s - loss: 0.5474 - val_loss: 2.1787
Epoch 30/30
Epoch 00029: val_loss did not improve
17s - loss: 0.5023 - val_loss: 2.1823
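
The checkpoint behaviour in this log can be pictured as a running minimum over the validation loss. The sketch below is a simplified illustration of that rule, not the Keras callback itself; best_only_saves() is a hypothetical helper, and the numbers are the val_loss values reported for epochs 26 to 30.

```python
def best_only_saves(val_losses, best=float('inf')):
    # returns one flag per epoch: True when the model would be saved
    saved = []
    for loss in val_losses:
        improved = loss < best
        if improved:
            best = loss
        saved.append(improved)
    return saved


# val_loss for epochs 26-30, starting from the previous best of 2.20048
losses = [2.19976, 2.18255, 2.1970, 2.17872, 2.1823]
print(best_only_saves(losses, best=2.20048))
```

The flags line up with the "improved" and "did not improve" messages in the log, so the file left on disk holds the weights from epoch 29 rather than from the final epoch.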

Evaluate the Translation Model

Now, evaluate the model on the train and test data.

Ideally, we would use a separate validation dataset to help with model selection during training instead of the test set. You can try this as an extension.

# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')

# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])

# prepare data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
# load the model saved by the checkpoint during training
from keras.models import load_model
model = load_model('model.h5')

Evaluation involves two steps: first generating a translated output sequence, and then repeating this process for many input examples and summarizing the skill of the model across multiple cases.

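Starting with inference, the model can predict the entire output sequence in one shot:

translation = model.predict(source, verbose=0)

The prediction holds one probability vector per output position. Taking the index of the largest value in each vector gives a sequence of integers, which we can look up in the tokenizer to map back to words.

The function below, named word_for_id(), performs this reverse mapping.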

from numpy import argmax

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

The predict_sequence() function performs this operation for a single encoded source phrase.

# generate target given source sequence
def predict_sequence(model, tokenizer, source):
    prediction = model.predict(source, verbose=0)[0]
    # pick the most probable word index at each output position
    integers = [argmax(vector) for vector in prediction]
    target = list()
    for i in integers:
        word = word_for_id(i, tokenizer)
        # index 0 is padding and maps to no word, which ends the phrase
        if word is None:
            break
        target.append(word)
    return ' '.join(target)
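
Because word_for_id() scans the whole vocabulary on every lookup, one possible variation is to invert the word index once and decode with dictionary lookups. The sketch below shows that idea in plain Python on invented data; invert_word_index() and decode() are hypothetical helpers, and the probabilities are made up so that the argmax at each step is easy to follow.

```python
def invert_word_index(word_index):
    # build the integer -> word mapping once instead of scanning per lookup
    return {index: word for word, index in word_index.items()}


def decode(prediction, index_word):
    words = []
    for vector in prediction:
        # position of the largest probability in this time step
        best = max(range(len(vector)), key=vector.__getitem__)
        word = index_word.get(best)
        if word is None:   # index 0 (padding) has no word, so stop here
            break
        words.append(word)
    return ' '.join(words)


# invented toy vocabulary and probabilities for a 4-step output
index_word = invert_word_index({'i': 1, 'missed': 2, 'you': 3})
prediction = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.0, 0.1, 0.1, 0.8],
    [0.9, 0.0, 0.1, 0.0],
]
print(decode(prediction, index_word))
```

The fourth step predicts index 0 with the highest probability, which is how the model signals that a phrase is shorter than the maximum length.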

Next, we can repeat this step for each source phrase in a dataset and compare the predicted result to the expected target phrase in English.

We can print some of these comparisons to the screen to get an idea of how the model performs in practice.

For testing, we will also calculate the BLEU scores to get a quantitative idea of how well the model has performed.

from nltk.translate.bleu_score import corpus_bleu

# evaluate the skill of the model
def evaluate_model(model, tokenizer, sources, raw_dataset):
    actual, predicted = list(), list()
    for i, source in enumerate(sources):
        # translate encoded source text
        source = source.reshape((1, source.shape[0]))
        # use the tokenizer passed in, not a global one
        translation = predict_sequence(model, tokenizer, source)
        raw_target, raw_src = raw_dataset[i]
        if i < 10:
            print('src=[%s], target=[%s], predicted=[%s]' % (raw_src, raw_target, translation))
        # corpus_bleu expects a list of reference translations per example
        actual.append([raw_target.split()])
        predicted.append(translation.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

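Now we can evaluate the loaded model on both the training and test datasets by calling evaluate_model() with the English tokenizer, the encoded German sources, and the matching raw pairs, for example evaluate_model(model, eng_tokenizer, testX, test).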

Output:

src=[er ist ein blodmann], target=[hes a jerk], predicted=[hes a jerk]
src=[ich bin brillentrager], target=[i wear glasses], predicted=[i wear glasses]
src=[tom hat mich aufgezogen], target=[tom raised me], predicted=[tom tricked me]
src=[ich zahle auf tom], target=[i count on tom], predicted=[ill call tom tom]
src=[ich kann rauch sehen], target=[i can see smoke], predicted=[i can help you]
src=[tom fuhlte sich einsam], target=[tom felt lonely], predicted=[tom felt uneasy]
src=[hab ich nicht recht], target=[am i wrong], predicted=[am i fat]
src=[gestatten sie mir zu gehen], target=[allow me to go], predicted=[do me to go]
src=[du hast mir gefehlt], target=[i missed you], predicted=[i missed you]
src=[es ist zu spat], target=[it is too late], predicted=[its too late]
BLEU-1: 0.844852
BLEU-2: 0.779819
BLEU-3: 0.699516
BLEU-4: 0.452614
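
To give a feel for what the BLEU-1 number measures, here is a simplified single-sentence version written from the definition: clipped unigram precision multiplied by a brevity penalty. It is a sketch under assumptions, not NLTK's code. bleu1() is a hypothetical helper, it handles one reference and a non-empty candidate only, and corpus_bleu() pools the counts over all sentences instead of scoring them one by one.

```python
import math
from collections import Counter


def bleu1(reference, candidate):
    # simplified single-reference, single-sentence BLEU-1 (not NLTK's code)
    ref_counts = Counter(reference.split())
    cand_tokens = candidate.split()
    cand_counts = Counter(cand_tokens)
    # clip each candidate word count at its count in the reference
    clipped = sum(min(n, ref_counts[word]) for word, n in cand_counts.items())
    precision = clipped / len(cand_tokens)
    # brevity penalty punishes candidates shorter than the reference
    ref_len, cand_len = len(reference.split()), len(cand_tokens)
    penalty = 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)
    return penalty * precision


print(bleu1('i missed you', 'i missed you'))
print(bleu1('i count on tom', 'ill call tom tom'))
```

The exact match scores 1.0. For the second pair, only one of the two predicted "tom" tokens is credited because the reference contains the word once, giving 1 matched word out of 4, or 0.25.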

Conclusion

This article walked through a fairly large, advanced example. It covered the whole basic process: cleaning the data, tokenizing it, and building a machine translation model with LSTM layers that translates German into English.
