
How to Develop a Neural Machine Translation System from Scratch

Develop a Deep Learning Model to Automatically Translate from German to English in Python with Keras, Step-by-Step.

Machine translation is a challenging task that traditionally involves large statistical models developed using highly sophisticated linguistic knowledge.

Neural machine translation is the use of deep neural networks for the problem of machine translation.

In this tutorial, you will discover how to develop a neural machine translation system for translating German phrases to English.

After completing this tutorial, you will know:

  • How to clean and prepare data ready to train a neural machine translation system.
  • How to develop an encoder-decoder model for machine translation.
  • How to use a trained model for inference on new input phrases and evaluate the model skill.

Let’s get started.

How to Develop a Neural Machine Translation System in Keras
Photo by Björn Groß, some rights reserved.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. German to English Translation Dataset
  2. Preparing the Text Data
  3. Train Neural Translation Model
  4. Evaluate Neural Translation Model

Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have NumPy and Matplotlib installed.
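
If you are unsure which versions you have installed, a quick check like the following simply prints the version of each library (this snippet is my own addition, not part of the tutorial workflow):

# print the versions of the key libraries used in this tutorial
import keras
import numpy
import matplotlib

print('keras: %s' % keras.__version__)
print('numpy: %s' % numpy.__version__)
print('matplotlib: %s' % matplotlib.__version__)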

If you need help with your environment, see this post:

German to English Translation Dataset

In this tutorial, we will use a dataset of German to English terms used as the basis for flashcards for language learning.

The dataset is available from the ManyThings.org website, with examples drawn from the Tatoeba Project. The dataset is comprised of German phrases and their English counterparts and is intended to be used with the Anki flashcard software.

The page provides a list of many language pairs, and I encourage you to explore other languages:

The dataset we will use in this tutorial is available for download here:

Download the dataset to your current working directory and decompress; for example:

unzip deu-eng.zip

You will have a file called deu.txt that contains 152,820 pairs of English to German phrases, one pair per line with a tab separating the languages.

For example, the first 5 lines of the file look as follows:

Hi.	Hallo!
Hi.	Grüß Gott!
Run!	Lauf!
Wow!	Potzdonner!
Wow!	Donnerwetter!

We will frame the prediction problem as follows: given a sequence of words in German as input, translate or predict the corresponding sequence of words in English.

The model we will develop will be suitable for some beginner German phrases.

Preparing the Text Data

The next step is to prepare the text data ready for modeling.

Take a look at the raw data and note what you see that we might need to handle in a data cleaning operation.

For example, here are some observations I note from reviewing the raw data:

  • There is punctuation.
  • The text contains uppercase and lowercase.
  • There are special characters in the German.
  • There are duplicate phrases in English with different translations in German.
  • The file is ordered by sentence length with very long sentences toward the end of the file.

Did you note anything else that could be important?
Let me know in the comments below.

A good text cleaning procedure may handle some or all of these observations.

Data preparation is divided into two subsections:

  1. Clean Text
  2. Split Text

1. Clean Text

First, we must load the data in a way that preserves the Unicode German characters. The function below called load_doc() will load the file as a blob of text.

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

Each line contains a single pair of phrases, first English and then German, separated by a tab character.

We must split the loaded text by line and then by phrase. The function to_pairs() below will split the loaded text.

# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs

We are now ready to clean each sentence. The specific cleaning operations we will perform are as follows:

  • Remove all non-printable characters.
  • Remove all punctuation characters.
  • Normalize all Unicode characters to ASCII (e.g. Latin characters).
  • Normalize the case to lowercase.
  • Remove any remaining tokens that are not alphabetic.

We will perform these operations on each phrase for each pair in the loaded dataset.

The clean_pairs() function below implements these operations.

# clean a list of lines
def clean_pairs(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return array(cleaned)

Finally, now that the data has been cleaned, we can save the list of phrase pairs to a file ready for use.

The function save_clean_data() uses the pickle API to save the list of clean text to file.
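
For reference, the function is only a few lines (it also appears in the complete listing below):

# save a list of clean sentences to file
def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)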

Pulling all of this together, the complete example is listed below.

import string
import re
from pickle import dump
from unicodedata import normalize
from numpy import array

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs

# clean a list of lines
def clean_pairs(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return array(cleaned)

# save a list of clean sentences to file
def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

# load dataset
filename = 'deu.txt'
doc = load_doc(filename)
# split into english-german pairs
pairs = to_pairs(doc)
# clean sentences
clean_pairs = clean_pairs(pairs)
# save clean pairs to file
save_clean_data(clean_pairs, 'english-german.pkl')
# spot check
for i in range(100):
    print('[%s] => [%s]' % (clean_pairs[i, 0], clean_pairs[i, 1]))

Running the example creates a new file in the current working directory with the cleaned text called english-german.pkl.

Some examples of the clean text are printed for us to evaluate at the end of the run to confirm that the clean operations were performed as expected.

[hi] => [hallo]
[hi] => [gru gott]
[run] => [lauf]
[wow] => [potzdonner]
[wow] => [donnerwetter]
[fire] => [feuer]
[help] => [hilfe]
[help] => [zu hulf]
[stop] => [stopp]
[wait] => [warte]
...
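
If you want to check the saved file independently, a minimal sketch along these lines (assuming the english-german.pkl file created above) loads the array back with the pickle API and prints its shape and first pair:

# load the cleaned pairs back from file to verify the save
from pickle import load

clean_data = load(open('english-german.pkl', 'rb'))
print(clean_data.shape)
print(clean_data[0])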

2. Split Text

The clean data contains a little over 150,000 phrase pairs and some of the pairs toward the end of the file are very long.

This is a good number of examples for developing a small translation model. The complexity of the model increases with the number of examples, length of phrases, and size of the vocabulary.

Although we have a good dataset for modeling translation, we will simplify the problem slightly to dramatically reduce the size of the model required, and in turn the training time required to fit the model.

You can explore developing a model on the fuller dataset as an extension; I would love to hear how you do.

We will simplify the problem by reducing the dataset to the first 10,000 examples in the file; these will be the shortest phrases in the dataset.

Further, we will then take the first 9,000 of those as examples for training and use the remaining 1,000 examples to test the fit model.

Below is the complete example of loading the clean data, splitting it, and saving the split portions of data to new files.

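A sketch of such a script follows; it loads the cleaned pairs, keeps the first 10,000 examples, splits off 9,000 for training and 1,000 for testing, and saves each portion with the pickle API. The shuffle before splitting and the output filenames are my own choices rather than anything fixed above.

from pickle import load
from pickle import dump
from numpy.random import shuffle

# load a clean dataset from file
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))

# save a list of clean sentences to file
def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

# load the full set of cleaned pairs
raw_dataset = load_clean_sentences('english-german.pkl')

# reduce the dataset to the first 10,000 (shortest) examples
n_sentences = 10000
dataset = raw_dataset[:n_sentences, :]
# shuffle so the train/test split is not ordered by phrase length (my own choice)
shuffle(dataset)
# take the first 9,000 examples for training and the remaining 1,000 for testing
train, test = dataset[:9000], dataset[9000:]
# save each portion to a new file (filenames are my own choice)
save_clean_data(dataset, 'english-german-both.pkl')
save_clean_data(train, 'english-german-train.pkl')
save_clean_data(test, 'english-german-test.pkl')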