🚀 100 Times Faster Natural Language Processing in Python

So, how can we speed up these loops?

Fast Loops in Python with a bit of Cython

Let’s work this out on a simple example. Say we have a large set of rectangles that we store as a list of Python objects, e.g. instances of a Rectangle class. The main job of our module is to iterate over this list in order to count how many rectangles have an area larger than a specific threshold.

Our Python module is quite simple and looks like this:
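
A minimal sketch of such a module, written from the description above. Rectangle, area and check_rectangles are named in the text; the attribute names w and h and the make_rectangles helper are assumptions made for illustration.

```python
from random import random


class Rectangle:
    """A rectangle described by its width and height."""

    def __init__(self, w, h):
        self.w = w
        self.h = h

    def area(self):
        return self.w * self.h


def check_rectangles(rectangles, threshold):
    """Count the rectangles whose area is strictly larger than threshold."""
    n_out = 0
    for rectangle in rectangles:
        # Each iteration performs a method lookup and a Python-level call
        if rectangle.area() > threshold:
            n_out += 1
    return n_out


def make_rectangles(n_rectangles):
    """Hypothetical helper: build a list of random rectangles in the unit square."""
    return [Rectangle(random(), random()) for _ in range(n_rectangles)]
```

Calling check_rectangles(make_rectangles(1_000_000), threshold=0.25) exercises the loop on a list large enough for the per-iteration overhead to become noticeable.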

The check_rectangles function is our bottleneck! It loops over a large number of Python objects, and this can be rather slow because the Python interpreter does a lot of work under the hood at each iteration (looking up the area method in the class, packing and unpacking arguments, calling the Python API, and so on).

Here comes Cython to help us speed up our loop.

The Cython language is a superset of Python that contains two kinds of objects:

  • Python objects are the objects we manipulate in regular Python like numbers, strings, lists, class instances

  • Cython C objects are C or C++ objects like double, int, float, struct, vectors
    that can be compiled by Cython into super fast low-level code.

A fast loop is simply a loop in a Cython program within which we only access Cython C objects.

A straightforward approach to designing such a loop is to define C structures that will contain all the things we need during our computation: in our case, the lengths and widths of our rectangles.

We can then store our list of rectangles in a C array of such structures that we will pass to our check_rectangles function. This function now has to accept a C array as input and thus will be defined as a Cython function by using the cdef keyword instead of def (note that cdef is also used to define Cython C objects).

Here is what the fast Cython version of our Python module looks like:
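
A sketch of such a module, following the description above: a C struct holding the two sides, a cdef function looping over a C array, and cymem's Pool for the allocation. So that the snippet runs from a plain Python session, the Cython source is kept in a string and written to a .pyx file. The file name fast_rectangles.pyx and the write_pyx helper are hypothetical, and compiling the result assumes that Cython, cymem and a C compiler are installed.

```python
from pathlib import Path

# Cython source for the fast version (a sketch; this snippet does not compile it).
CYTHON_SOURCE = '''\
from cymem.cymem cimport Pool
from random import random

cdef struct Rectangle:
    float w
    float h

cdef int check_rectangles(Rectangle* rectangles, int n_rectangles, float threshold):
    cdef int n_out = 0
    cdef int i
    # C arrays carry no size information, so the length is passed explicitly
    for i in range(n_rectangles):
        if rectangles[i].w * rectangles[i].h > threshold:
            n_out += 1
    return n_out

def main(int n_rectangles=10000000, float threshold=0.25):
    cdef int i
    cdef Pool mem = Pool()  # releases the C array when the Pool is garbage collected
    cdef Rectangle* rectangles = <Rectangle*>mem.alloc(n_rectangles, sizeof(Rectangle))
    for i in range(n_rectangles):
        rectangles[i].w = random()
        rectangles[i].h = random()
    return check_rectangles(rectangles, n_rectangles, threshold)
'''


def write_pyx(directory, name="fast_rectangles.pyx"):
    """Hypothetical helper: write the Cython source to directory/name and return the path."""
    path = Path(directory) / name
    path.write_text(CYTHON_SOURCE, encoding="utf-8")
    return path
```

Only main is visible from Python once the file is compiled; check_rectangles is a cdef function and stays at the C level.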

Here we used a raw array of C pointers but you can also choose other options, in particular C++ structures like vectors, pairs, queues and the like. In this snippet, I also used the convenient Pool() memory management object of cymem to avoid having to free the allocated C array manually. When Pool is garbage collected by Python, it automatically frees the memory we allocated using it.

A good reference on the practical usage of Cython in NLP is the Cython Conventions page of spaCy’s API.

đŸ‘©â€đŸŽš Let’s Try that Code!

There are many ways you can test, compile and distribute Cython code! Cython can even be used directly in a Jupyter Notebook like Python.

First, install Cython with pip install cython.

First Tests in Jupyter

Load the Cython extension in a Jupyter notebook with %load_ext Cython.

Now you can write Cython code like Python code by using the magic command %%cython.

If you have a compilation error when you execute a Cython cell, be sure to check the Jupyter terminal output to see the full message.

Most of the time you’ll be missing a -+ tag after %%cython to compile to C++ (for example if you use spaCy Cython API) or an import numpy if the compiler complains about NumPy.

As I mentioned in the beginning, check the Jupyter Notebook accompanying this post: it has all the examples we discuss running in Jupyter.

Writing, Using and Distributing Cython Code

Cython code is written in .pyx files. These files are translated to C or C++ files by the Cython compiler and then compiled to native extension modules (shared libraries) by the system’s C compiler. The compiled modules can then be imported by the Python interpreter.

You can load a .pyx file directly in Python by using pyximport:

>>> import pyximport; pyximport.install()
>>> import my_cython_module

You can also build your Cython code as a Python package and import/distribute it as a regular Python package as detailed here. This can take some time to get working, in particular on all platforms. If you need a working example, spaCy’s install script is a rather comprehensive one.

Before we move to some NLP, let’s quickly talk about the def, cdef and cpdef keywords, because they are the main things you need to grasp to start using Cython.

You can use three types of functions in a Cython program:

  • Python functions, which are defined with the usual keyword def. They take Python objects as input and return Python objects. Internally they can use both Python and C/C++ objects and can call both Cython and Python functions.
  • Cython functions defined with the cdef keyword. They can take as input, use internally and output both Python and C/C++ objects. These functions are not accessible from the Python-space (i.e. the Python interpreter and other pure Python modules that would import your Cython module) but they can be cimported by other Cython modules.
  • Cython functions defined with the cpdef keyword are like the cdef Cython functions but they are also provided with a Python wrapper so they can be called from the Python-space (with Python objects as inputs and outputs) as well as from other Cython modules (with C/C++ or Python objects as inputs).

The cdef keyword has another use which is to type Cython C/C++ objects in the code. Unless you type your objects with this keyword, they will be considered as Python objects (and thus slow to access).
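
The three kinds of functions and the typing role of cdef can be sketched in a few lines of Cython. As an assumption made so that the snippet runs as plain Python, the Cython source is kept in a string and written to a .pyx file. All function names, the file name function_kinds.pyx and the write_pyx helper are hypothetical.

```python
from pathlib import Path

# Cython source illustrating def, cdef and cpdef (a sketch; not compiled here).
CYTHON_SOURCE = '''\
# def: a regular Python function, callable from Python code
def py_square(x):
    return x * x

# cdef: a C-level function, only callable from Cython code
cdef double c_square(double x):
    return x * x

# cpdef: a C-level function that also gets a Python wrapper
cpdef double square(double x):
    return c_square(x)

def sum_of_squares(int n):
    # cdef also declares typed C variables; untyped names stay Python objects
    cdef int i
    cdef double total = 0.0
    for i in range(n):
        total += c_square(i)
    return total
'''


def write_pyx(directory, name="function_kinds.pyx"):
    """Hypothetical helper: write the Cython source to directory/name and return the path."""
    path = Path(directory) / name
    path.write_text(CYTHON_SOURCE, encoding="utf-8")
    return path
```

Once compiled, py_square, square and sum_of_squares would be importable from Python, while c_square would remain reachable only from Cython code.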