5 ESSENTIAL PYTHON TOOLS FOR DATA SCIENCE—NOW IMPROVED
If you want to master, or even just use, data analysis, Python is the place to do it. Python is easy to learn, it has vast and deep support, and nearly every data science library and machine learning framework out there has a Python interface.
Over the last couple of months, several data science projects for Python have released new versions with major feature updates. Some are about actual number-crunching; others make it easier for Pythonistas to write fast code optimized for those jobs.
Essential Python for data science: SciPy 1.0
What’s SciPy for
Python users who want a fast and powerful math library can use NumPy, but NumPy by itself isn’t very task-focused. SciPy uses NumPy to provide libraries for common math- and science-oriented programming tasks, from linear algebra to statistical work to signal processing.
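As a small illustration of that task focus, here is a hedged sketch using two of those toolboxes, scipy.linalg for a linear system and scipy.stats for descriptive statistics (the data values are made up for the example):

```python
import numpy as np
from scipy import linalg, stats

# Solve the linear system Ax = b with scipy.linalg
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)
print(x)  # [2. 3.]

# Summarize a small sample with scipy.stats
sample = np.array([2.1, 2.5, 2.2, 2.8, 2.4])
summary = stats.describe(sample)
print(summary.mean)  # 2.4
```

Both calls lean on NumPy arrays underneath, which is exactly the division of labor described above.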
How SciPy 1.0 helps with data science
SciPy has long been useful for providing convenient and widely used tools for working with math and statistics. But for the longest time, it didn’t have a proper 1.0 release, although it had strong backward compatibility across versions.
The trigger for bringing the SciPy project to version 1.0, according to core developer Ralf Gommers, was chiefly a consolidation of how the project is governed and managed. But the release also included continuous integration for the macOS and Windows builds, as well as proper support for prebuilt Windows binaries. This last feature means Windows users can now use SciPy without having to jump through additional hoops.
Where to download SciPy
SciPy is available on the Python Package Index, and can be installed via pip install scipy. It's also available via the Anaconda distribution of Python (conda install scipy). Source code is available on GitHub.
Essential Python for data science: Dask 0.15.4
What Dask is
Processing power is cheaper than ever, but it can be tricky to use it to full effect: breaking tasks across multiple CPU cores, physical processors, or compute nodes.
Dask takes a Python job and schedules it efficiently across multiple systems. What’s most useful about Dask is that the syntax used to launch Dask jobs is virtually the same as the syntax used to do other things in Python, so it requires little reworking of existing code to be useful.
How Dask helps with data science
Dask provides its own versions of the interfaces of many popular machine learning and scientific-computing libraries in Python. Its DataFrame object works like the one in the Pandas library; likewise, its Array object works just like NumPy's. Because the interfaces match, you can quickly parallelize existing code by changing only a few lines.
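For instance, a NumPy-style computation can be distributed by swapping in Dask's Array. A minimal sketch (the chunk size here is an arbitrary choice):

```python
import numpy as np
import dask.array as da

x = np.arange(100_000).reshape(1000, 100)

# Wrap the NumPy array in a chunked Dask array; each chunk can be
# processed on a separate core or node.
dx = da.from_array(x, chunks=(250, 100))

# Same method call as in NumPy, but evaluation is lazy until .compute()
total = dx.sum().compute()
print(total)  # 4999950000
```

The only changes from the NumPy version are the wrapping step and the final .compute() call.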
Dask can also be used to parallelize jobs written in pure Python, and has object types (such as Bag) suited to optimizing those types of jobs.
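A hedged sketch of that idea with Bag, which spreads per-item work on plain Python collections across workers:

```python
import dask.bag as db

# A Bag partitions an ordinary Python sequence so that operations on
# its items can be distributed across workers.
words = db.from_sequence(["spam", "eggs", "spam", "ham"], npartitions=2)

# frequencies() counts occurrences; compute() triggers the actual work
counts = dict(words.frequencies().compute())
print(counts["spam"])  # 2
```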
Where to download Dask
Dask is available on the Python Package Index, and can be installed via pip install dask. It's also available via the Anaconda distribution of Python, by typing conda install dask. Source code is available on GitHub.
Essential Python for data science: Numba 0.35.0
What Numba is
Numba lets Python functions or modules be compiled to machine code via the LLVM compiler framework. You can do this on the fly, whenever a Python program runs, or ahead of time. In that sense, Numba is like Cython, but Numba is often more convenient to work with, while code accelerated with Cython is easier to distribute to third parties.
How Numba helps data science
The most obvious way Numba helps data scientists is by speeding up operations written in Python. You can prototype projects in pure Python, then annotate them with Numba to make them fast enough for production use.
Numba can also deliver extra speed on hardware built for machine learning and data science applications. Earlier versions of Numba supported compiling to CUDA-accelerated code, but the most recent versions sport a new, far more efficient GPU code reduction algorithm for faster compilation.
Numba also uses contributions from Intel, via the ParallelAccelerator project, to speed up certain operations by automatically parallelizing them. Warning: The ParallelAccelerator additions are still experimental, so they shouldn’t be used in production yet.
Where to download Numba
Numba is available on the Python Package Index, and it can be installed by typing pip install numba from the command line. Prebuilt binaries are available for Windows, macOS, and generic Linux. It's also available as part of the Anaconda Python distribution, where it can be installed by typing conda install numba. Source code is available on GitHub.
Essential Python for data science: Cython 0.27
What’s Cython for
Cython transforms existing Python code into C code that can run orders of magnitude faster. This transformation comes in most handy with code that’s math-heavy or runs in tight loops, something you see a lot in Python programs written for engineering, science, and machine learning.
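For a feel of what that looks like, here is a minimal Cython sketch; the cdef type declarations are what let Cython emit plain C for the loop:

```cython
# Integrate x^2 over [a, b] with a left Riemann sum. The static C
# types below remove Python object overhead from the hot loop.
def integrate(double a, double b, int n):
    cdef double dx = (b - a) / n
    cdef double s = 0.0
    cdef int i
    for i in range(n):
        s += (a + i * dx) ** 2 * dx
    return s
```

Untyped, this is valid Python; with the types, Cython compiles the loop body down to C arithmetic.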
How Cython 0.27 helps with data science
The latest version of Cython broadens support for integration with IPython/Jupyter notebooks. Cython-compiled code can already be used in Jupyter notebooks via inline annotations, as if it were any other Python code.
With Cython 0.27, you can now compile Cython modules for Jupyter with profile-guided optimization enabled. Modules built with this option are compiled and optimized based on profiling information generated for them, so they run faster. Note that this option is only available for Cython when used with the GCC compiler; MSVC support isn't there yet.
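In a notebook, that workflow is just cell magic. A sketch, assuming the kernel has Cython 0.27 or later and GCC available; the --pgo flag shown here is an assumption about how the profile-guided option is spelled:

```cython
# First cell: load the Cython extension
%load_ext cython

# Later cell: compile with profile-guided optimization
%%cython --pgo
def dot(double[:] a, double[:] b):
    cdef double s = 0.0
    cdef Py_ssize_t i
    for i in range(a.shape[0]):
        s += a[i] * b[i]
    return s
```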
Where to get Cython
Cython is available on the Python Package Index, and it can be installed with pip install cython from the command line. Binary versions for 32-bit and 64-bit Windows, generic Linux, and macOS are included. Source code is on GitHub.
Essential Python for data science: HPAT
What HPAT is
Intel’s High Performance Analytics Toolkit (HPAT) is an experimental project for accelerating data analytics and machine learning on clusters. It compiles a subset of Python to code that is automatically parallelized across clusters using tooling from the Open MPI project.
How HPAT helps data science
HPAT uses Numba, but unlike that project and Cython, it doesn't compile Python as is. Instead, it takes a restricted subset of the language (chiefly NumPy arrays and Pandas dataframes) and optimizes that code to run across multiple nodes.
Like Numba, HPAT has the @jit decorator that can turn specific functions into their optimized counterparts. It also includes a native I/O module for reading from and writing to HDF5 (not HDFS) files.
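Put together, HPAT code might look like the following sketch. This assumes an environment with HPAT installed on the cluster and is patterned on the @jit decorator described above, so treat the details as illustrative rather than definitive:

```python
import numpy as np
import pandas as pd
import hpat  # assumption: HPAT is installed on every node of the cluster

@hpat.jit  # compile and distribute this function across nodes via MPI
def mean_ratio(n):
    # Stays within HPAT's supported subset: NumPy arrays and dataframes
    df = pd.DataFrame({"a": np.arange(n), "b": np.ones(n)})
    return (df.a / df.b).mean()
```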