Module 1: Python environment setup¶
Overview of Python Visualization Toolkit¶
These two talks provide pretty good overview of the Python visualization landscape.
- Jake VanderPlas at PyCon 2017: The Python Visualization Landscape.
- James Bednar at AnacondaCon 2018: PyViz: Dashboards for Visualizing 1 Billion Datapoints in 30 Lines of Python.
Python environment setup¶
When you work on data visualizations, there are two main choices: local and cloud. Working locally means installing various packages in your computer and working on the cloud means using services like Google Colaboratory.
Local installation can be nice and convenient (when it works) because everything is in your computer. You can work without network connection and everything can be configured exactly as you see fit. However, local installation can be a huge source of pain because there are multiple ways to install packages (e.g., Anaconda, System Python, etc.) and they can interfere and break. This "dependency hell" can be extremely frustrating. At the same time, working locally is limited by the power of your computer. For instance, you may not be able to run a heavy computational task because your laptop is 10 years old or because you refuse to delete a game for a huge dataset.
On the other hand, working on a cloud environment can also be nice and convenient (when it works). You don't need to worry about different installation and dependency hell, as long as the environment that the cloud provider provides is adequate. All computation is done on the cloud so your computer will not suffer. But that also means you can only work within the boundary set by the cloud provider. For instance the free Google Colaboratory has a certain limit on memory and disk usage, and to go beyond you have to pay.
My recommendation is the following: if you are comfortable with managing your computer and multiple Python installations and packages, or want to learn system management, go with local option. You can either use Anaconda or just vanilla Python (with virtuanenv
and pip
). If you don't want to spend time on this, use Google Colaboratory.
Local setup¶
If you don't have much experience with Python ecosystem, we recommend using Anaconda, which makes most data analysis tasks easier. But it is totally fine not to use Anaconda.
Setting up Anaconda environment¶
Anaconda is a popular, all-in-one Python package distribution system. If your work is focused on data analysis, machine learning, visualization, etc., then just sticking with Anaconda is probably the most convenient way to manage multiple Python versions, packages, and virtual environments. Anaconda comes with a package manager called conda
that replaces pip
and virtualenv
. It allows you to install packages, create virtual environments, and more. It even allows you to set a different Python version for each virtual environment. If you use Anaconda, you want to stick to it as much as possible by exclusively using conda
(use pip
only when there is no way to install with conda
).
First, download Anaconda for your system from here. Follow the instructions and reopen your shell (or use the Anaconda GUI system). By default, Anaconda activates base
environment. You can install packages by running
$ conda install packagename
Let's install core Python data packages.
$ conda install numpy scipy pandas scikit-learn matplotlib seaborn jupyter jupyterlab
We usually want to use a virtual environment for each of your project. By using virtual environments, you can isolate each environment from the others and maintain separate sets (versions) of packages. conda
has a built-in support for virtual environments.
$ conda create -n dviz python=3.8
This command creates a virtual environment named (-n
) dviz
with Python 3.8. You can activate the environment (whenever you begins to work on this course) by running
$ conda activate dviz
and deactivate (when you're done) by running
$ conda deactivate
For the full documentation, see https://conda.io/docs/user-guide/tasks/manage-environments.html
When the environment is activated, your prompt will show the name of the environment. Now if you install a package, it will be installed into this environment, but not in the other virtual environment. In other words, each virtual environment maintains its own "world" and totally different packages can be installed in each "world".
But even if you have created a virtual environment and installed all necessary packages, you can't use this environment in the Jupyter notebook/Lab yet. Jupyter needs a special program called "IPython kernel" to execute code. So, for each virtual environment we need to also create a corresponding kernel. We can do manually by using ipykernel
package like following:
$ conda activate dviz
$ conda install ipykernel
$ python -m ipykernel install --user --name=dviz
or we can use nb_conda
package by installing it.
$ conda install nb_conda
Once it's installed, kernels for all virtual environments created by conda
are automatically created and show up in Jupyter. So this is a much more convenient option.
pip
+ virtualenv
¶
If you are not using Anaconda, you can use Python's virtualenv
to create virtual environments and pip
to install packages to the virtual environments. See https://docs.python.org/3/tutorial/venv.html For instance, you can run the following to create a virtual environment in the current directory (it will create dviz-env
folder).
$ python3 -m venv dviz-env
and then activate and deactivate it by
$ source dviz-env/bin/activate
$ deactivate
Once you activate a virtual environment, you can install packages into it with pip
.
$ pip install packagename
Like the case with conda
, we need to install a kernel for each virtual environment to use it in Jupyter.
$ source dviz-env/bin/activate
$ pip install ipykernel
$ python -m ipykernel install --user --name=dviz
Jupyter¶
Once you have setup your local environment, you can run
$ jupyter lab
or use Jupyter notebook.
jupyter notebook
Jupyter lab is the 'next generation' notebook system with many powerful features. Some packages that we use work more nicely with Jupyter lab (for some lab assignments you may need to use jupyter notebook instead of the lab).
Jupyter also has a desktop app: https://github.com/jupyterlab/jupyterlab-desktop
VS code¶
Visual Studio code is a powerful editor and it supports notebooks and interactivity for Python. If you open a Jupyter notebook file, it works seamlessly within VS code. With the Remote - SSH plugin, you can also maintain almost identical experience using a computing server.
Cloud setup¶
There are many cloud-based Jupyter notebook services. The best option is probably the Google Colaboratory. It allows installation of packages and you can even use GPU/TPUs!
Google colaboratory¶
Google Colaboratory is Google's collaborative Jupyter notebook service. This is the recommended cloud option. Most packages that we use are already installed and you can install additioanl packages by running
!pip install packagename
in the computing cell.
Other options¶
- Azure notebooks: Microsoft also has a cloud notebook service called Azure notebooks. This service also allows installing new packages through
!pip install ...
. - CoCalc (https://cocalc.com/) is a service by SageMath. You can use it freely but the free version is slow and can be turned off without warning. Most of the packages that we use are pre-installed. We may be able to provide a subscription through the school.
- Kaggle Kernels: The famous machine learning / data science competition service Kaggle offers cloud-based notebooks called Kaggle kernels. Because you can directly use all the Kaggle datasets, it is an excellent option to do your project if you use one of the Kaggle datasets. It allows uploading your own dataset and install some packages, but not all packages are supported.
Assignment¶
You can either use a local environment or a cloud environment.
- Set up your local Python environment following the instructions. You should be using a virtual environment on your local machine.
- Install Jupyter notebook and Jupyter lab.
- Launch jupyter notebook (lab)
- Create a new notebook and play with it. Print "Hello world!".
If you want to use a cloud environment,
- Play with the Google colaboratory.
- Try installing/importing the following packages in the notebook.
- Create a cell that prints "Hello world!".
- Submit the notebook.
Finally, these are the packages that we plan to use. So check out their homepages and figure out what they are about.
- Jupyter Notebook and Lab: https://jupyter.org/
- numpy: http://www.numpy.org/
- scipy: http://www.scipy.org/
- matplotlib: http://matplotlib.org/
- seaborn: http://seaborn.pydata.org/
- pandas: http://pandas.pydata.org/
- scikit-learn: http://scikit-learn.org/stable/
- altair: https://github.com/altair-viz/altair
- vega_datasets: https://github.com/altair-viz/vega_datasets
- bokeh: http://bokeh.pydata.org/en/latest/
- datashader: http://datashader.org/
- holoviews: http://holoviews.org/
- wordcloud: https://github.com/amueller/word_cloud
- spacy: https://spacy.io/
Install them using your package manager (conda or pip).
Once you have installed the Jupyter locally or succeeded with a cloud environment, run the following import cell to make sure that every package is installed successfully. Submit the notebook on the canvas.
!pip install numpy scipy matplotlib seaborn pandas altair vega_datasets scikit-learn bokeh datashader holoviews wordcloud spacy
import numpy
import scipy
import matplotlib
import seaborn
import pandas
import altair
import vega_datasets
import scikit-learn
import bokeh
import datashader
import holoviews
import wordcloud
import spacy