# Setting up Python environments

## Learning objectives

- Learn why we need virtual environments.
- Learn how to create and use virtual environments using `venv`.
- Learn how to install packages using `pip`.
- Learn how to create an environment specification file using `pip freeze`.
- Learn the differences between the major environment management tools `venv`, `conda`, and `poetry`.
- Learn the basics of Anaconda environments and conda.
- Learn how to create a virtual environment and install packages.
- Learn how to use the kernel with Jupyter notebooks/lab.
"Works-on-my-machine" problem and computing environment¶
Since the beginning of computing, this has been the most common problem:

To run any piece of code on any computer, you need many supporting software — the computing environment. The computing environment includes the operating system, programming language, and libraries, and so on.
To address this, we version software and ensures compatability across versions, tracks dependencies, and so on. But they are not perfect, and we may (maybe eternally, gasp) have to deal with the problem of "it worked on my machine."
Still, there are many things we can do to mitigate the "works-on-my-machine" problem. Here, I'd like to provide some general ideas and pointers to help you set up your Python environment.
## Reasons to learn Python environment management

Setting up and maintaining Python environments can be painful! Python has never been particularly good at dependency management. Because Python is an interpreted language, it is easy to pick up and start coding, and this approachability has attracted an enormous number of users and use cases.

This is a good thing, but it also contributes to the messiness of the Python ecosystem. For use cases where performance is vital (e.g., machine learning and scientific computing) — because Python is a slow, interpreted language — people had to write performance-critical code in C/C++/Fortran and wrap it in Python. This is great for performance, but it makes dependency management even more complicated.
(Python Environment, by Randall Munroe, https://xkcd.com/1987/)
I must admit that learning how to set up and manage Python environments is not the most exciting thing. Also, you may be lucky enough to work with a dedicated DevOps team that takes care of this for you, or you may be in a situation where you can exclusively use cloud-based services like Google Colab.
On the other hand, you may also be in a situation where you have to set up your own environment, to use cutting-edge packages or due to the constraints of your organization. In such cases, whether you can set up and manage your Python environment, and whether you can successfully install and use a single critical package, can make or break your project.
In addition, because environment management touches on technical details of how your computer works, it can be a great learning opportunity: it can help you understand how your computer and Python work under the hood. Therefore, I encourage you to keep learning how to set up and manage Python environments and packages!
Having said that, struggling to learn how to manage Python environments on top of completing the weekly assignments and other courses may be too much for you, especially if you are new to programming. So, while I encourage everyone to learn the basics of Python environment management, it will not be a requirement. Please do feel free to use Google Colab or other cloud-based services to circumvent environment management if you get stuck!
## A general principle: use virtual environments!

### Problem with having a single global environment

Imagine a data scientist, Alice, who works on two very different projects. In one project, she is working on a machine learning model, which requires a cutting-edge package that is being actively developed. This package makes use of the most recent features (and thus the most recent versions) of other foundational packages (e.g., the latest version of `numpy`).

In another project, she is debugging and maintaining a legacy codebase that breaks if she uses a recent version of `numpy`. What should she do in this situation?

If Alice installs the latest version of `numpy` globally, she will not be able to test the legacy codebase, which requires an older version of `numpy`, and vice versa. She could potentially buy a new computer for each project, but that's just not practical. 🤑
### Solution: virtual environments

The solution is to use virtual environments! A virtual environment is a self-contained environment that contains its own versions of software packages. Whether a package is pure Python code or binary code, there is nothing wrong with keeping multiple versions on the same computer, as long as they are isolated from each other and we can clearly specify which version to use for each project.

This is exactly what all virtual environment tools do. A virtual environment is usually just a folder somewhere on your computer (e.g., in your project directory or in a dedicated directory for virtual environments) that contains a copy of Python and other packages. When you activate a virtual environment, it modifies your `PATH` environment variable so that you use whatever is in your virtual environment instead of the global version. When you deactivate the virtual environment, it restores the `PATH` variable to its original state.

In the most basic sense, that's it!
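If you want to see this `PATH` mechanism with your own eyes, here is a minimal sketch for macOS/Linux, assuming a virtual environment named `myenv` has already been created (the exact paths will differ on your machine):

```bash
echo $PATH                   # e.g., /usr/local/bin:/usr/bin:/bin

source myenv/bin/activate    # activate the environment
echo $PATH                   # now starts with .../myenv/bin

deactivate                   # deactivate it again
echo $PATH                   # back to the original value
```

Because the shell searches `PATH` from left to right, putting `myenv/bin` at the front is enough to make `python` and `pip` resolve to the environment's own copies.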
### Virtual environments are not just for Python

Virtual environments are not just for Python. For example, Node.js has `nvm` (Node Version Manager), Ruby has `rvm` (Ruby Version Manager) and `bundler`, and so on.

If you go one step further, you can use a more general tool like Docker to create a virtual environment that contains not only Python but also specific versions of other software (e.g., a database) that your project requires.

## Which tools to use?

There are (too) many tools for managing virtual environments and packages in the Python ecosystem. You can use the barebones `venv` module, or you can use super powerful, yet complex tools like `conda` (and the whole Anaconda ecosystem). Moreover, there are many online services that provide cloud-based Python environments, like Google Colab. Here is my quick and dirty recommendation:

- Try Google Colab if you don't want to deal with environment management right now and just want to get something done quickly.
- If you want to understand how the whole virtual environment system and Python packages work, learn the basics of `venv` and `pip` first. They are the most basic tools and let you learn the core concepts.
- If your project uses fairly standard packages, start with `uv`. It is a modern, super-fast Python package manager that is gaining huge support right now. I believe it will likely become the de facto standard in the future.
- If you are working on a data science project and need lots of scientific computing packages, you may have to use `conda` and the Anaconda system. The main reason to use `conda` is that it accommodates many non-Python packages that must be installed outside of the Python package system and that you cannot install with `pip` or `uv` or any other Python package manager.

Below, you can find basic instructions for each of these tools:

- See "Python on the cloud" for some basic instructions on how to use Google Colab.
- See "Basic virtual environment management with `venv`" for the basics of Python virtual environments.
- See "Anaconda: a powerful Python environment management tool for data science" for the basics of `conda` and Anaconda.
- See "uv: A modern, fast Python package installer and resolver" for the basics of `uv`.
## Basic virtual environment management with `venv`

### What is `venv`?

`venv` is a built-in module in Python that allows us to create isolated virtual environments. Remember that you can think of each virtual environment as a directory that contains Python libraries, each with a specific version. This isolation means you can work on multiple Python projects with different dependencies on the same machine.
### Creating a Virtual Environment

- Open your terminal (Command Prompt on Windows, Terminal on macOS/Linux).
- Navigate to the directory where you want to create your virtual environment using the `cd` command.
- Run the following command:

```
python -m venv myenv
```

or

```
python3 -m venv myenv
```

What does it do? This command creates a new directory named `myenv` (or whatever you name it) in your current directory. This directory will contain the Python interpreter, a copy of the `pip` package manager, and other necessary files. It's a self-contained environment where you can install packages without affecting the global Python installation.

But this step does not activate your virtual environment. You need to activate it to use it.
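If you are curious, peek inside the directory that was just created. On macOS/Linux, the layout looks roughly like this (the details vary by platform and Python version; on Windows, the scripts live in `Scripts\` instead of `bin/`):

```
myenv/
├── bin/           # activation scripts, plus the python and pip executables
├── include/       # C headers used when compiling some packages
├── lib/           # site-packages (where installed packages go)
└── pyvenv.cfg     # metadata about this environment
```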
### Activating the Virtual Environment

Once the environment is created, you need to activate it.

- On Windows, run:

```
myenv\Scripts\activate
```

- On macOS/Linux, run:

```
source myenv/bin/activate
```

What does this do? Activating the virtual environment adjusts your shell's environment variables so that when you run `python`, it uses the environment's Python interpreter, and when you run `pip`, it manages the environment's packages. It also changes your prompt to show the name of the activated environment to let you know that you're using a virtual environment. Always pay attention to which environment you're using!
### Installing Packages in the Virtual Environment

With the environment activated, you can install Python packages using `pip`. For example, to install `networkx`, run:

```
pip install networkx
```

If you have activated the virtual environment, `networkx` will be installed into the virtual environment.

Let's check this. First, we can run `pip list` to see which packages are installed in the virtual environment:

```
pip list
```

You should see something like this:

```
Package    Version
---------- -------
networkx   X.X.X
pip        XX.X.X
setuptools XX.X.X
```

You can also see this by navigating into the virtual environment (remember, it's just a directory):

```
cd myenv
ls
```

You can do something along these lines to see all the packages installed in the virtual environment (adjust the Python version to match yours):

```
ls myenv/lib/python3.11/site-packages
```
### Letting others create the same environment

You can share the list of packages installed in your virtual environment with others by creating a `requirements.txt` file:

```
pip freeze > requirements.txt
```

This creates a `requirements.txt` file that lists the packages installed in your virtual environment, pinned to their exact versions. Try it and see what's in the file.
### Deactivating the Virtual Environment

When you're done working in the virtual environment, you can deactivate it by running:

```
deactivate
```

This restores your shell's environment variables to their normal state, so that `python` and `pip` refer to the global Python installation again. So when you're done with a particular project, be sure to deactivate the virtual environment.
### Deleting the Virtual Environment

- If you no longer need the virtual environment, you can simply delete the environment's folder. Everything installed in the virtual environment will be deleted.
- Use your file manager or the command line to delete the `myenv` directory.

What happens? This is just a clean-up step. Since all the environment's files are contained within this directory, deleting it removes the environment completely.

Now you've learned the most basic usage of `venv` — how to create, use, and manage a basic Python virtual environment. This is a fundamental skill in Python development, especially when working on multiple projects or when projects have differing dependencies. Remember, each virtual environment is independent, so feel free to experiment without worrying about affecting other projects or your system's Python setup!
## Anaconda: a powerful Python environment management tool for data science

Although `venv` may be all you need for many use cases (including this course), there are more powerful tools. One such tool, probably the most popular for data science, is Anaconda. Using Anaconda is optional for this course, but I recommend you play with it and see if it works for you. It also provides a nice graphical user interface (GUI) for managing environments, which may be helpful if you are not comfortable with the command line.

### What is Anaconda?

Anaconda is not just a virtual environment management tool, but a complete Python distribution that comes with many useful packages for data science. It is often preferred by data scientists because it comes with many packages pre-installed, and it is easy to install additional packages, even those that are nontrivial to install using `pip`. Another feature is that it comes as an isolated environment (a single folder), so it is easy to use and manage even if you are working on a shared computer where you do not have admin access.
### Installing Anaconda

Simply follow the instructions on the Anaconda website for your operating system. The default installation comes with Python and many useful packages for data science.

It also installs `conda`, a command-line tool for managing environments. You can think of `conda` as a combined tool that does more or less what `venv` and `pip` do.
### Using Anaconda

There are two ways to use Anaconda. If you are not yet comfortable with the command line, you can use the Anaconda Navigator GUI. If you are comfortable with the command line, you can use the `conda` command. I will not go into details here, but you can find many tutorials online.
### Creating a virtual environment with conda

Creating a virtual environment is similar to `venv`:

```
conda create --name myenv
```

However, conda does not create the virtual environment (folder) in the current directory. Instead, it creates it in a directory dedicated to all conda environments. The location depends on your setup, and conda will tell you where it is when you create an environment. It also does some "magic" in the background that we will not go into here.

A nice feature of conda, compared with `venv`, is that it allows you to specify which version of Python you want to use. For example, if you want to use Python 3.8, you can run:

```
conda create --name myenv python=3.8
```

So, it is much more straightforward to use a specific version of Python with conda than with `venv`.
### Activating the virtual environment

Once you have created a conda environment, you can activate it using the `conda activate` command:

```
conda activate myenv
```
### Installing packages

You can install packages using the `conda install` command. For example, to install `networkx`, you can run:

```
conda install networkx
```

This looks more or less the same as `pip install`. However, conda is more powerful than `pip` in that it can install packages that are not pure Python packages and cannot be installed with `pip`. Although it is becoming easier to install such packages with `pip`, there are still some packages that are tricky to install that way. This is where Anaconda/conda shines, and the reason why so many data science projects use it.

On the other hand, conda does not "know" every single package out there that can be installed with `pip`, particularly those unrelated to data science. So, you often need to use both `conda` and `pip` to install all the packages you need, which is annoying and confusing! Making things even more complicated, there are conda channels, which are like package repositories containing packages that are not available in the default conda channel. You may not need to worry about this for this course, as long as you don't go deep into cutting-edge packages, but it is something to keep in mind.
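For example, many scientific packages are published on the community-maintained `conda-forge` channel. You can install from a specific channel with the `-c` flag:

```
conda install -c conda-forge networkx
```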
### Deactivating the virtual environment

You can deactivate the virtual environment using the `conda deactivate` command:

```
conda deactivate
```
### Deleting the virtual environment

You can delete the virtual environment using the `conda remove` command:

```
conda remove --name myenv --all
```
### Sharing the environment specification

You can share the environment specification using the `conda env export` command:

```
conda env export > environment.yml
```

And you can create an environment from the specification using the `conda env create` command:

```
conda env create -f environment.yml
```

The `environment.yml` file is similar to a `requirements.txt` file, but it contains more conda-specific information about the environment (why yet another format??? See xkcd: standards, https://xkcd.com/927/).
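To give you a feel for the format, a short, hypothetical `environment.yml` might look like this (the package versions are made up for illustration). Note that it records the channels as well as the packages, and it can even nest a `pip:` section for pip-only dependencies:

```yaml
name: myenv
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - networkx=3.2
  - pip
  - pip:
      - some-pip-only-package   # hypothetical pip-only dependency
```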
### Summary

- Anaconda is a Python distribution that comes with many useful packages for data science.
- It also comes with `conda`, a powerful command-line tool for managing environments.
- You can use the Anaconda Navigator GUI or the `conda` command to manage environments.
- conda's primary power comes from its ability to install packages that are not pure Python packages and cannot be installed with `pip` (somewhat common in data science and scientific computing in general).
- Another superpower of conda is that it allows you to specify which version of Python you want to use.
- However, conda cannot install many non-scientific packages (which can be installed by `pip`). Thus, you'd often need to use both `conda` and `pip` (it is OK to use both in the same environment).
- conda uses its own environment specification file (`environment.yml`), which is similar to `requirements.txt`.

Refer to the official documentation for more details: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
## uv: A modern, fast Python package installer and resolver

`uv` is a new, high-performance Python package installer and resolver written in Rust. It's designed to be a faster, more reliable alternative to `pip`, with several key advantages:

### Key Features of uv

- **Speed**: `uv` is crazy fast. Often 10-100x faster than `pip`.
- **Reliability**: Built-in dependency resolution that's more robust than `pip`'s, helping avoid common dependency conflicts.
- **Compatibility**: Works seamlessly with existing Python tooling and virtual environments.
- **Modern Design**: Written in Rust, with a focus on performance and reliability.
### Installing uv

See https://docs.astral.sh/uv/#installation for details and up-to-date installation instructions.

Note that `uv` is installed independently of your Python installation. In other words, as a package manager it is not constrained by the Python version you have installed, which is one of the reasons why it can work with multiple Python versions (powerful!).
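At the time of writing, the documentation offers a standalone installer along these lines (check the page above for the current command for your platform):

```
# macOS/Linux standalone installer, per the uv documentation
curl -LsSf https://astral.sh/uv/install.sh | sh
```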
### Basic Usage

You can either use `uv` as a drop-in replacement for `pip` or as a full-fledged package manager that replaces tools like `poetry`. It is still fast-evolving, and new features are being added regularly, so I'd encourage you to check the official documentation: see "Projects" for usage as a high-level package manager and "The pip interface" for usage as a drop-in replacement for `pip`. A brief sketch of both styles follows below.

Note that because `uv` is new and evolving, some AI models may not be aware of its full features. So, be sure to check the official documentation!
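Here is a minimal sketch of the two styles, based on my reading of the current CLI (the commands may evolve, so double-check against the docs):

```
# Style 1: the pip interface (drop-in replacement for venv + pip)
uv venv                      # create a virtual environment (./.venv by default)
uv pip install networkx      # install a package into it

# Style 2: the project interface (high-level, poetry-like workflow)
uv init myproject            # scaffold a new project
cd myproject
uv add networkx              # add a dependency to pyproject.toml
uv run python -c "import networkx"   # run code inside the project environment
```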
## Python on the cloud

Cloud-based Python environments are becoming more and more popular with the rise of cloud computing. Among the many advantages of cloud-based environments, two are particularly relevant here.

First, once you have a cloud-based environment set up, you can access it from anywhere. You can work on your project from your laptop, desktop, or even your phone. Second, you don't have to worry about setting up environments, and whatever you have on the cloud tends to be exactly replicable. These cloud environments are usually pre-configured with all the necessary software, so you can start working on your project right away. Whenever you fire up a Python notebook on the cloud, you get a clean (exactly the same) environment that is ready to use, although changing this base environment can be difficult.

Thanks to these strengths, cloud-based environments have seen wide adoption. Many research papers and tutorials are now published as Jupyter notebooks on Google Colaboratory (Colab), so that they can be easily reproduced and run by anyone.
### Google's Colaboratory

Google Colab is a free cloud-based Python environment that comes with many useful packages pre-installed. It is based on the Jupyter notebook, so it is easy to use and share. It is also integrated with Google Drive, so you can easily share your notebooks with others and use data stored in your Google Drive.

This is what I recommend you use for this course if you are not comfortable with setting up your own environment on your computer.
### Installing packages on Colab

Each Colab notebook is like a virtual computer that is created whenever you create or open a notebook. It also allows you to install packages. You can install packages using `pip` as you would on your own computer. However, because Colab does not provide you with a command-line interface, you need to run the `pip` command inside your notebook.

For example, you can run the following code in a Colab notebook to install `networkx` (it is already installed, though):

```
!pip install networkx
```

Note that you need to put `!` in front of the command. Whenever we start a line with `!` in a Jupyter notebook, the command is run in the terminal instead of being interpreted as Python code. So this command is equivalent to running `pip install networkx` in the terminal.

Also note that you need to run this command every time you open a new notebook. This is because each notebook session is like a brand-new computer, so extra packages must be installed each time.
## Jupyter notebook

Google Colab is essentially a Jupyter notebook running on virtual computers in Google Cloud. So, what is a Jupyter notebook?

### Interactivity can be incredibly powerful

Python is an interpreted language, so we can have a "conversation" with the Python interpreter. For example, we can first define some variables and then use them in the next line:

```
In [1]: x = 1
In [2]: y = 2
In [3]: x + y
Out[3]: 3
```

People realized that this ability to converse with a programming language can be incredibly powerful for data science (and scientific computing in general). For instance, when you load a big tabular dataset, it is super useful to be able to explore it interactively and process the data step by step.

Imagine writing a script that performs a series of complicated data processing operations and analyses, where each step depends on the results of the previous steps. To develop this script without any interactivity, you need to go through a tedious loop of (1) writing the initial script, (2) running it, (3) checking the results, and (4) going back to step (1) to change the script. This is not only tedious but also error-prone. Moreover, as the size of the data increases, this process becomes more and more inefficient: you will wait a long time for each iteration.
### The idea of the computational/computable document

Probably the pioneer that first realized this potential was Mathematica, a popular software system for mathematical computing. It provides a powerful interface where you can interactively create a document that contains text, code, and visualizations. Such a document can not only present the results of an analysis; it can also contain the code that can be executed to reproduce the results, and even be modified to explore alternatives. This was a revolutionary idea.

However, Mathematica was not free and open-source, and it was not Python! This great idea began to spread to other languages, leading to the IPython and Jupyter projects.

### IPython notebook and Jupyter

The IPython (Interactive Python) project was a nice attempt to bring this idea to Python. It was a command-line tool (an IDLE replacement) that allowed you to interactively write and execute Python code. It also implemented the idea of the computational document via the "IPython notebook", which did pretty much what Mathematica was doing, but without the fancy GUI and with a limited ability to create and interact with visualizations.

In terms of capability, it was not a match for Mathematica, but it was free and open-source, and it was Python! It was also a great tool for data science, and it became very popular among data scientists. This eventually led to the creation of the Jupyter project.

The Jupyter project was born because the same idea can be applied to many other languages besides Python, and because people realized that the interface of the IPython notebook could be extracted and made language-agnostic (and web-based!). Whatever interpreted language we use, we can have exactly the same interface, which does not care what language people are typing into it. The language's interpreter then interprets the code and returns the results to the interface.

The name Jupyter comes from the combination of Julia, Python, and R, the three languages that the Jupyter project initially supported (it now supports many more).
### What is JupyterLab?

JupyterLab is the successor of the initial Jupyter notebook interface. It aims to be a more comprehensive development environment where you can put together multiple notebooks, a terminal, a text editor, and so on. It is under active development, and it is what you want to use in most cases.

### Installing JupyterLab

You can install JupyterLab using `pip` or `conda`:

```
pip install jupyterlab
```

or

```
conda install -c conda-forge jupyterlab
```
### Running JupyterLab

You can run JupyterLab using the `jupyter lab` command:

```
jupyter lab
```

This will open a new tab in your browser. You can create a new notebook by clicking the `+` button in the top-left corner, or by clicking `File -> New -> Notebook` in the menu bar.
"Kernels" and ipykernel¶
One thing that may be quite confusing at first, especially with respect to the virtual environments, is the concept of "kernel". Let's say you have created a virtual environment and then ran jupyter lab
in the terminal. If you're using Jupyter for the first time, I'd bet that your natural assumption is that Jupyter lab is using the virtual environment you have created and you can use all the packages that you have just installed. But that's not the case! 😬
As mentioned earlier, Jupyter is language-agnostic web interface. It does not care about what language you are using. It just sends the code you write to the language interpreter and returns the results. So, you need to tell Jupyter which language (and which version!) you want to use. This is what "kernel" means.
Say, you have created a virtual environment named myenv
and installed networkx
in it. The Python interpreter in this virtual environment knows about networkx
, but Jupyter does not know about this particular python interpreter (a "kernel") and myenv
environment until we tell it.
So, how can we let Jupyter know about our virtual environment and corresponding kernel (Python interpreter)? Here is the steps that we need to take:
- Install
ipykernel
in the virtual environment. - Register the virtual environment as a kernel.
- Run Jupyter lab and select the kernel.
#### 1. Install ipykernel in the virtual environment

First, we need to install `ipykernel` in the virtual environment. This is the package that allows us to register the virtual environment as a kernel:

```
pip install ipykernel
```

#### 2. Register the virtual environment as a kernel

Next, we need to register the virtual environment as a kernel. This is done by running the following command:

```
python -m ipykernel install --user --name=myenv
```

This registers the current virtual environment as a kernel named `myenv`. You can check this by running:

```
jupyter kernelspec list
```
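The output lists each registered kernel and where its spec lives. It looks roughly like this (the exact names and paths depend on your system):

```
Available kernels:
  python3    /usr/local/share/jupyter/kernels/python3
  myenv      /home/alice/.local/share/jupyter/kernels/myenv
```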
#### 3. Run JupyterLab and select the kernel

Then, when we open JupyterLab, we can select the kernel we want to use. Whenever we open a new or existing notebook, we can select or change the kernel by clicking `Kernel -> Change kernel` in the menu bar. When you change the kernel, you get access to all the packages installed in the corresponding virtual environment.

Now you are ready to use JupyterLab with your virtual environment! 🎉
### A convenient way to work with Jupyter notebooks: VS Code and other IDEs

VS Code and other IDEs let you work with Jupyter notebooks in more or less the same way as JupyterLab (remember that Jupyter is just a language-agnostic interface for Python and other languages). They also remove a lot of the tedious steps that JupyterLab requires. For example, you don't need to install `ipykernel` and register the virtual environment as a kernel; VS Code handles that and lets you simply choose the virtual environment you want to use.

I strongly recommend using VS Code for this course and for data science in general. It is a great IDE with a powerful plugin ecosystem.

You can find great tutorials on how to use VS Code for Jupyter notebooks. Here are some examples: https://www.youtube.com/results?search_query=vscode+jupyter+notebook
## Assignments

Q1: Create a virtual environment using your preferred method (e.g., `venv`, `conda`, or `poetry`). Install the following packages into your virtual environment. Create your environment file and submit it.

List of packages to install (you can install more if you want):

- `numpy`
- `pandas`
- `matplotlib`
- `networkx`
- `scipy`
- `jupyterlab`
- `ipykernel`
- `nbformat`

The environment file can be the following:

- If you use `venv`, submit the `requirements.txt` file created by `pip freeze`.
- If you use `conda`, submit the `environment.yml` file created by `conda env export > environment.yml`.
- If you use `poetry`, submit the `pyproject.toml` and `poetry.lock` files created by `poetry`.

Q2: Set up JupyterLab and `ipykernel` in your virtual environment. Create a notebook, choose the right kernel, create a cell, and import the packages that you have installed into the virtual environment (see below). Make sure to run this cell and check whether they can be imported successfully. Submit this notebook as an HTML file and a notebook file.

Check this document for instructions on exporting a notebook as an HTML file.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import scipy as sp
```