Dev environment setup for Python data science project

Catalogue
  1. 1. Basic environment requirements
  2. 2. Use pyenv to manage python versions
    1. 2.1. Install pyenv
  3. 3. Use poetry for packge/environment management
    1. 3.1. Install poetry
  4. 4. Build our project
    1. 4.1. Initial setup
    2. 4.2. Add some essential packages for DS development
    3. 4.3. Use jupyter with poetry
    4. 4.4. Run your python script
    5. 4.5. Add black config
    6. 4.6. Install pre-commit
  5. 5. Summary
  6. 6. Reference

The ideal and easy dev setup guideline for a Python-based data science project.

This blog serves as a note for myself when I want to build a Python based data science project. This is a minimal setup and mostly for ad-hoc or light-weight DS experiment projects. We can even drop the pytest part if not needed, or add more parts such as continuous integration and deployment if the project becomes heavier and more mature.

Basic environment requirements

First, let’s lay out the basic environment that we have for the OS system. I’m using macOS Catalina 10.15.3 at the time of writing. Also, I don’t use conda now, but I have to admit that I used conda heavily, and then realized that it is not flexible enough for me to quickly setup or remove light weight project environment.

  • To uninstall conda, we can do this:
    1
    2
    conda install anaconda-clean
    anaconda-clean --yes

Don’t forget to remove the conda initializing bash script and remove it from your PATH in your ~/.bash_profile or ~/.zshrc if you are using zsh

  • Make sure your xcode is updated.

    1
    xcode-select --install
  • Make sure your homebrew is updated

    1
    brew upgrade

If you still want to keep conda, here is the recipe, thanks to this stackoverflow thread:
Expose command conda but don’t activate any environment, even the base environment. Execute the following commands in your shell.

1
2
3
4
5
6
7
# Run the content in the shell

# init conda, the following command write scripts into your shell init file automatically
conda init

# disable init of env "base"
conda config --set auto_activate_base false

Note: After this setup, the default python is the one set by pyenv global. Use pyenv and conda to manage environments separately.

1
2
3
4
5
# virtual environments from conda
conda env create new-env python=3.6
conda env list
conda activate new-env
conda deactivate

Use pyenv to manage python versions

Pyenv is a python installation manage.


Illustrates the hierarchy that pyenv uses to decide which installed python version to use. From realpython.com

Install pyenv

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Install pyenv.
brew install pyenv

# Put the content into ~/.bashrc or ~/.bash_profile for Bash,
# .zshrc for ZSH

# you may need to add dir of command `pyenv` into PATH,
# if command pyenv is not available yet

if command -v pyenv &>/dev/null; then
eval "$(pyenv init -)"
fi
if command -v pyenv-virtualenv &>/dev/null; then
eval "$(pyenv virtualenv-init -)"
fi

# Reload
source ~/.zshrc

Now let’s install some commonly used python versions:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# See all available python installations.
pyenv install --list

# Install some python versions.
pyenv install 3.7.5
pyenv install 2.7.17
pyenv install 3.8.0

# See all python installations that you have installed.
pyenv versions

# Set the default/global from one of the python versions.
pyenv global 3.8.0

# In the current directory, set the python version. This creates the file .python-version.
pyenv local 2.7.17

# To see which python is currently being used.
pyenv version

Use poetry for packge/environment management

The biggest advantage of it from my point is that unlike conda, it manages virtual environments as well as project dependencies. And we can choose to put the virtual env folder inside the project itself.

Install poetry

1
$ curl -sSL https://raw.githubusercontent.com/sdispater/poetry/master/get-poetry.py | python

If you want to install the beat version, e.g. 1.0.0b9 which provides more functionalities:

1
2
export POETRY_VERSION=1.0.0b9
POETRY_PREVIEW=1 curl -sSL https://raw.githubusercontent.com/sdispater/poetry/develop/get-poetry.py | python

Make sure $HOME/.poetry/env is added to your $PATH variable. If you are using bash, you can achieve this by adding the following to ~/.bash_profile, and then restart your terminal.

1
2
3
# poetry
source $HOME/.poetry/env
export PATH="$HOME/.poetry/bin:$PATH"

When in question, always refer to poetry’s official tutorial

Build our project

Initial setup

If you want to use poetry in an existing project, run poetry init in your project directory to create a project.toml file

If you want to create a new project and manage it using poetry, run poetry new <name>, and it will create a project with the following structure. Then cd to the project directory.

1
2
3
4
5
6
7
8
<name>
├── pyproject.toml
├── README.rst
├── <name>
│ └── __init__.py
└── tests
├── __init__.py
└── test_<name>.py

Since I prefer to put the virtual environment folder inside the project, so let’s change the configuration

1
2
3
poetry config virtualenvs.in-project true  # for poetry version >= 1.0.*
# poetry config settings.virtualenvs.in-project true # for old poetry version
# poetry config --local virtualenvs.in-project true <--- run this if you only want this config to be local

Then run poetry install: it will use poetry.lock if it exists or if will solve dependencies with pyproject.toml and do installation accordingly. And poetry will check if it’s currently inside a virtualenv and, if not, will use an existing one or create a brand new one for you to always work isolated from your global Python installation.

Add some essential packages for DS development

1
2
3
poetry add pandas numpy scipy matplotlib scikit-learn  # dataframe operation
poetry add --dev black flake8 # formatter
poetry add -D pytest # test, -D is the same as --dev

To install a specific version: poetry add sdk="1.4.4". version constraint syntax: ^1.4 means >=1.4.0 <2.0.0. To downgrade the dependency to a certain version, it is the same; no need to poetry remove sdk first.

Use jupyter with poetry

This part is a bit tricky, at least for me at the first time. So basically, Jupyter works with kernels, and will not work out of the box with your virtual environment that poetry created for you. If you wish to work in a jupyter notebook based on your virtual environment, you need to create a kernel for that virtual environment. The code below explains how. The prerequisite is that you have added both jupyter and ipykernel as dependencies in your poetry project.
Then make sure you run the below command within your virtual environment

1
2
3
4
5
6
poetry add jupyter ipykernel jupyter_nbextensions_configurator jupyter_contrib_nbextensions  # jupyter related
poetry run jupyter contrib nbextension install --user
poetry run jupyter nbextensions_configurator enable --user

# Best practice is to use the same name for your kernel as the project.
poetry run ipython kernel install --user --name=<your project name>

Run your python script

Say we have a demo script – src/example.py which uses the dependencies we have installed using poetry add. Then to run this script:

1
2
3
4
5
6
7
8
9
10
11
# Try running the script outside your virtual environment. This won't work.
python src/example.py

# Run the script within your virtual environment, using the 'run'-command.
poetry run python src/example.py

# Spawn a shell within your virtual environment.
poetry shell

# Try running the script again, after having spawned the shell within your virtual environment.
python src/example.py

Add black config

If you want to add custom black formatter rules, add those to pyproject.toml. An example is like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
[tool.black]
line-length = 120
target-version = ['py37']
include = '\.pyi?$'
exclude = '''

(
/(
\.eggs # exclude a few common directories in the
| \.git # root of the project
| \.hg
| \.mypy_cache
| \.tox
| \.venv
| _build
| buck-out
| build
| dist
)/
| foo.py # also separately exclude a file named foo.py in
# the root of the project
)
'''

Install pre-commit

The pre-commit framework is a took which implements pre-commit hooks to your project. Defined hooks are run every time you run git commit -m and will prevent the commit if the hooks fail. Install it with brew install pre-commit.
To install black as a pre-commit hook, add the following in a .pre-commit-config.yaml file to the top of your project:

1
2
3
4
5
6
repos:
- repo: https://github.com/python/black
rev: stable
hooks:
- id: black
language_version: python3.7

To install the hook, go to the top of your project and run pre-commit install

Summary

When we have all the above-mentioned tools installed, the following is the general steps we will need at the next time we initialize a DS Python project:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
poetry new demo
cd demo
pyenv local <python-version>
poetry config settings.virtualenvs.in-project true

poetry add pandas numpy scipy matplotlib scikit-learn # dataframe operation
poetry add --dev black flake8 # formatter
poetry add -D pytest # test

# ===== add jupyter kernels if needed ===== #
poetry add jupyter ipykernel jupyter_nbextensions_configurator jupyter_contrib_nbextensions
poetry run jupyter contrib nbextension install --user
poetry run jupyter nbextensions_configurator enable --user
poetry run ipython kernel install --user --name=<your project name>

Add black configuration to pyproject.toml

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
[tool.black]
line-length = 120
target-version = ['py37']
include = '\.pyi?$'
exclude = '''

(
/(
\.eggs # exclude a few common directories in the
| \.git # root of the project
| \.hg
| \.mypy_cache
| \.tox
| \.venv
| _build
| buck-out
| build
| dist
)/
| foo.py # also separately exclude a file named foo.py in
# the root of the project
)
'''

Add .pre-commit-config.yaml file, and run pre-commit install afterwards

1
2
3
4
5
6
repos:
- repo: https://github.com/python/black
rev: stable
hooks:
- id: black
language_version: python3.7

After writing your code, and before commit your code, run

1
black .  # format your code

After writing your tests, run

1
2
pytest
# pytest -q test/test1.py

Reference

https://zhuanlan.zhihu.com/p/81025311
https://flynn.gg/blog/software-best-practices-python-2019/
https://blog.jayway.com/2019/12/28/pyenv-poetry-saviours-in-the-python-chaos/