Python Data Analysis Ecosystem — A Beginner’s Roadmap
Python is one of the most widely used programming languages. It is prevalent in data science, data analysis, machine learning, and deep learning, and it is also used in web development (e.g. with the Django web framework) and automation.
Fun fact: The name of the programming language, which was first released in 1991, goes back to the comedy group Monty Python and not to the snake species.
In this article, you will learn in 5 minutes how to get started with Python as a beginner, how to set up your working environment, and what the Python ecosystem for data analysis/data science looks like.
Python Data Analysis Ecosystem
Python as a programming language is known for its simplicity and versatility. You can use Conda to manage different project environments. Conda is a package management and environment management system that handles installing, running, and updating software packages and their dependencies. An alternative is Poetry, although Conda is the more common choice in data science.
You can use GitHub for version control and teamwork, where you can host or share code. Anaconda is a free distribution of Python (and R) and contains over 1500 packages for data science and machine learning. The Python Package Index (PyPI) is a repository of software for the Python programming language that makes it easy to search for and install Python packages. With Binder, you can create interactive, reproducible environments from Git repositories that contain Jupyter notebooks.
The specific tools include libraries that make your life easier. NumPy is a library for working with multidimensional arrays. PyTorch and TensorFlow are libraries that you can use primarily for machine learning, for example to build and train neural networks. In scikit-learn you will find tools for classical machine learning and predictive data analysis. You will almost certainly come into contact with pandas right from the start. pandas is a library for data manipulation and analysis.
To interact with the code, you can use VS Code or, as I recommend for beginners, JupyterLab. JupyterLab, the successor to Jupyter Notebook, is a web-based interactive development environment in which you can write Python code and execute it directly. You also have the option of using Markdown in it. With Markdown, you can write easily readable text next to the code blocks directly in the notebook.
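To see which of the libraries named above are already available in your current Python installation, a small check like the following can help. It is only a sketch: the function name installed_libraries and the LIBRARIES dictionary are made up for this illustration, and the check only tests whether a module can be found, not whether it works correctly.

```python
from importlib.util import find_spec

# Import names of the libraries mentioned above. Note that scikit-learn
# is imported as "sklearn" and PyTorch as "torch".
LIBRARIES = {
    "NumPy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "PyTorch": "torch",
    "TensorFlow": "tensorflow",
}


def installed_libraries(libraries=LIBRARIES):
    """Map each library name to True if its module can be found (hypothetical helper)."""
    return {
        name: find_spec(module) is not None
        for name, module in libraries.items()
    }


if __name__ == "__main__":
    for name, available in installed_libraries().items():
        status = "installed" if available else "not installed"
        print(f"{name}: {status}")
```

If a library is reported as not installed, that is not a problem at this point. You can add it later to the environment of the project that needs it.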
Installation of Python
To install Python on your Windows operating system, you can download the installation file from the official Python website. When going through the installation, it is important that you select the option “Add Python to PATH” so that you can then execute commands via the Windows command line.
If you first want to check whether you already have a Python version installed, you can enter the following command in PowerShell or the command prompt (press Windows key+R, type cmd, and confirm with Enter):
python --version
If Python is installed, the version number is printed in the terminal.
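You can also query the version from inside Python itself, for example in a notebook cell. The following sketch uses the standard library attribute sys.version_info. The helper names and the minimum version of 3.10 are assumptions chosen for this illustration (3.10 is simply the version used later in this article).

```python
import sys


def python_version_string(version_info=sys.version_info):
    """Format a version tuple as 'major.minor.micro' (hypothetical helper)."""
    return f"{version_info[0]}.{version_info[1]}.{version_info[2]}"


def meets_minimum(minimum=(3, 10), version_info=sys.version_info):
    """Return True if the interpreter is at least the given major.minor version."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print(f"Running Python {python_version_string()}")
    if not meets_minimum():
        print("This is older than the Python 3.10 assumed in this article.")
```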
Your Working Environment
It’s best to install Anaconda to get started. Anaconda is a comprehensive distribution that already contains many relevant libraries for data science. You can think of a distribution as a large box that already contains many tools and materials specifically for data scientists and data analysts. This means you don’t have to start from scratch every time you start a new project. In addition to JupyterLab and Jupyter Notebook, many packages are already pre-installed, so you only have to import them afterward.
The best way to start your first Python data science project is with JupyterLab or Jupyter Notebook. Both are interactive environments that are particularly suitable for exploratory data analysis and machine learning. A cloud-based alternative is Google Colab. It gives you access to hosted computing resources, including GPUs, without having to perform a complex local installation.
Setting up an Environment for a Project
Once you have installed Anaconda, you can use the “Anaconda Prompt” or the “Anaconda Powershell Prompt” on a Windows system. These are special terminals in which Conda is already integrated. Conda serves as a tool for package and environment management and allows you to create isolated environments for different projects. This prevents the requirements and dependencies of different projects from interfering with each other. In each of these environments, you can install the libraries and packages required for your project, such as NumPy, pandas, or Seaborn. These packages may in turn depend on other packages, which would make management complicated without a tool like Conda.
1. Type the command conda in your terminal to check whether Conda is installed. If it is, Conda prints its usage help.
The term `(base)` in the terminal shows the current environment. This is the standard environment of Anaconda. It’s recommended not to install any additional packages in this “base” environment to keep it clean and unchanged. Instead, you should create a separate environment for each new project.
2. You can set up a new environment with the following command:
conda create --name NameEnvironment python=3.10 jupyterlab scikit-learn
This creates a new environment named NameEnvironment with Python 3.10, JupyterLab (the successor to Jupyter Notebook), and the scikit-learn library installed.
3. When asked if you want to continue, select y for yes.
4. With the following command you can switch to the created environment:
conda activate NameEnvironment
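To confirm from within Python which environment you are working in, you can read the CONDA_DEFAULT_ENV environment variable, which Conda sets when an environment is activated. The following is a minimal sketch. The function names are invented for this illustration, and for environments created by path rather than by name the variable may contain a path instead of a short name.

```python
import os
import sys


def active_conda_environment(environ=os.environ):
    """Return the active Conda environment name, or None (hypothetical helper)."""
    return environ.get("CONDA_DEFAULT_ENV") or None


def describe_environment(environ=os.environ):
    """Return a short, human-readable description of the active environment."""
    name = active_conda_environment(environ)
    if name is None:
        return "No Conda environment appears to be active."
    if name == "base":
        return "You are in the base environment. Consider activating a project environment."
    return f"Active Conda environment: {name}"


if __name__ == "__main__":
    print(describe_environment())
    # sys.prefix points to the directory of the running Python installation.
    print(f"Interpreter location: {sys.prefix}")
```

Running this in a notebook cell is a quick way to notice that JupyterLab was started from the base environment instead of your project environment.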
Another option is to create a .yml file that lists the environment name, the channels, and all the packages you want to install, and then create your environment from it with the following command. Either run the command from the directory that contains the file or pass the full path to the file:
conda env create -f "PathToYourFile.yml"
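To give an idea of what such a file contains, the sketch below builds the text of an environment file with the keys name, channels, and dependencies. The helper build_environment_yml is hypothetical, and the conda-forge channel is an assumption made for this example. The packages are the ones from the conda create command above.

```python
def build_environment_yml(name, channels, dependencies):
    """Return the text of a Conda environment file (hypothetical helper)."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dependency}" for dependency in dependencies]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = build_environment_yml(
        name="NameEnvironment",
        channels=["conda-forge"],  # assumption: any channel you use works here
        dependencies=["python=3.10", "jupyterlab", "scikit-learn"],
    )
    # Save the printed text in a .yml file to use it with conda env create.
    print(text)
```

In practice you would usually write this short file by hand. The point of the sketch is only to show the structure that Conda expects.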
Conclusion
Python has established itself as one of the most widely used programming languages. If you know Python, you can use the language in various areas such as data science, but also on the web. To dive into Python, install Python and Anaconda, create an environment for your project, and start with a simple data analysis task (for example, exploratory data analysis) in JupyterLab or Jupyter Notebook.
In the next article, I will introduce you to the 5–10 most important Python libraries that every beginner should know. Feel free to share your experiences and challenges you have encountered when getting started with Python programming.