Tools Setup
Core Tools
UV Package Manager
# Install UV
curl -LsSf https://astral.sh/uv/install.sh | sh
# Add to PATH (recent installers use ~/.local/bin; older ones used ~/.cargo/bin)
export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"
Python Environment
# Create virtual environment
uv venv
# Activate environment
source .venv/bin/activate
# Install dependencies
uv pip install -r requirements.txt
Tools and Setup
This guide helps you set up your development environment for the 5-Hour Data Engineering Boot Camp.
Required Software
1. Python Installation
- Download Python 3.8 or higher from python.org
- On Windows, check “Add Python to PATH” during installation
- Verify installation:
python --version
pip --version
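The version check can also be scripted from inside Python. This is a minimal sketch that assumes the 3.8 minimum stated in this guide; the names `MIN_VERSION` and `meets_minimum` are made up for illustration.

```python
import sys

MIN_VERSION = (3, 8)  # minimum Python version stated in this guide


def meets_minimum(version_info=None, minimum=MIN_VERSION):
    """Return True if the given (or running) interpreter is new enough."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); the patch level does not matter here.
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum() else "too old"
    print(f"Python {current}: {status}")
```

Tuples compare element by element, so a separate check for the major and minor numbers is not needed.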
2. Code Editor
We recommend Visual Studio Code:
- Download from code.visualstudio.com
- Install recommended extensions:
  - Python
  - Pylance
  - GitLens
  - SQLTools
3. Git
- Download Git from git-scm.com
- Verify installation:
git --version
Project Setup
1. Create a Virtual Environment
# Create a new directory for your project
mkdir data-engineering-bootcamp
cd data-engineering-bootcamp
# Create a virtual environment
python -m venv venv
# Activate the virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
2. Install Required Packages
Create a requirements.txt file with the following content:
pandas>=1.5.0
numpy>=1.21.0
sqlalchemy>=1.4.0
pytest>=7.0.0
apache-airflow>=2.5.0
pydantic>=2.0.0
requests>=2.28.0
Install the packages:
pip install -r requirements.txt
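After the install finishes, a short script can confirm that each package is present. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8). It assumes the requirement lines all use the `name>=version` form shown in this guide, and it only reports what is installed next to the stated minimum without comparing the two. The helper names are hypothetical.

```python
from importlib import metadata  # standard library since Python 3.8

# Requirement lines copied from this guide's requirements.txt.
REQUIREMENTS = """\
pandas>=1.5.0
numpy>=1.21.0
sqlalchemy>=1.4.0
pytest>=7.0.0
apache-airflow>=2.5.0
pydantic>=2.0.0
requests>=2.28.0
"""


def parse_requirements(text):
    """Split 'name>=version' lines into (name, minimum_version) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, minimum = line.partition(">=")
        pairs.append((name.strip(), minimum.strip()))
    return pairs


def installed_version(name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name, minimum in parse_requirements(REQUIREMENTS):
        found = installed_version(name)
        print(f"{name}: installed={found or 'MISSING'}, required>={minimum}")
```

Run it with the virtual environment activated so that it inspects the packages installed for this project rather than the system-wide ones.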
3. Set Up Version Control
# Initialize Git repository
git init
# Create .gitignore file
echo "venv/
.venv/
__pycache__/
*.pyc
.env
.DS_Store" > .gitignore
# Make initial commit
git add .
git commit -m "Initial project setup"
Development Environment
1. Project Structure
Create the following directory structure:
data-engineering-bootcamp/
├── src/
│   ├── extractors/
│   ├── transformers/
│   ├── loaders/
│   └── quality/
├── tests/
├── config/
└── data/
    ├── raw/
    └── processed/
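The layout can also be created with a few lines of Python instead of by hand. This is a sketch using `pathlib` from the standard library; the directory list mirrors the tree in this section, and `PROJECT_DIRS` and `create_structure` are illustrative names, not part of the boot camp materials.

```python
from pathlib import Path

# Directory layout from this section, relative to the project root.
PROJECT_DIRS = [
    "src/extractors",
    "src/transformers",
    "src/loaders",
    "src/quality",
    "tests",
    "config",
    "data/raw",
    "data/processed",
]


def create_structure(root="."):
    """Create every directory in PROJECT_DIRS under root; safe to re-run."""
    root_path = Path(root)
    created = []
    for relative in PROJECT_DIRS:
        target = root_path / relative
        # parents=True builds intermediate folders; exist_ok=True avoids
        # an error when the directory is already there.
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created
```

Calling `create_structure()` from the project root creates any missing directories and leaves existing ones untouched.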
2. Database Setup
For the boot camp, we’ll use SQLite for simplicity:
# Create the data directories
mkdir -p data/raw data/processed
3. Environment Variables
Create a .env file for configuration:
DATABASE_URL=sqlite:///data/processed/database.db
API_KEY=your_api_key_here
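A .env file is not read automatically; something has to load it into the process environment. The sketch below does this with the standard library only, assuming the simple `KEY=VALUE` format shown in this section (no multi-line values or variable expansion). `parse_env` and `load_env` are hypothetical helper names.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(path=".env"):
    """Read the file at path and copy its values into os.environ.

    Variables already set in the real environment are left untouched.
    """
    with open(path, encoding="utf-8") as handle:
        values = parse_env(handle.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

After `load_env()` runs, `os.environ["DATABASE_URL"]` holds the SQLite URL from the file unless the variable was already set in the shell.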
Testing Your Setup
Create a simple test script test_setup.py:
import os

import numpy as np
import pandas as pd
from sqlalchemy import create_engine


def test_environment():
    # Test pandas
    df = pd.DataFrame({'test': [1, 2, 3]})
    assert len(df) == 3

    # Test numpy
    arr = np.array([1, 2, 3])
    assert arr.sum() == 6

    # Test SQLAlchemy (make sure the target directory exists first)
    os.makedirs('data/processed', exist_ok=True)
    engine = create_engine('sqlite:///data/processed/test.db')
    df.to_sql('test', engine, if_exists='replace')

    print("All tests passed! Your environment is ready.")


if __name__ == "__main__":
    test_environment()
Run the test:
python test_setup.py
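To confirm that the script actually wrote to the database, the table can be read back with Python's built-in `sqlite3` module. This is a sketch that assumes the table and column are both named `test`, as in test_setup.py; `read_rows` is an illustrative helper.

```python
import sqlite3


def read_rows(db_path, table="test", column="test"):
    """Return the values stored in one column of a SQLite table."""
    connection = sqlite3.connect(db_path)
    try:
        # ORDER BY rowid returns rows in insertion order for ordinary tables.
        cursor = connection.execute(
            f'SELECT "{column}" FROM "{table}" ORDER BY rowid'
        )
        return [row[0] for row in cursor.fetchall()]
    finally:
        connection.close()
```

After a successful run of test_setup.py, `read_rows("data/processed/test.db")` should return the three values from the test DataFrame.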
Next Steps
Now that your environment is set up, you can:
- Review the Prerequisites if needed
- Start with Data Engineering Fundamentals
- Check out Additional Resources for more learning materials