This project explores the New York Yellow Taxi dataset available from the NYC Taxi & Limousine Commission, focusing on the December 2023 Parquet data.
The goal is to set up a comprehensive data engineering environment on-premises, conduct exploratory data analysis, and provide insights and recommendations to enhance the efficiency and service quality of NYC yellow taxis.
The New York Yellow Taxi service is an integral part of the city's transportation network. This project aims to leverage data engineering practices to uncover insights that could lead to improved taxi services, optimized route management, and better customer satisfaction.
- Software Framework: Docker
- Database: Postgres
- Data Analysis & Exploration: SQL/Python
- Data Visualization: Jupyter Notebook
- Version Control: Git
- Environment Setup: Clone the repository and ensure Docker is installed on your system.
- Database Configuration: Use the provided Docker Compose file to set up a Postgres database container.
- Data Ingestion: Run the Python scripts to ingest data into the Postgres database.
- Analysis: Open the Jupyter notebooks to start exploring the data and generating insights.
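The data ingestion step above can be sketched as follows. This is a minimal illustration, not the repository's actual script. The environment variable names, their defaults, the table name `yellow_taxi_trips`, and the database name `ny_taxi` are all assumptions and should be matched to the values in the Docker Compose file. It also assumes `pandas`, `pyarrow`, `sqlalchemy`, and a Postgres driver such as `psycopg2` are installed.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=None):
    """Build a SQLAlchemy-style PostgreSQL URL from environment variables.

    The variable names and defaults are assumptions for illustration;
    align them with the project's Docker Compose file.
    """
    env = os.environ if env is None else env
    user = quote_plus(env.get("POSTGRES_USER", "postgres"))
    # URL-encode the password so characters like '@' do not break the URL.
    password = quote_plus(env.get("POSTGRES_PASSWORD", "postgres"))
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    database = env.get("POSTGRES_DB", "ny_taxi")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


def ingest_parquet(parquet_path, table_name="yellow_taxi_trips", url=None):
    """Load a Parquet file into a Postgres table and return the row count."""
    # Imported lazily so the URL helper can be used without these packages.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(url or build_postgres_url())
    df = pd.read_parquet(parquet_path)
    # 'replace' recreates the table on each run; chunksize writes the rows
    # in batches instead of a single large insert.
    df.to_sql(table_name, engine, if_exists="replace", index=False,
              chunksize=100_000)
    return len(df)
```

With the database container running, a call such as `ingest_parquet("yellow_tripdata_2023-12.parquet")` would load the December 2023 file, assuming it has been downloaded to the working directory under that name.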
Conduct a thorough exploratory data analysis (EDA) to uncover initial insights and patterns in the data, focusing on the following questions:
- What are the peak hours for taxi demand?
- How does passenger count vary throughout the day?
- What is the average duration of a taxi ride?
- Are there any trends in ride durations or distances over time?
- How does taxi usage vary by area?
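One way to approach the questions above in a notebook is sketched below with pandas. It is a starting point under stated assumptions, not the project's actual analysis. The column names (`tpep_pickup_datetime`, `tpep_dropoff_datetime`, `passenger_count`, `trip_distance`, `PULocationID`) follow the TLC yellow taxi schema. The helper function names and the three-row `sample` frame are hypothetical and exist only to make the example self-contained.

```python
import pandas as pd


def trips_per_hour(trips):
    """Count trips by pickup hour (0-23), busiest hour first."""
    hour = trips["tpep_pickup_datetime"].dt.hour.rename("pickup_hour")
    return hour.value_counts()


def mean_passengers_by_hour(trips):
    """Average passenger count for each pickup hour."""
    hour = trips["tpep_pickup_datetime"].dt.hour.rename("pickup_hour")
    return trips.groupby(hour)["passenger_count"].mean()


def average_duration_minutes(trips):
    """Mean ride duration in minutes, ignoring non-positive durations."""
    duration = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
    minutes = duration.dt.total_seconds() / 60
    return minutes[minutes > 0].mean()


def daily_mean_distance(trips):
    """Mean trip distance per pickup date, for spotting trends over time."""
    day = trips["tpep_pickup_datetime"].dt.date.rename("pickup_date")
    return trips.groupby(day)["trip_distance"].mean()


def trips_by_pickup_zone(trips):
    """Count trips by pickup zone ID, busiest zone first."""
    return trips["PULocationID"].value_counts()


# Hypothetical sample rows; in the notebook, `trips` would instead be read
# from the Postgres table or the Parquet file.
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2023-12-01 08:05", "2023-12-01 08:40", "2023-12-01 18:15"]),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-12-01 08:15", "2023-12-01 09:10", "2023-12-01 18:35"]),
    "passenger_count": [1, 3, 2],
    "trip_distance": [1.2, 5.0, 2.8],
    "PULocationID": [161, 161, 237],
})

print(trips_per_hour(sample))
print(average_duration_minutes(sample))
```

Filtering out non-positive durations is a deliberate but simple choice; the real data may need further cleaning, such as removing implausibly long trips, before the averages are meaningful.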
The insights generated from this analysis could inform strategic decisions to improve taxi efficiency and service quality, such as adjusting fleet sizes during peak hours, optimizing route planning, and tailoring services to meet customer demand more effectively. The complete analysis includes an executive summary at the end of each section.