New York City Taxi & Limousine Commission (TLC) - Data Visualization Suite

Professional visualizations for fare prediction analysis and operational insights

CASE STUDY

NYC TLC Fare Prediction & Ridership Analysis

Exploratory data analysis and visualization to support fare estimation modeling and operational insights.

Business Context / Overview

Background:
Automatidata is a data consulting firm that specializes in turning unused data into business solutions. The firm is consulting for the New York City Taxi and Limousine Commission (TLC), which is responsible for licensing and regulating NYC's 200,000+ taxi cabs and for-hire vehicles.

Why this analysis was required:
The TLC wants a regression model that estimates a taxi fare before the ride begins, based on historical trip data. Such a model would help the TLC improve fare transparency, customer satisfaction, and operational efficiency.

Who benefits:

NYC TLC Operations & Finance Departments
Taxi drivers and fleet operators
NYC residents and visitors through more predictable pricing
Automatidata's data consulting team

Problem Statement

The NYC TLC lacks a reliable, data-driven method to estimate taxi fares in advance. Without accurate fare prediction, passengers face pricing uncertainty, drivers experience income variability, and the commission struggles with fare dispute resolutions. Current manual estimation methods are inconsistent and don't account for temporal, geographical, and trip-specific factors.

Objectives

Analyze historical trip patterns to identify key fare determinants
Examine temporal patterns (monthly, daily, hourly) in ride volumes and revenues
Investigate trip characteristics (distance, duration, passenger count) and their relationship to fares
Identify outliers and data quality issues that could impact model accuracy
Create accessible visualizations for both technical and non-technical stakeholders

Dataset Description

Source: NYC TLC 2017 Yellow Taxi Trip Data (2017_Yellow_Taxi_Trip_Data.csv)
Records: 22,699 taxi trips
Features: 18 original columns including:

trip_distance, fare_amount, total_amount
tpep_pickup_datetime, tpep_dropoff_datetime
passenger_count, PULocationID, DOLocationID
tip_amount, tolls_amount, improvement_surcharge
Derived: trip_duration, month, day_of_week
Time Period: 2017 calendar year

Data Preparation & Cleaning

Converted datetime columns to proper datetime format
Created derived features: trip duration, month, day of week
Handled zeros in passenger_count (33 rides with 0 passengers flagged for investigation)
Assessed outliers in trip_distance, total_amount, and tip_amount
Validated data completeness - no missing values in key columns
Checked for data type consistency across numerical and categorical variables
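
The preparation steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the three sample rows are made up, the timestamp format string is an assumption about how the CSV stores datetimes, and `prepare_trips` and `zero_passengers` are hypothetical names introduced here.

```python
import pandas as pd

# Made-up sample rows using the TLC column names; the real analysis would
# load 2017_Yellow_Taxi_Trip_Data.csv instead.
raw = pd.DataFrame({
    "tpep_pickup_datetime": ["03/25/2017 08:55:43 AM", "04/11/2017 02:53:28 PM", "12/15/2017 07:26:56 AM"],
    "tpep_dropoff_datetime": ["03/25/2017 09:09:47 AM", "04/11/2017 03:19:58 PM", "12/15/2017 07:34:08 AM"],
    "passenger_count": [6, 1, 0],
    "trip_distance": [3.34, 1.80, 1.00],
    "total_amount": [16.56, 20.80, 8.75],
})

def prepare_trips(df):
    """Parse timestamps, derive features, and flag zero-passenger rides."""
    out = df.copy()
    # Assumed timestamp layout (12-hour clock with AM/PM).
    for col in ("tpep_pickup_datetime", "tpep_dropoff_datetime"):
        out[col] = pd.to_datetime(out[col], format="%m/%d/%Y %I:%M:%S %p")
    # Trip duration in minutes.
    elapsed = out["tpep_dropoff_datetime"] - out["tpep_pickup_datetime"]
    out["trip_duration"] = elapsed.dt.total_seconds() / 60
    out["month"] = out["tpep_pickup_datetime"].dt.month_name()
    out["day_of_week"] = out["tpep_pickup_datetime"].dt.day_name()
    # Flag rather than drop, so the rows can still be investigated.
    out["zero_passengers"] = out["passenger_count"] == 0
    return out

trips = prepare_trips(raw)
missing_per_column = trips.isna().sum()  # completeness check per column
```

Flagging zero-passenger rides in a separate column, rather than deleting them, keeps the rows available for the follow-up investigation mentioned above.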

Exploratory Data Analysis (EDA)

Trip distance distribution: Majority of trips under 5 miles; right-skewed, with outliers up to about 34 miles
Temporal analysis: Monthly and daily patterns in ride volume and revenue
Fare composition: Breakdown of base fare, tips, tolls, and surcharges
Passenger analysis: Relationship between passenger count and tip amounts
Geographic patterns: Drop-off location distribution and trip distance by location
Vendor comparison: Trip distribution between two taxi vendors
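
As an illustration of the temporal analysis, the sketch below aggregates ride count and revenue by month and by day of week. The five trips are invented for the example, and `rides_and_revenue`, `MONTH_ORDER`, and `DAY_ORDER` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

MONTH_ORDER = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]
DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
             "Saturday", "Sunday"]

# Made-up trips for illustration only; not values from the TLC data.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2017-01-05 08:00", "2017-01-06 09:30", "2017-01-12 18:15",
        "2017-07-03 12:00", "2017-07-06 22:45",
    ]),
    "total_amount": [12.5, 8.0, 20.0, 15.0, 9.5],
})

def rides_and_revenue(df, key, order):
    """Ride count and summed total_amount per group, in calendar order."""
    summary = df.groupby(key).agg(
        rides=("total_amount", "size"),
        revenue=("total_amount", "sum"),
    )
    # Keep only the labels that actually occur, in calendar order.
    present = [label for label in order if label in summary.index]
    return summary.loc[present]

pickup = trips["tpep_pickup_datetime"]
by_month = rides_and_revenue(trips, pickup.dt.month_name(), MONTH_ORDER)
by_day = rides_and_revenue(trips, pickup.dt.day_name(), DAY_ORDER)
```

Ordering the result by calendar position rather than alphabetically makes the tables ready to plot as bar charts without further sorting.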

Analytical Approach

Statistical summaries using .describe() and .groupby()
Distribution analysis via histograms and box plots
Temporal trend analysis using time-series aggregation
Comparative analysis between vendor performance
Correlation analysis between distance, duration, and fare amount
Outlier detection using IQR method for trip distances and fare amounts
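
A minimal sketch of the IQR rule and the correlation check is shown below. It assumes the conventional 1.5 x IQR fences, which the write-up does not state explicitly; the data is invented and `iqr_outliers` is a hypothetical helper.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values for illustration only.
trips = pd.DataFrame({
    "trip_distance": [0.8, 1.2, 1.5, 2.0, 2.4, 3.0, 3.1, 33.9],
    "fare_amount": [5.0, 6.5, 7.5, 9.0, 10.5, 12.0, 12.5, 95.0],
})

summary = trips.describe()  # count, mean, std, quartiles, min/max
trips["distance_outlier"] = iqr_outliers(trips["trip_distance"])
corr = trips[["trip_distance", "fare_amount"]].corr()  # Pearson by default
```

On this sample only the 33.9-mile trip falls outside the fences. A single extreme point like that can also inflate a Pearson correlation, so it may be worth comparing the coefficient with and without flagged rows.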

Key Insights

Most trips are short-distance: 75% of trips are under 3.06 miles
Temporal patterns: July-August show ride dips; Thursday has highest revenue
Fare determinants: Distance shows strongest correlation with total fare
Tip patterns: Majority of tips are $0-3, with consistent distribution across vendors
Passenger behavior: Single-passenger trips dominate (71% of trips), but solo riders do not tip noticeably more than groups
Geographic distribution: Drop-offs are spread across many locations, but trip volume concentrates at a few popular ones
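
Figures such as the 75th-percentile distance and the single-passenger share can be derived with a few pandas calls. The sketch below uses invented rows, so its numbers will not match the 3.06 miles or 71% reported above.

```python
import pandas as pd

# Made-up values for illustration only.
trips = pd.DataFrame({
    "trip_distance": [0.5, 1.0, 1.6, 2.2, 3.0, 9.5, 1.1, 0.9],
    "passenger_count": [1, 1, 2, 1, 1, 1, 5, 1],
    "tip_amount": [0.0, 1.5, 2.0, 1.0, 2.5, 6.0, 1.8, 0.0],
})

# Distance below which 75% of trips fall.
p75_distance = trips["trip_distance"].quantile(0.75)

# Share of trips carrying exactly one passenger.
single_share = (trips["passenger_count"] == 1).mean()

# Average tip per passenger count, to compare solo riders with groups.
mean_tip_by_passengers = trips.groupby("passenger_count")["tip_amount"].mean()
```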

Visualizations & Reporting

Python (Matplotlib/Seaborn):
- Box plots for trip distance and fare outliers
- Histograms for distance, fare, and tip distributions
- Bar charts for monthly/daily ride counts and revenue
- Comparative plots for vendor performance
- Geographic analysis via drop-off location distributions
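
One way to build the paired box plot and histogram with Matplotlib is sketched below. The distances are made up, `plot_distance_distribution` is a hypothetical helper, and the styling is a guess rather than a reproduction of the original figures.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside Jupyter
import matplotlib.pyplot as plt

def plot_distance_distribution(distances):
    """Box plot (left) and histogram (right) of trip distances in miles."""
    fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
    ax_box.boxplot(distances)
    ax_box.set_xticks([])
    ax_box.set_ylabel("Trip distance (miles)")
    ax_box.set_title("Box plot")
    # One-mile-wide bins from 0 to 35 miles.
    ax_hist.hist(distances, bins=range(0, 36))
    ax_hist.set_xlabel("Trip distance (miles)")
    ax_hist.set_ylabel("Number of trips")
    ax_hist.set_title("Histogram")
    fig.tight_layout()
    return fig

# Made-up distances for illustration only.
sample = [0.8, 1.2, 1.5, 2.0, 2.4, 3.0, 3.1, 7.5, 12.0, 33.9]
fig = plot_distance_distribution(sample)
```

In a notebook the figure displays inline; elsewhere, `fig.savefig("distance.png")` writes it to disk.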

Figure: Trip Distance Distribution - Box Plot & Histogram

Figure: Fare Amount vs Trip Distance - Scatter Plot

Figure: Monthly Ride Count & Revenue

Figure: Daily Patterns - Ride Count & Revenue

Figure: Passenger Count Distribution & Impact on Tips

Figure: Top 20 Drop-off Locations by Trip Volume

Figure: Distance by Drop-off Location

Figure: Comprehensive Overview for Management

Figure: Accessibility-Friendly Version (for stakeholders with visual impairments)

Tools & Tech Stack

Language: Python
Libraries: Pandas, NumPy, Matplotlib, Seaborn, datetime
Environment: Jupyter Notebook
Visualization Tool: Tableau Public
Data Storage: CSV

Challenges & Limitations

Limited geographic context: Location IDs without coordinates
Data representativeness: Sample may not reflect full year patterns
Zero-passenger rides: 33 trips with 0 passengers require investigation
Vendor data imbalance: Uneven distribution between two vendors
Seasonal coverage: Potential gaps in monthly representation

Results & Business Impact

Clear fare determinants identified: Distance as primary predictor
Temporal patterns documented: Peak revenue days identified (Thursday)
Data quality assessment: Outliers and anomalies flagged for review
Visual foundation established: Dashboards for ongoing monitoring
Model readiness: Features identified for regression modeling
Operational insights: Low-utilization periods identified for optimization

Recommendations

Develop regression model using trip_distance, temporal features, and pickup/drop-off zones
Implement real-time fare estimator for mobile apps and taxi meters
Create driver guidance system for high-revenue time/location targeting
Establish data quality monitoring for zero-passenger and outlier trips
Expand geographic analysis with coordinates for pickup/drop-off locations
Develop seasonal adjustment factors for fare prediction model
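
As a starting point for the first recommendation, a one-feature baseline can be fit with ordinary least squares. The sketch below is an assumption-laden illustration, not the proposed model: the trips are synthetic and noise-free, the fare formula is invented, and `fit_fare_line` and `estimate_fare` are hypothetical helpers. Temporal and zone features would be added in a fuller model.

```python
import numpy as np

# Synthetic trips: an invented fare rule of 2.50 flat plus 2.50 per mile.
distance = np.array([0.5, 1.0, 2.0, 3.0, 5.0, 8.0])
fare = 2.5 + 2.5 * distance

def fit_fare_line(distance, fare):
    """Least-squares fit of fare = intercept + slope * distance."""
    slope, intercept = np.polyfit(distance, fare, deg=1)
    return slope, intercept

def estimate_fare(distance_miles, slope, intercept):
    """Predicted fare for a trip of the given length."""
    return intercept + slope * distance_miles

slope, intercept = fit_fare_line(distance, fare)
quote = estimate_fare(4.0, slope, intercept)
```

Because the synthetic data follows the invented rule exactly, the fit recovers it. On real trips the residuals from a baseline like this would show how much temporal and location features still need to explain.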

Key Learnings

Urban mobility patterns show clear temporal and spatial concentrations
Fare structure transparency is critical for customer trust and satisfaction
Data quality vigilance is essential when working with operational systems
Accessible visualization enables broader stakeholder engagement
Feature engineering (trip duration, temporal features) adds significant predictive power