< Iinitialize My Portfolio..._

An open empty notebook on a white desk next to an iPhone and a MacBook

📊 Data Analysis Case Study: TikTok Claim vs Opinion Video Classification.

TikTok’s mission is to inspire creativity and bring joy. With millions of user-generated videos, the platform receives a high volume of user reports flagging content as “claims” (factual assertions) or “opinions.” Moderators face a growing backlog, slowing response times and affecting user trust.

2/8/20263 min read

Exploratory data analysis and visualization to support automated content moderation.

Business Context / Overview

Background:
TikTok’s mission is to inspire creativity and bring joy. With millions of user-generated videos, the platform receives a high volume of user reports flagging content as “claims” (factual assertions) or “opinions.” Moderators face a growing backlog, slowing response times and affecting user trust.

Why this analysis was required:
To build a predictive model that automatically classifies videos, reducing manual review time and improving moderation efficiency.

Who benefits:

TikTok’s Content Moderation Team
Data Science and Operations Teams
End-users through faster, more accurate content handling

Problem Statement

TikTok’s manual review process for user-reported videos is inefficient and unsustainable due to volume. Without a clear, data-driven understanding of what distinguishes a “claim” from an “opinion,” the team cannot build an accurate classification model, leading to delayed moderation, potential misinformation spread, and user dissatisfaction.

Objectives

Perform comprehensive EDA to uncover patterns in video metrics between claims and opinions.
Identify key distinguishing features (e.g., view count, author status, engagement metrics) that could predict video type.
Deliver clear, accessible visualizations for both technical and non-technical stakeholders.
Prepare clean, analysis-ready data for future machine learning modeling.

Dataset Description

Source: Internal TikTok dataset (tiktok_dataset.csv)
Records: 19,382 videos
Features: 12 columns including:

claim_status (claim/opinion)
video_view_count, video_like_count, video_share_count
author_ban_status, verified_status
video_duration_sec, video_transcription_text
Time Period: Not specified (static snapshot)

Data Preparation & Cleaning

Handled missing values in claim_status and engagement metrics.
Verified and corrected data types for numerical vs. categorical fields.
Assessed outliers using IQR and median-based thresholds.
Standardized column names and ensured consistent formatting.

Exploratory Data Analysis (EDA)

Distribution analysis of video duration, views, likes, shares, downloads, and comments.
Comparison of claims vs. opinions across verified status and ban status.
Outlier detection for engagement metrics using non-parametric methods.
Correlation exploration between view count, likes, and claim status.

Analytical Approach

Statistical summaries using .describe() and .groupby() methods.
Visual distribution checks via boxplots and histograms.
Segmentation analysis by author ban status and verification status.
Threshold-based outlier identification using median + 1.5 * IQR.

Key Insights

Claim videos receive significantly higher median views (~501k) compared to opinion videos (~4,953).
Verified users are more likely to post opinions, while unverified users post more claims.
Authors under review or banned post more claim videos and receive disproportionately high view counts.
Engagement metrics (likes, shares, downloads) are highly right-skewed, with a small percentage of videos driving most engagement.

Visualizations & Reporting

Python (Matplotlib/Seaborn):
- Boxplots & histograms for all key metrics
- Claim vs. opinion bar charts by author status
- Scatter plots of views vs. likes colored by claim status

Video Type Distribution:

What percentage of videos are claims vs. opinions?

Engagement Metrics Analysis:

View Count Distribution by Video Type

Author Status Breakdown:

Correlation And Relationship Analysis :

Correlation Heatmap

Views vs Likes Scatter with Trendline:

Outlier Analysis:

Outlier Detection Across Metrics

Summary Dashboard:

Executive Summary Visualization

Accessibility-Friendly Version:

For stakeholders with visual impairments

Tools & Tech Stack

Language: Python
Libraries: Pandas, NumPy, Matplotlib, Seaborn
Environment: Jupyter Notebook
Visualization Tool: Tableau Public
Data Storage: CSV

Challenges & Limitations

Highly skewed engagement data required non-standard outlier treatment.
Missing values in key columns reduced usable sample size.
Text data (video_transcription_text) was not analyzed in this phase.
Domain knowledge required to interpret “ban status” and “verified status” accurately.

Results & Business Impact

Clear feature understanding now guides model feature selection.
Stakeholder alignment achieved through accessible Tableau dashboards.
Moderation team can now prioritize videos based on risk indicators (e.g., high-view claims from unverified authors).
Data pipeline readiness for building a classification model.

Recommendations

Build a binary classification model using view_count, author_ban_status, and verified_status as top features.
Implement automated outlier flagging for viral content to prevent model bias.
Expand analysis to text data (NLP) for transcription-based classification.
Create real-time dashboards for ongoing monitoring of claim vs. opinion trends.

Key Learnings

Business context is critical—understanding TikTok’s moderation needs shaped the analysis direction.
Right-skewed social media data requires tailored statistical approaches.
Visualization accessibility ensures insights are actionable for all stakeholders.
Clean, documented EDA accelerates downstream modeling and decision-making.