Why Stories Need Data

When I see several similar posts on social media, I often ask the question, "Is that a trend or is it just me?" (In fact, I did that recently, when I started seeing a lot of fairly gloomy AI-related headlines.) While it's tempting to make assumptions based on what's in your feed, or what Google has served up on your home page, data is often a better way to answer questions about what groups of people are feeling, saying, or doing.

In my work as a marketing writer, I also see data as anchoring the most valuable thought leadership content. Sure, you can sketch out industry trends based on observations and gut feelings. But data, especially when it reveals clear patterns, is what separates speculation from authority.

Why Build Your Own Custom Dataset?

Both in my personal writing and my work for clients, one of the biggest obstacles to writing a great piece is finding the right data. In many cases, interesting data and insights are locked behind paywalls. And the free datasets you can find on Kaggle and GitHub are often out-of-date or not quite what you want.

After spending hours searching the internet for relevant data, I looked into collecting and analyzing my own data as a more flexible and affordable way to test assumptions and flesh out stories.

Why This Approach Works Now

Recent advances in natural language processing have made large-scale text analysis a lot easier to conduct, even if you're not a machine learning expert. (If you're new to the subject and have a little exposure to programming with Python, Google's crash course is a good place to start.) Pre-trained models can now analyze sentiment, detect emotions, and categorize themes across hundreds of text blocks in minutes.

At the same time, today's public platforms contain unprecedented volumes of authentic user-generated content, and some of it can be accessed for no or low cost through APIs. This has opened up new opportunities to conduct meaningful analysis on topics where formal research doesn't exist or fails to capture the complete picture.

🎯 What You'll Learn

This guide provides a quick overview of what you need to build your own custom datasets and covers:

  • Getting AI assistance for technical setup and installation
  • Ethical data collection from public sources, with a focus on Reddit
  • Understanding Hugging Face models for sentiment analysis
  • Using AI to build analysis scripts that identify patterns in text

01 Get Technical Help from AI

Before creating your own custom datasets, you'll need to install Python and several libraries (i.e., collections of pre-written code) on your local machine. Installation processes vary significantly between Windows, Mac, and different system configurations. Fortunately, AI assistants like ChatGPT or Claude can provide personalized instructions based on your exact setup and troubleshoot errors in real time. And they have nearly infinite patience for beginners.

In fact, following instructions from ChatGPT is how I got started with Python last year when I was interested in running scripts and learning about classification models.

Comprehensive Setup Prompt

Use this prompt with ChatGPT, Claude, or another AI assistant to get step-by-step installation guidance:

Complete Setup Prompt

I need help setting up a Python environment for data collection and sentiment analysis. My system is [Windows 10/11 OR Mac OR Linux]. I'm a [beginner/intermediate] with programming.

Please provide step-by-step instructions to install:
1. Python (latest stable version)
2. These libraries: pandas, requests, beautifulsoup4, praw (for Reddit), transformers, torch
3. A code editor recommendation for beginners

After installation, please provide a simple test script to verify everything works correctly. If I encounter any errors during installation, I'll share the error message with you for troubleshooting.

Also include:
- How to install packages using pip
- How to run Python scripts from the command line
- Best practices for managing Python environments

Format your response with clear numbered steps and code blocks I can copy/paste.
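
To give you a sense of what the verification step might look like, here's a minimal test script you could run once installation is done. It simply checks that each library imports and prints its version; note that beautifulsoup4 installs under the import name bs4.

```python
# verify_setup.py — confirm the key libraries installed correctly
import sys

# beautifulsoup4 is imported as "bs4"
packages = ["pandas", "requests", "bs4", "praw", "transformers", "torch"]

print(f"Python version: {sys.version.split()[0]}")
for name in packages:
    try:
        module = __import__(name)
        version = getattr(module, "__version__", "unknown version")
        print(f"OK       {name} ({version})")
    except ImportError:
        print(f"MISSING  {name} -- try: pip install {name}")
```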

02 Stay Out of Trouble

One of my biggest concerns when getting started with data collection was staying compliant with providers' terms of service (TOS). Most platforms have specific rules about automated data collection, and violating these can result in legal issues or account bans. I'm always careful to check these rules before gathering any data, and you should be, too. 😄

Reddit: The Ideal Starting Point

Reddit is a great place to start collecting data since it offers clear legal guidelines for researchers as well as an official API (Application Programming Interface). It's also free to use if you're collecting no more than 1,000 post records a day and you respect rate limits (i.e., don't pull down too many records per minute).

  • Reddit Advantages: Free API access, clear terms of service, organized by topic (subreddits), rich discussion content
  • Reddit Requirements: Must register application, respect rate limits (100 requests per minute), cannot collect private messages
  • Reddit Restrictions: No selling of data, no harassment of users, must respect user deletions

Reddit's Terms of Service: What You Need to Know

Reddit's API Terms explicitly allow research and analysis but with important restrictions:

  • Rate Limiting: Maximum 100 requests per minute to prevent server overload
  • Attribution: Data must be used for analysis, not re-publication without context
  • Commercial Use: Selling raw Reddit data is prohibited
  • Privacy: Cannot collect private messages, even if you have access
  • User Respect: Must honor user account deletions and content removal

Other Platform Considerations

  • Twitter/X: Now requires paid API access
  • LinkedIn: Explicitly prohibits automated data collection
  • Instagram/Facebook: Heavily restricted API with limited research access
  • TikTok: No public research API available

Setting Up Reddit API Access

To get API credentials that will give you access to Reddit data, you'll need to register as a Reddit developer. This is free and straightforward:

1. Create Reddit Account: If you don't have one, sign up at reddit.com
2. Register Application: Go to https://www.reddit.com/prefs/apps/ and create a new "script" application
3. Get Credentials: Save your client ID (under app name) and client secret (marked "secret")

Create Your Reddit Data Collection Script

Once you have a Reddit Developer account, it's time to start thinking about what kind of data you'd like to collect. If you want to stay under the 1,000 records a day limit for free access and still conduct a meaningful analysis, it's important to be strategic about your search terms and subreddit selection.

In my experience, the most effective approach is to focus on 3-5 relevant subreddits and use specific keywords that will surface posts where people are actually discussing your topic in detail, rather than just mentioning it in passing.

Use this prompt to get a custom Reddit data collection script:

Reddit Collection Prompt

Please create a Python script to collect Reddit data using the PRAW library. Here are my requirements:

**Data Collection Goals:**
- Topic I'm researching: [YOUR TOPIC]
- Subreddits to search: [LIST SUBREDDITS]
- Keywords to search for: [YOUR KEYWORDS]
- Time period: [past week/month/year]
- Approximate number of posts needed: [100/500/1000]

**Script Requirements:**
1. Use PRAW library for Reddit API access
2. Include proper rate limiting (60 requests/minute)
3. Collect: post title, content, score, comment count, date, subreddit
4. Save data to CSV file with proper formatting
5. Include error handling for API limits and network issues
6. Add privacy protection (anonymize usernames)
7. Collect no more than 1,000 posts. Stay within free limits.

**Technical Details:**
- Include setup for Reddit API credentials (client_id, client_secret, user_agent)
- Add progress indicators so I can see collection status
- Include data cleaning (remove deleted/removed posts)
- Filter out posts shorter than 50 characters

Please provide:
1. Complete Python script with comments
2. Instructions for setting up Reddit API credentials
3. Example of how to run the script
4. Common error messages and solutions
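
For reference, here's a trimmed-down sketch of the kind of script that prompt should produce. The credentials are placeholders, and the subreddits, keyword, and output filename are illustrative examples only; your AI assistant's version will be longer and tailored to your topic.

```python
# collect_reddit.py — minimal keyword search across a few subreddits with PRAW
import csv
import praw

# Placeholder credentials from https://www.reddit.com/prefs/apps/
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="research-script by u/YOUR_USERNAME",
)

subreddits = ["marketing", "smallbusiness", "Entrepreneur"]  # example picks
query = "AI tools"                                           # example keyword
rows = []

for name in subreddits:
    # PRAW handles Reddit's rate limiting for you; limit keeps the total modest
    for post in reddit.subreddit(name).search(query, time_filter="month", limit=200):
        if post.selftext in ("[deleted]", "[removed]") or len(post.selftext) < 50:
            continue  # skip removed or very short posts
        rows.append({
            "subreddit": name,
            "title": post.title,
            "content": post.selftext,
            "score": post.score,
            "num_comments": post.num_comments,
            "created_utc": post.created_utc,
        })

if rows:
    with open("reddit_posts.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

print(f"Collected {len(rows)} posts")
```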

Ethical Data Collection Principles

  • Public Only: Collect only from truly public posts, not private groups or messages
  • Respect Rate Limits: Don't overwhelm servers with rapid requests
  • Anonymization: Remove or hash usernames before analysis (see the sketch after this list)
  • Purpose Limitation: Use data only for stated research purposes
  • Data Security: Store collected data securely and delete when no longer needed
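
To make the anonymization point concrete, here's a minimal sketch of one way to pseudonymize authors before analysis. It assumes your CSV has an "author" column, which the collection sketch above doesn't include by default, so treat the column and file names as placeholders.

```python
# anonymize.py — replace usernames with one-way hashes before analysis
import hashlib
import pandas as pd

df = pd.read_csv("reddit_posts.csv")  # placeholder filename

def pseudonymize(name: str) -> str:
    # SHA-256 gives a stable pseudonym, so you can still group posts by author
    return hashlib.sha256(str(name).encode("utf-8")).hexdigest()[:12]

df["author"] = df["author"].apply(pseudonymize)
df.to_csv("reddit_posts_anonymized.csv", index=False)
```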

03 Analyze Your Dataset

Once you have your text data as a CSV file, you'll be able to open it in Excel, search for keywords, and run some basic analysis. However, if you want to classify, say, 500 or more posts as having a particular emotional tone, you'll save a lot of time by using machine learning models. This is where Hugging Face comes in handy.

What is Hugging Face?

Hugging Face is a platform that hosts thousands of pre-trained machine-learning models. Instead of spending months training your own AI to understand sentiment, you can use models that have already been trained on millions of text examples. These models already "know" how to read emotions, detect topics, and understand language nuances.


🤖 Pre-trained vs Custom Models

Training a sentiment analysis model from scratch would require thousands of labeled examples and weeks of computational time. Pre-trained models give you professional-grade analysis capabilities immediately.
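
If you're curious what "using a pre-trained model" actually looks like in code, here's about the smallest possible example with the transformers library; the sample sentence is made up.

```python
# sentiment_demo.py — load a pre-trained sentiment model and score one sentence
from transformers import pipeline

# Downloads the model on first run, then caches it locally
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

result = classifier("I didn't expect to like this tool, but it saved me hours.")
print(result)  # e.g., [{'label': 'positive', 'score': 0.97}]
```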

What is Sentiment Analysis?

Sentiment analysis is the process of determining the emotional tone behind text. It answers questions like "Is this post positive, negative, or neutral?" and "How confident are we in that assessment?"

Modern sentiment analysis goes beyond simple positive/negative classifications. Advanced models can detect:

  • Basic Sentiment: Positive, Negative, Neutral
  • Emotions: Joy, anger, fear, surprise, sadness, disgust
  • Complex Emotions: Frustration, excitement, skepticism, hope
  • Intensity: How strongly the emotion is expressed
  • Mixed Emotions: When text contains multiple conflicting sentiments

Types of Sentiment Analysis Models

  • Basic Sentiment: Models like "cardiffnlp/twitter-roberta-base-sentiment-latest." Fast and reliable for positive/negative/neutral classification.
  • Emotion Detection: Models like "SamLowe/roberta-base-go_emotions." Detect specific emotions including joy, anger, curiosity, optimism.
  • Industry-Specific: Models trained on financial, medical, or product review text. Better accuracy for specialized domains.
  • Multilingual: Models that work across multiple languages. Useful for global data collection.

Choosing the Right Model

For your first project, you'll probably want to start with a model that's proven to deliver reliable results. Here are a few examples:

For General Business Analysis: Consider starting with "cardiffnlp/twitter-roberta-base-sentiment-latest". This model, which was downloaded more than 2 million times last month, works well for social media posts, customer feedback, and general business content. It's fast and provides reliable positive/negative/neutral classifications with confidence scores.

For Detailed Emotional Analysis: Consider starting with "SamLowe/roberta-base-go_emotions". This model, which was downloaded more than 500K times last month, can detect 28 different emotions including admiration, amusement, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, and neutral.

For Product/Service Reviews: Consider starting with "nlptown/bert-base-multilingual-uncased-sentiment". Specifically trained on product reviews, this model, which was downloaded more than 1.5 million times last month, understands customer satisfaction nuances and rates sentiment on a 1-5 star scale.

Building Sentiment Analysis Scripts

After selecting your analysis goals and choosing between basic sentiment detection or more granular emotion analysis, you'll need a script that will pass your dataset to your model. Depending on your model, you may also need to install another Python library or two.

Here are prompts to create different types of analysis scripts:

Basic Sentiment Analysis Prompt

Please build a Python script that performs sentiment analysis on a CSV file of text data. Requirements:

**Input Data:**
- CSV file with columns: [specify your column names, e.g., "content", "title", "date"]
- Text column to analyze: [specify which column contains the text]
- Approximately [NUMBER] rows of data

**Analysis Requirements:**
1. Use Hugging Face transformers library
2. Use the "cardiffnlp/twitter-roberta-base-sentiment-latest" model
3. Analyze [text column] and add new columns for:
   - sentiment_label (POSITIVE/NEGATIVE/NEUTRAL)
   - sentiment_score (confidence level 0-1)
4. Handle empty/short text gracefully
5. Process data in batches to avoid memory issues
6. Save results to new CSV file

**Additional Features:**
- Progress bar showing analysis status
- Error handling for problematic text
- Summary statistics (% positive, negative, neutral)
- Basic visualization of results (bar chart)

**Output:**
- Complete Python script with comments
- Instructions for installing required packages
- Example of running the script
- Explanation of output columns
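
As a point of reference, here's a condensed sketch of the kind of script that prompt should generate. The file name ("reddit_posts.csv") and text column ("content") are assumptions; the generated version will add a progress bar, charts, and more thorough error handling.

```python
# analyze_sentiment.py — batch sentiment analysis over a CSV of text
import pandas as pd
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

df = pd.read_csv("reddit_posts.csv")                  # placeholder filename
texts = df["content"].fillna("").astype(str).tolist()

labels, scores = [], []
for start in range(0, len(texts), 32):                # small batches keep memory in check
    batch = texts[start:start + 32]
    for result in classifier(batch, truncation=True): # truncate very long posts
        labels.append(result["label"])
        scores.append(round(result["score"], 3))

df["sentiment_label"] = labels
df["sentiment_score"] = scores
df.to_csv("reddit_posts_with_sentiment.csv", index=False)

# Quick summary: share of positive / negative / neutral posts
print(df["sentiment_label"].value_counts(normalize=True).round(2))
```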

Advanced Emotion Analysis Prompt

Create a Python script for detailed emotion analysis using Hugging Face models. Specifications:

**Input:**
- CSV file with text data from [Reddit/reviews/social media]
- Text column: [column name]
- Minimum text length for analysis: 20 characters

**Emotion Analysis:**
1. Use "SamLowe/roberta-base-go_emotions" model
2. Extract top 3 emotions for each text with confidence scores
3. Categorize emotions into groups:
   - Positive: joy, excitement, optimism, love, admiration, approval, caring, gratitude, pride, relief
   - Negative: anger, annoyance, disappointment, disapproval, disgust, embarrassment, fear, grief, nervousness, remorse, sadness
   - Neutral: neutral, realization, confusion, curiosity, surprise
   - Uncertain: amusement, desire

**Output Columns:**
- primary_emotion, primary_confidence
- secondary_emotion, secondary_confidence
- tertiary_emotion, tertiary_confidence
- emotion_category (positive/negative/neutral/uncertain)
- mixed_emotions (boolean - if multiple strong emotions detected)

**Analysis Features:**
- Batch processing for efficiency
- Summary report with top emotions and categories
- Export results to CSV and generate emotion distribution chart
- Handle edge cases (very short text, non-English text)

Please include complete code with error handling and clear documentation.
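
Here's a small illustration of the top-3 idea using the go_emotions model; the post text is invented, and in a real run you'd loop over your CSV just like the sentiment sketch above.

```python
# emotion_demo.py — top three emotions for a single post
from transformers import pipeline

emotion_classifier = pipeline(
    "text-classification",
    model="SamLowe/roberta-base-go_emotions",
    top_k=3,  # return the three highest-scoring emotions instead of just one
)

post = "I was skeptical at first, but this new tool has been a pleasant surprise."
for emotion in emotion_classifier([post])[0]:
    print(f"{emotion['label']}: {emotion['score']:.2f}")
```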

Time-Series Sentiment Analysis Prompt

Build a script that analyzes how sentiment changes over time. Requirements:

**Data Structure:**
- CSV with text data and timestamps
- Date column: [column name] in format [YYYY-MM-DD or timestamp]
- Text column: [column name]
- Optional grouping column: [subreddit/source/category]

**Analysis Goals:**
1. Perform sentiment analysis on text data
2. Group results by time periods: daily, weekly, monthly
3. Calculate sentiment trends over time
4. Identify significant changes or events
5. Compare sentiment across different groups/sources

**Specific Outputs:**
- Time series data showing sentiment percentages over time
- Trend analysis (improving/declining sentiment)
- Peak positive and negative periods
- Moving averages (7-day, 30-day)
- Statistical significance of changes

**Visualizations:**
- Line chart of sentiment over time
- Stacked bar chart showing positive/negative/neutral by period
- Heatmap if multiple groups are compared
- Highlight periods with significant sentiment changes

**Technical Requirements:**
- Use pandas for date handling and grouping
- Include statistical tests for trend significance
- Handle irregular time intervals in data
- Export results as CSV and generate publication-ready charts

Please provide complete code with sample data processing and clear explanations of the analysis methods.
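
If you'd rather see the core of the time-series step, here's a sketch of just the grouping logic. It assumes you've already run sentiment analysis and have "sentiment_label" and "created_utc" columns; both names, and the filename, are placeholders.

```python
# sentiment_over_time.py — weekly sentiment shares from an analyzed CSV
import pandas as pd

df = pd.read_csv("reddit_posts_with_sentiment.csv")   # placeholder filename

# Reddit timestamps are Unix epoch seconds; convert to datetimes
df["date"] = pd.to_datetime(df["created_utc"], unit="s")

# Count posts per week and sentiment label, then convert to shares
weekly = (
    df.groupby([pd.Grouper(key="date", freq="W"), "sentiment_label"])
      .size()
      .unstack(fill_value=0)
)
weekly_share = weekly.div(weekly.sum(axis=1), axis=0).round(2)

print(weekly_share)
weekly_share.to_csv("weekly_sentiment_share.csv")
```
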
💡 Model Performance Reality Check

Sentiment analysis models typically achieve 75-85% accuracy on real-world data. They struggle with sarcasm, context-dependent meaning, and domain-specific language. Always manually review a sample of results, perhaps 50 or 100 records, to get a feel for your model's accuracy for your specific use case.
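
One easy way to do that spot check is to export a random sample to a separate file and read through it yourself; this sketch assumes the analyzed CSV and column names from the earlier examples.

```python
# spot_check.py — export a random sample for manual accuracy review
import pandas as pd

df = pd.read_csv("reddit_posts_with_sentiment.csv")   # placeholder filename
sample = df.sample(n=min(50, len(df)), random_state=42)
sample[["content", "sentiment_label", "sentiment_score"]].to_csv(
    "manual_review_sample.csv", index=False
)
print(f"Wrote {len(sample)} rows for manual review")
```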

04 Dos and Don'ts for Data Collection

Technical Best Practices

✅ DO
Manually review a random sample (50-100 posts) to understand your model's accuracy for your specific content and use case.
❌ DON'T
Assume sentiment models are 100% accurate. They typically achieve 70-85% accuracy and struggle with sarcasm.
✅ DO
Implement delays between API requests and respect rate limits (Reddit: max 100 requests/minute).
❌ DON'T
Make rapid-fire API requests. This will get you temporarily banned from most platforms.

Legal and Ethical Guidelines

✅ DO
Focus on sentiment patterns, aggregated insights, and anonymized data in your analysis.
❌ DON'T
Republish large portions of original posts. This violates copyright and can harm users.
✅ DO
Remove or anonymize usernames, hash personal identifiers, and protect individual privacy.
❌ DON'T
Include real usernames or personal details in published analysis, even from "public" posts.
✅ DO
Check platform terms of service and use official APIs when available.
❌ DON'T
Scrape data from platforms that explicitly prohibit it (LinkedIn, Instagram, Facebook).

Understanding Data Bias

Sample Bias: Reddit users skew younger, more technical, and more politically liberal than the general population. Different subreddits have distinct cultural norms and demographics. Always acknowledge these limitations explicitly and avoid overgeneralizing findings to broader populations.

Temporal Bias: Social media discussions can be heavily influenced by recent news events, viral posts, or trending topics that may not represent long-term sentiment. Collect data over longer time periods and identify potential external influences that might skew your results.

Best Practice: Present your findings as "insights from [specific communities]" rather than universal truths about entire demographics or markets.

05 When to Build vs. Buy

At this point, you might be wondering whether it's worth building your own data collection system or just paying for professional research. The answer depends on your specific needs, timeline, and resources. Custom data collection isn't always the right choice, but when it is, it can provide unique insights that aren't available anywhere else.

The key is understanding when the DIY approach makes strategic sense versus when you're better off investing in professional services or existing datasets.

DIY Might Make Sense When:

  • You need data on emerging topics where formal research doesn't exist
  • You have basic technical comfort and time to learn
  • The data collection scope is manageable (thousands, not millions of records)
  • Your research questions are highly specific to a particular industry or niche

Consider Professional Help When:

  • Legal compliance requirements are complex (healthcare, finance, etc.)
  • You need real-time data processing or advanced statistical analysis
  • The project timeline is tight and technical troubleshooting isn't feasible
  • Data sources require sophisticated scraping techniques or specialized access

🔧 Hybrid Approach

Many projects benefit from DIY data collection combined with professional analysis. Collect the raw data using these techniques, then engage a data scientist to select and run models and interpret the results.

🎯 From Questions to Insights

Remember that question from the beginning: "Is that a trend or is it just me?" Custom data collection gives you a way to move beyond speculation and gut feelings to find real patterns in how people discuss and experience topics.

The approach outlined in this guide won't give you the statistical rigor of academic research, but it will help you identify signals in the noise. Whether you're trying to understand how customers really feel about a product category, track emerging concerns in your industry, or spot opportunities that haven't hit mainstream research yet, these techniques can provide insights that will allow you to write with authority.

The key is starting small and frequently checking your results. If one particular model is struggling to accurately interpret your data, consider choosing another to test. Most importantly, always be transparent about your methodology and limitations—good analysis acknowledges what it can and cannot prove.

ABOUT THE AUTHOR
Karen Spinner is a B2B content strategist and founder of Good Content. She helps tech companies create research-backed content through a human-led, AI-enhanced creative process. When existing data doesn't tell the full story, she uses machine learning combined with custom datasets to uncover insights that spark meaningful conversations.
Outside of work, Karen is curious about how AI is changing professional and personal experiences in ways that go beyond typical business case studies.