Why Stories Need Data
When I see several similar posts on social media, I often ask the question, "Is that a trend or is it just me?" (In fact, I did that recently, when I started seeing a lot of fairly gloomy AI-related headlines.) While it's tempting to make assumptions based on what's in your feed, or what Google has served up on your home page, data is often a better way to answer questions about what groups of people are feeling, saying, or doing.
In my work as a marketing writer, I also see data as anchoring the most valuable thought leadership content. Sure, you can sketch out industry trends based on observations and gut feelings. But data, especially when it reveals clear patterns, is what separates speculation from authority.
Why Build Your Own Custom Dataset?
Both in my personal writing and my work for clients, one of the biggest obstacles to writing a great piece is finding the right data. In many cases, interesting data and insights are locked behind paywalls. And the free datasets you can find on Kaggle and GitHub are often out-of-date or not quite what you want.
After spending hours searching the internet for relevant data, I looked into collecting and analyzing my own data as a more flexible and affordable way to test assumptions and flesh out stories.
Why This Approach Works Now
Recent advances in natural language processing have made large-scale text analysis a lot easier to conduct, even if you're not a machine learning expert. (If you're new to the subject and have a little exposure to programming with Python, Google's crash course is a good place to start.) Pre-trained models can now analyze sentiment, detect emotions, and categorize themes across hundreds of text blocks in minutes.
At the same time, today's public platforms contain unprecedented volumes of authentic user-generated content, and some of it can be accessed for no or low cost through APIs. This has opened up new opportunities to conduct meaningful analysis on topics where formal research doesn't exist or fails to capture the complete picture.
What You'll Learn
This guide provides a quick overview of what you need to build your own custom datasets and covers:
- Getting AI assistance for technical setup and installation
- Ethical data collection from public sources, with a focus on Reddit
- Understanding Hugging Face models for sentiment analysis
- Using AI to build analysis scripts that identify patterns in text
01 Get Technical Help from AI
Before creating your own custom datasets, you'll need to install Python and several libraries (i.e., collections of pre-written code) on your local machine. Installation processes vary significantly between Windows, Mac, and different system configurations. Fortunately, AI assistants like ChatGPT or Claude can provide personalized instructions based on your exact setup and troubleshoot errors in real time. And they have nearly infinite patience for beginners.
In fact, following instructions from ChatGPT is how I got started with Python last year when I was interested in running scripts and learning about classification models.
Comprehensive Setup Prompt
Use this prompt with ChatGPT, Claude, or another AI assistant to get step-by-step installation guidance:
Complete Setup Prompt
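If you want a quick way to confirm the installation worked, here's a minimal sketch that checks for the libraries used later in this guide. The package list (praw, pandas, transformers) is an assumption; adjust it to match whatever your assistant recommended for your setup.

```python
import importlib

# Sanity check: confirm the libraries used later in this guide are installed.
# The package list is an assumption -- edit it to match your own setup.
for package in ["praw", "pandas", "transformers"]:
    try:
        module = importlib.import_module(package)
        print(f"{package} {getattr(module, '__version__', '(installed)')}")
    except ImportError:
        print(f"{package} is missing -- try: pip install {package}")
```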
02 Stay Out of Trouble
One of my biggest concerns getting started with data collection was staying compliant with providers' terms of service (TOS). Most platforms have specific rules about automated data collection, and violating these can result in legal issues or account bans. I'm always careful to check these rules before gathering any data, and you should be, too. 😄
Reddit: The Ideal Starting Point
Reddit is a great place to start collecting data since it offers clear legal guidelines for researchers as well as an official API (Application Programming Interface). It's also free to use if you respect rate limits (i.e., don't pull down too many records per minute).
Reddit's Terms of Service: What You Need to Know
Reddit's API TOS explicitly allow research and analysis, but with some restrictions:
- Rate Limiting: Maximum 100 requests per minute to prevent server overload
- User Respect: Must honor user account deletions and content removal
- Attribution: Data must be used for analysis, not re-publication without context
- Commercial Use: Selling raw Reddit data is prohibited
- Privacy: Cannot collect private messages, even if you have access
Reminder: Delete Your Reddit Data Promptly
One important thing to keep in mind is that you must delete any data in your possession that has been deleted from Reddit, including posts, comments, user IDs, etc. Since users may delete their data at any time, you should plan to delete your copy of the data right after you complete your analysis.
Reddit's Data API Wiki suggests deleting data within 48 hours of collecting it.
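If you'd like to automate that cleanup, here's a minimal sketch that removes CSV files older than 48 hours. The data/ folder name is hypothetical; point it at wherever your collection script saves its output.

```python
import time
from pathlib import Path

# Delete collected CSV files older than 48 hours. The "data" folder is a
# hypothetical location -- point this at wherever your script saves output.
MAX_AGE_SECONDS = 48 * 60 * 60

for csv_file in Path("data").glob("*.csv"):
    age_seconds = time.time() - csv_file.stat().st_mtime
    if age_seconds > MAX_AGE_SECONDS:
        csv_file.unlink()
        print(f"Deleted {csv_file} ({age_seconds / 3600:.1f} hours old)")
```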
Other Platform Considerations
- Twitter/X: Now requires paid API access
- LinkedIn: Explicitly prohibits automated data collection
- Instagram/Facebook: Heavily restricted API with limited research access
- TikTok: No public research API available
Setting Up Access to Reddit's Public API
To get access to Reddit data, you'll need to register the Python script you will use to collect it. This is free and straightforward:
Create Reddit Account: If you don't have one, sign up at reddit.com
Register Application: Go to https://www.reddit.com/prefs/apps/ and create a new "script" application. Do not choose "app," which requires formal API authorization from Reddit.
Get Credentials: Save your client ID (under app name) and client secret (marked "secret")
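To see what those credentials are for, here's a minimal sketch of how they're typically passed to PRAW, the Python library introduced in the next step. The placeholder strings are assumptions you'll replace with your own values.

```python
import praw

# Connect to Reddit's public API in read-only mode using the credentials
# from your "script" application. The placeholder values are assumptions --
# replace them with your own client ID, secret, and username.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # shown under the app name
    client_secret="YOUR_CLIENT_SECRET",  # marked "secret"
    user_agent="trend-research script by u/YOUR_USERNAME",
)

print(reddit.read_only)  # True means the client is configured for read-only access
```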
Identify Your Data Collection Goals
Since the free public API is a shared resource, it's important to be strategic about your search terms and subreddit selection. Generally speaking, you'll want to collect no more than a few thousand posts or comments.
In my experience, the most effective approach is to focus on 3-5 relevant subreddits and use specific keywords that will surface posts where people are actually discussing your topic in detail, rather than just mentioning it in passing.
Create Your Reddit Data Collection Script
Once you've identified some keywords and subreddits, you can create your data collection script. The script will use PRAW (the Python Reddit API Wrapper) to interact with Reddit's API. PRAW allows developers and researchers to easily access Reddit data, including posts, comments, user info, and subreddit content, without needing to manually construct API requests.
PRAW can scrape posts from specific subreddits, perform keyword or full-text search, and analyze comment threads. It also includes internal logic to help prevent violations of Reddit’s API rules.
Use the prompt shown below with your favorite chatbot to write your data collection script. Save the script with the .py extension, and don't include any spaces in the file name.
Reddit Collection Prompt
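The prompt will generate a script tailored to your topic, but to give you a sense of what it should produce, here's a minimal sketch that searches a few subreddits for a keyword and saves the results to a CSV file. The subreddit names, keyword, limits, and file name are all assumptions.

```python
import praw
import pandas as pd

# Hypothetical search parameters -- swap in your own subreddits and keywords.
SUBREDDITS = ["smallbusiness", "marketing", "Entrepreneur"]
QUERY = "AI tools"
LIMIT_PER_SUBREDDIT = 200  # keeps the total collection in the low thousands

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="trend-research script by u/YOUR_USERNAME",
)

rows = []
for name in SUBREDDITS:
    for post in reddit.subreddit(name).search(QUERY, limit=LIMIT_PER_SUBREDDIT):
        rows.append({
            "subreddit": name,
            "author": str(post.author),  # hashed later, before analysis
            "title": post.title,
            "text": post.selftext,
            "score": post.score,
            "created_utc": post.created_utc,
        })

pd.DataFrame(rows).to_csv("reddit_posts.csv", index=False)
print(f"Saved {len(rows)} posts to reddit_posts.csv")
```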
Ethical Data Collection Principles
- Public Only: Collect only from truly public posts, not private groups or messages
- Respect Rate Limits: Don't overwhelm servers with rapid requests
- Anonymization: Remove or hash usernames before analysis (see the sketch after this list)
- Purpose Limitation: Use data only for stated research purposes
- Data Security: Store collected data securely and delete when no longer needed
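For the anonymization principle above, a one-way hash is usually enough for informal research. Here's a minimal sketch, assuming your CSV has an author column; the column name and salt value are assumptions to adapt to your own dataset.

```python
import hashlib
import pandas as pd

# Replace raw usernames with a one-way hash so individual accounts can't be
# identified in your analysis files. The "author" column name and the salt
# value are assumptions -- adapt them to your dataset.
SALT = "choose-a-long-random-string"

df = pd.read_csv("reddit_posts.csv")
df["author"] = df["author"].astype(str).apply(
    lambda name: hashlib.sha256((SALT + name).encode()).hexdigest()[:12]
)
df.to_csv("reddit_posts_anonymized.csv", index=False)
```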
03 Analyze Your Dataset
After running your script, you will see a .CSV file in the directory where your script is stored. You'll be able to open it in Excel, search for keywords, and run some basic analysis. However, if you want to classify, say, 500 or more posts as having a particular emotional tone, you'll save a lot of time by using machine learning models. This is where Hugging Face comes in handy.
What is Hugging Face?
Hugging Face is a platform that hosts thousands of pre-trained machine-learning models. Instead of spending months training your own AI to understand sentiment, you can use models that have already been trained on millions of text examples. These models already "know" how to read emotions, detect topics, and understand language nuances.
Pre-trained vs Custom Models
Training a sentiment analysis model from scratch would require thousands of labeled examples and weeks of computational time. Pre-trained models give you professional-grade analysis capabilities immediately.
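To see how little code a pre-trained model requires, here's a minimal sketch using the Hugging Face transformers pipeline. It loads a general-purpose sentiment model just to illustrate the mechanics; specific model choices are covered below.

```python
from transformers import pipeline

# Load a pre-trained sentiment model. Calling pipeline() without a model
# argument downloads a general-purpose English sentiment model on first use;
# the specific models discussed below can be passed via the model= parameter.
classifier = pipeline("sentiment-analysis")

results = classifier([
    "This tool completely changed how I work.",
    "Support never answered my ticket.",
])
print(results)  # a list of dicts, each with a label and a confidence score
```

The same classifier object can be reused in a loop or called on a whole list of posts, which is what makes it practical for hundreds of records at a time.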
What is Sentiment Analysis?
Sentiment analysis is the process of determining the emotional tone behind text. It answers questions like "Is this post positive, negative, or neutral?" and "How confident are we in that assessment?"
Modern sentiment analysis goes beyond simple positive/negative classifications. Advanced models can detect:
- Basic Sentiment: Positive, Negative, Neutral
- Emotions: Joy, anger, fear, surprise, sadness, disgust
- Complex Emotions: Frustration, excitement, skepticism, hope
- Intensity: How strongly the emotion is expressed
- Mixed Emotions: When text contains multiple conflicting sentiments
Types of Sentiment Analysis Models
Choosing the Right Model
For your first project, you'll probably want to start with a model that's proven to deliver reliable results. Here are a few examples:
For General Business Analysis: Consider starting with "cardiffnlp/twitter-roberta-base-sentiment-latest". This model, which was downloaded more than 2 million times last month, works well for social media posts, customer feedback, and general business content. It's fast and provides reliable positive/negative/neutral classifications with confidence scores.
For Detailed Emotional Analysis: Consider starting with "SamLowe/roberta-base-go_emotions". This model, which was downloaded more than 500K times last month, can detect 28 emotion labels, including admiration, amusement, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, and neutral.
For Product/Service Reviews: Consider starting with "nlptown/bert-base-multilingual-uncased-sentiment". Specifically trained on product reviews, this model, which was downloaded more than 1.5 million times last month, understands customer satisfaction nuances and rates sentiment on a 1-5 star scale.
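Whichever model you pick, the loading pattern is the same. Here's a minimal sketch using the emotion model described above; the example text and the number of labels printed are assumptions.

```python
from transformers import pipeline

# Load the emotion model mentioned above. top_k=None returns a score for
# every emotion label rather than just the single most likely one.
emotion_classifier = pipeline(
    "text-classification",
    model="SamLowe/roberta-base-go_emotions",
    top_k=None,
)

scores = emotion_classifier(["I can't believe they discontinued my favorite feature."])[0]
for item in sorted(scores, key=lambda s: s["score"], reverse=True)[:5]:
    print(f"{item['label']}: {item['score']:.2f}")
```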
Building Sentiment Analysis Scripts
After selecting your analysis goals and choosing between basic sentiment detection or more granular emotion analysis, you'll need a script that will pass your dataset to your model. Depending on your model, you may also need to install another Python library or two.
Here are prompts to create different types of analysis scripts:
Basic Sentiment Analysis Prompt
Advanced Emotion Analysis Prompt
Time-Series Sentiment Analysis Prompt
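The prompts above will produce scripts tailored to your dataset, but the core loop usually looks something like this minimal sketch, which scores each post in the collected CSV and saves the results. The file names, column names, and 500-character cutoff are assumptions; match them to whatever your collection script actually produced.

```python
import pandas as pd
from transformers import pipeline

# Run basic sentiment analysis over the collected posts. File names, column
# names, and the character cutoff are assumptions -- adjust them to your data.
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

df = pd.read_csv("reddit_posts_anonymized.csv")
texts = (df["title"].fillna("") + " " + df["text"].fillna("")).str.slice(0, 500).tolist()

results = classifier(texts, truncation=True, batch_size=16)
df["sentiment"] = [r["label"] for r in results]
df["confidence"] = [round(r["score"], 3) for r in results]

df.to_csv("reddit_posts_with_sentiment.csv", index=False)
print(df["sentiment"].value_counts())
```

You can open the output file in Excel to spot-check a sample of rows against the model's labels, which ties directly into the accuracy caveat below.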
Sentiment analysis models typically achieve 75-85% accuracy on real-world data. They struggle with sarcasm, context-dependent meaning, and domain-specific language. Always manually review a sample of results, perhaps 50 or 100 records, to get a feel for your model's accuracy for your specific use case.
04 Dos and Don'ts for Data Collection
Technical Best Practices
Legal and Ethical Guidelines
Understanding Data Bias
Sample Bias: Reddit users skew younger, more technical, and more politically liberal than the general population. Different subreddits have distinct cultural norms and demographics. Always acknowledge these limitations explicitly and avoid overgeneralizing findings to broader populations.
Temporal Bias: Social media discussions can be heavily influenced by recent news events, viral posts, or trending topics that may not represent long-term sentiment. Collect data over longer time periods and identify potential external influences that might skew your results.
Best Practice: Present your findings as "insights from [specific communities]" rather than universal truths about entire demographics or markets.
05 When to Build vs. Buy
At this point, you might be wondering whether it's worth building your own data collection system or just paying for professional research. The answer depends on your specific needs, timeline, and resources. Custom data collection isn't always the right choice, but when it is, it can provide unique insights that aren't available anywhere else.
The key is understanding when the DIY approach makes strategic sense versus when you're better off investing in professional services or existing datasets.
DIY Might Make Sense When:
- You need data on emerging topics where formal research doesn't exist
- You have basic technical comfort and time to learn
- The data collection scope is manageable (thousands, not millions of records)
- Your research questions are specific to a particular industry or niche
Consider Professional Help When:
- Legal compliance requirements are complex (healthcare, finance, etc.)
- You need real-time data processing or advanced statistical analysis
- The project timeline is tight and technical troubleshooting isn't feasible
- Data sources require sophisticated scraping techniques or specialized access
Hybrid Approach
Many projects benefit from DIY data collection combined with professional analysis. Collect the raw data using these techniques, then engage a data scientist to select and run models and interpret the results.
From Questions to Insights
Remember that question from the beginning: "Is that a trend or is it just me?" Custom data collection gives you a way to move beyond speculation and gut feelings to find real patterns in how people discuss and experience topics.
The approach outlined in this guide won't give you the statistical rigor of academic research, but it will help you identify signals in the noise. Whether you're trying to understand how customers really feel about a product category, track emerging concerns in your industry, or spot opportunities that haven't hit mainstream research yet, these techniques can provide insights that will allow you to write with authority.
The key is starting small and frequently checking your results. If one particular model is struggling to accurately interpret your data, consider choosing another to test. Most importantly, always be transparent about your methodology and limitations—good analysis acknowledges what it can and cannot prove.