Why Stories Need Data
When I see several similar posts on social media, I often ask the question, "Is that a trend or is it just me?" (In fact, I did that recently, when I started seeing a lot of fairly gloomy AI-related headlines.) While it's tempting to make assumptions based on what's in your feed, or what Google has served up on your home page, data is often a better way to answer questions about what groups of people are feeling, saying, or doing.
In my work as a marketing writer, I also see data as anchoring the most valuable thought leadership content. Sure, you can sketch out industry trends based on observations and gut feelings. But data, especially when it reveals clear patterns, is what separates speculation from authority.
Why Build Your Own Custom Dataset?
Both in my personal writing and my work for clients, one of the biggest obstacles to writing a great piece is finding the right data. In many cases, interesting data and insights are locked behind paywalls. And the free datasets you can find on Kaggle and GitHub are often out of date or not quite what you want.
After spending hours searching the internet for relevant data, I looked into collecting and analyzing my own data as a more flexible and affordable way to test assumptions and flesh out stories.
Why This Approach Works Now
Recent advances in natural language processing have made large-scale text analysis a lot easier to conduct, even if you're not a machine learning expert. (If you're new to the subject and have a little exposure to programming with Python, Google's crash course is a good place to start.) Pre-trained models can now analyze sentiment, detect emotions, and categorize themes across hundreds of text blocks in minutes.
At the same time, today's public platforms contain unprecedented volumes of authentic user-generated content, and some of it can be accessed for no or low cost through APIs. This has opened up new opportunities to conduct meaningful analysis on topics where formal research doesn't exist or fails to capture the complete picture.
What You'll Learn
This guide provides a quick overview of what you need to build your own custom datasets and covers:
- Getting AI assistance for technical setup and installation
- Ethical data collection from public sources, with a focus on Reddit
- Understanding Hugging Face models for sentiment analysis
- Using AI to build analysis scripts that identify patterns in text
01 Get Technical Help from AI
Before creating your own custom datasets, you'll need to install Python and several libraries (i.e., collections of pre-written code) on your local machine. Installation processes vary significantly between Windows, Mac, and different system configurations. Fortunately, AI assistants like ChatGPT or Claude can provide personalized instructions based on your exact setup and troubleshoot errors in real time. And they have nearly infinite patience for beginners.
In fact, following instructions from ChatGPT is how I got started with Python last year when I was interested in running scripts and learning about classification models.
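Once the installation is done, a quick check script can confirm everything landed. This is a minimal sketch; the package names below (praw, pandas, transformers) are the ones this guide assumes you'll eventually need, so adjust the list to your own setup.

```python
import importlib.util
import sys

# Packages this guide assumes you'll install; edit to match your project.
REQUIRED = ["praw", "pandas", "transformers"]

def check_environment(packages):
    """Return a dict mapping each package name to True if it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}

if __name__ == "__main__":
    print(f"Python {sys.version.split()[0]}")
    for name, ok in check_environment(REQUIRED).items():
        print(f"{name}: {'installed' if ok else 'MISSING'}")
```

If anything shows as missing, paste the output back into your AI assistant and ask for the install command for your operating system.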
Comprehensive Setup Prompt
Use this prompt with ChatGPT, Claude, or another AI assistant to get step-by-step installation guidance:
Complete Setup Prompt
02 Stay Out of Trouble
One of my biggest concerns getting started with data collection was staying compliant with providers' terms of service (TOS). Most platforms have specific rules about automated data collection, and violating these can result in legal issues or account bans. I'm always careful to check these rules before gathering any data, and you should be, too. 😄
Reddit: The Ideal Starting Point
Reddit is a great place to start collecting data since it offers clear legal guidelines for researchers as well as an official API (Application Programming Interface). It's also free to use if you're collecting no more than 1,000 post records a day and you respect rate limits (i.e., don't pull down too many records per minute).
Reddit's Terms of Service: What You Need to Know
Reddit's API Terms explicitly allow research and analysis but with important restrictions:
- Rate Limiting: Maximum 100 requests per minute to prevent server overload
- Attribution: Data must be used for analysis, not re-publication without context
- Commercial Use: Selling raw Reddit data is prohibited
- Privacy: Cannot collect private messages, even if you have access
- User Respect: Must honor user account deletions and content removal
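The rate limit above translates to a minimum spacing between requests. A simple throttle like this minimal sketch (standard library only; the 100-per-minute figure comes from the restrictions listed above) keeps a collection loop under the cap:

```python
import time

class Throttle:
    """Enforce a maximum number of requests per minute by sleeping
    whenever calls arrive faster than the allowed pace."""

    def __init__(self, max_per_minute=100):
        self.min_interval = 60.0 / max_per_minute  # seconds between requests
        self.last_call = None

    def wait(self):
        """Block just long enough to respect the configured rate."""
        now = time.monotonic()
        if self.last_call is not None:
            elapsed = now - self.last_call
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()

# Usage sketch: call throttle.wait() before every API request.
# throttle = Throttle(max_per_minute=100)
# for query in queries:
#     throttle.wait()
#     # ...make one API request here...
```

At 100 requests per minute, each call waits until at least 0.6 seconds have passed since the previous one, so a well-behaved loop can't overrun the limit even when responses come back quickly.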
Other Platform Considerations
- Twitter/X: Now requires paid API access
- LinkedIn: Explicitly prohibits automated data collection
- Instagram/Facebook: Heavily restricted API with limited research access
- TikTok: No public research API available
Setting Up Reddit API Access
To get API credentials that will give you access to Reddit data, you'll need to register as a Reddit developer. This is free and straightforward:
Create Reddit Account: If you don't have one, sign up at reddit.com
Register Application: Go to https://www.reddit.com/prefs/apps/ and create a new "script" application
Get Credentials: Save your client ID (under app name) and client secret (marked "secret")
Create Your Reddit Data Collection Script
Once you have a Reddit Developer account, it's time to start thinking about what kind of data you'd like to collect. If you want to stay under the 1,000 records a day limit for free access and still conduct a meaningful analysis, it's important to be strategic about your search terms and subreddit selection.
In my experience, the most effective approach is to focus on 3-5 relevant subreddits and use specific keywords that will surface posts where people are actually discussing your topic in detail, rather than just mentioning it in passing.
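That strategy can be sketched in a few lines. The subreddit names and keywords below are placeholders for illustration, and the praw calls are left in comments because they require the API credentials from the previous step; the keyword filter itself is plain Python:

```python
# Placeholder targets -- swap in the 3-5 subreddits and keywords for your topic.
SUBREDDITS = ["technology", "artificial", "MachineLearning"]
KEYWORDS = ["layoffs", "job market", "hiring freeze"]

def matches_keywords(text, keywords):
    """True if any keyword appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(kw.lower() in lowered for kw in keywords)

def filter_posts(posts, keywords, daily_limit=1000):
    """Keep posts whose title or body mentions a keyword, capped at the
    free-tier daily limit discussed above."""
    kept = []
    for post in posts:
        combined = f"{post.get('title', '')} {post.get('body', '')}"
        if matches_keywords(combined, keywords):
            kept.append(post)
        if len(kept) >= daily_limit:
            break
    return kept

# With credentials in place, collection might look like this (hypothetical
# praw usage -- check the praw docs for your version):
# import praw
# reddit = praw.Reddit(client_id="...", client_secret="...",
#                      user_agent="my-research-script")
# raw = [{"title": p.title, "body": p.selftext}
#        for p in reddit.subreddit("+".join(SUBREDDITS)).new(limit=250)]
# posts = filter_posts(raw, KEYWORDS)
```

Filtering on title and body together helps surface posts where the topic is actually discussed, not just name-dropped, which stretches your 1,000-record budget further.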
Use this prompt to get a custom Reddit data collection script:
Reddit Collection Prompt
Ethical Data Collection Principles
- Public Only: Collect only from truly public posts, not private groups or messages
- Respect Rate Limits: Don't overwhelm servers with rapid requests
- Anonymization: Remove or hash usernames before analysis
- Purpose Limitation: Use data only for stated research purposes
- Data Security: Store collected data securely and delete when no longer needed
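The anonymization principle above can be as simple as hashing usernames before they ever reach your analysis files. Here's a minimal standard-library sketch; the salt value is an assumption, so substitute your own and keep it out of version control:

```python
import hashlib

SALT = "replace-with-your-own-secret-salt"  # assumption: keep this value private

def anonymize_username(username, salt=SALT):
    """Replace a username with a stable, irreversible hash so the same
    author can still be counted across posts without being identifiable."""
    digest = hashlib.sha256((salt + username).encode("utf-8")).hexdigest()
    return f"user_{digest[:12]}"
```

Because the hash is deterministic, repeat authors stay countable; because it's salted and one-way, nobody can work back from your CSV to a Reddit handle.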
03 Analyze Your Dataset
Once you have your text data as a CSV file, you'll be able to open it in Excel, search for keywords, and run some basic analysis. However, if you want to classify, say, 500 or more posts as having a particular emotional tone, you'll save a lot of time by using machine learning models. This is where Hugging Face comes in handy.
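For the basic keyword pass described above, you don't even need Excel. Here's a minimal sketch using only the standard library's csv module; the column name "body" is an assumption, so match it to your own export:

```python
import csv
import io

def count_keyword(csv_text, keyword, column="body"):
    """Count rows whose given column mentions the keyword (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(1 for row in reader if keyword.lower() in row.get(column, "").lower())

# Tiny inline sample standing in for your exported Reddit data.
sample = """title,body
Post one,I love this product
Post two,Shipping was slow
Post three,Love the new update
"""
print(count_keyword(sample, "love"))  # → 2
```

For a real file, swap the inline string for `open("posts.csv", encoding="utf-8")` and pass the file object straight to `csv.DictReader`.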
What is Hugging Face?
Hugging Face is a platform that hosts thousands of pre-trained machine-learning models. Instead of spending months training your own AI to understand sentiment, you can use models that have already been trained on millions of text examples. These models already "know" how to read emotions, detect topics, and understand language nuances.
Pre-trained vs Custom Models
Training a sentiment analysis model from scratch would require thousands of labeled examples and weeks of computational time. Pre-trained models give you professional-grade analysis capabilities immediately.
What is Sentiment Analysis?
Sentiment analysis is the process of determining the emotional tone behind text. It answers questions like "Is this post positive, negative, or neutral?" and "How confident are we in that assessment?"
Modern sentiment analysis goes beyond simple positive/negative classifications. Advanced models can detect:
- Basic Sentiment: Positive, Negative, Neutral
- Emotions: Joy, anger, fear, surprise, sadness, disgust
- Complex Emotions: Frustration, excitement, skepticism, hope
- Intensity: How strongly the emotion is expressed
- Mixed Emotions: When text contains multiple conflicting sentiments
Choosing the Right Model
For your first project, you'll probably want to start with a model that's proven to deliver reliable results. Here are a few examples:
For General Business Analysis: Consider starting with "cardiffnlp/twitter-roberta-base-sentiment-latest". This model, which was downloaded more than 2 million times last month, works well for social media posts, customer feedback, and general business content. It's fast and provides reliable positive/negative/neutral classifications with confidence scores.
For Detailed Emotional Analysis: Consider starting with "SamLowe/roberta-base-go_emotions". This model, which was downloaded more than 500K times last month, can detect 28 different emotions including admiration, amusement, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, and neutral.
For Product/Service Reviews: Consider starting with "nlptown/bert-base-multilingual-uncased-sentiment". Specifically trained on product reviews, this model, which was downloaded more than 1.5 million times last month, understands customer satisfaction nuances and rates sentiment on a 1-5 star scale.
Building Sentiment Analysis Scripts
After selecting your analysis goals and choosing between basic sentiment detection or more granular emotion analysis, you'll need a script that will pass your dataset to your model. Depending on your model, you may also need to install another Python library or two.
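Whichever model you choose, the script's core is the same: feed each text to the model and record the top label and its confidence. Since downloading a model isn't always practical, this sketch keeps the transformers call in comments; the helper only assumes the model returns a list of {'label': ..., 'score': ...} dicts per text, which is the pipeline's usual output shape:

```python
def top_label(model_scores):
    """Given a list of {'label': ..., 'score': ...} dicts for one text,
    return the highest-confidence label and its score."""
    best = max(model_scores, key=lambda d: d["score"])
    return best["label"], best["score"]

def classify_all(texts, classify_fn):
    """Run a classifier over every text and collect (text, label, score) rows."""
    return [(t, *top_label(classify_fn(t))) for t in texts]

# With transformers installed, the classifier can be built like this
# (hypothetical usage -- the model name comes from the recommendations above;
# check the pipeline docs for the exact output shape in your version):
# from transformers import pipeline
# clf = pipeline("sentiment-analysis",
#                model="cardiffnlp/twitter-roberta-base-sentiment-latest")
# rows = classify_all(texts, lambda text: clf(text))
```

Writing the rows out with the csv module gives you a labeled dataset you can sort and chart in Excel.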
Here are prompts to create different types of analysis scripts:
Basic Sentiment Analysis Prompt
Advanced Emotion Analysis Prompt
Time-Series Sentiment Analysis Prompt
Sentiment analysis models typically achieve 75-85% accuracy on real-world data. They struggle with sarcasm, context-dependent meaning, and domain-specific language. Always manually review a sample of results, perhaps 50 or 100 records, to get a feel for your model's accuracy for your specific use case.
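That spot check is easy to script. Here's a minimal sketch that draws a reproducible random sample for manual review and then computes agreement once you've labeled the sample by hand (the record structure is an assumption; any list works):

```python
import random

def review_sample(records, n=50, seed=42):
    """Draw a reproducible random sample of records for manual review.
    A fixed seed means you and a colleague review the same rows."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))

def agreement_rate(model_labels, human_labels):
    """Fraction of sampled records where the model matched your judgment."""
    pairs = list(zip(model_labels, human_labels))
    return sum(m == h for m, h in pairs) / len(pairs)
```

An agreement rate near the 75-85% range mentioned above suggests the model is behaving typically for your domain; much lower, and it's worth testing a different model.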
04 Dos and Don'ts for Data Collection
Understanding Data Bias
Sample Bias: Reddit users skew younger, more technical, and more politically liberal than the general population. Different subreddits have distinct cultural norms and demographics. Always acknowledge these limitations explicitly and avoid overgeneralizing findings to broader populations.
Temporal Bias: Social media discussions can be heavily influenced by recent news events, viral posts, or trending topics that may not represent long-term sentiment. Collect data over longer time periods and identify potential external influences that might skew your results.
Best Practice: Present your findings as "insights from [specific communities]" rather than universal truths about entire demographics or markets.
05 When to Build vs. Buy
At this point, you might be wondering whether it's worth building your own data collection system or just paying for professional research. The answer depends on your specific needs, timeline, and resources. Custom data collection isn't always the right choice, but when it is, it can provide unique insights that aren't available anywhere else.
The key is understanding when the DIY approach makes strategic sense versus when you're better off investing in professional services or existing datasets.
DIY might make sense when:
- You need data on emerging topics where formal research doesn't exist
- You have basic technical comfort and time to learn
- The data collection scope is manageable (thousands, not millions of records)
- Your research questions are highly specific to a particular industry or niche
Consider Professional Help When:
- Legal compliance requirements are complex (healthcare, finance, etc.)
- You need real-time data processing or advanced statistical analysis
- The project timeline is tight and technical troubleshooting isn't feasible
- Data sources require sophisticated scraping techniques or specialized access
Hybrid Approach
Many projects benefit from DIY data collection combined with professional analysis. Collect the raw data using these techniques, then engage a data scientist to select and run models and interpret the results.
From Questions to Insights
Remember that question from the beginning: "Is that a trend or is it just me?" Custom data collection gives you a way to move beyond speculation and gut feelings to find real patterns in how people discuss and experience topics.
The approach outlined in this guide won't give you the statistical rigor of academic research, but it will help you identify signals in the noise. Whether you're trying to understand how customers really feel about a product category, track emerging concerns in your industry, or spot opportunities that haven't hit mainstream research yet, these techniques can provide insights that will allow you to write with authority.
The key is starting small and frequently checking your results. If one particular model is struggling to accurately interpret your data, consider choosing another to test. Most importantly, always be transparent about your methodology and limitations—good analysis acknowledges what it can and cannot prove.