Regret Analysis | Preliminary Results

Project Overview

People express regret after high-stakes life decisions: career changes, immigration moves, relationship choices. This project collects regret-related posts from Reddit, engineers structured features (sentiment scores, reversal indicators, time-to-regret), and uses statistical and machine-learning methods to understand what predicts decision reversal.

Core Question: What factors (linguistic sentiment, decision domain, or elapsed time) predict whether someone reporting regret will also report having reversed their decision?

Dataset

Data is collected from Reddit's public JSON API across nine subreddits in three life domains. Posts are filtered for explicit regret language (e.g., "I regret", "I wish I had", "my biggest mistake") and screened to exclude hypothetical or future-oriented regret.
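The filtering step described above can be illustrated with a minimal sketch. The three regret phrases are the ones quoted in the text; the hypothetical-language cues and the names `REGRET_PATTERN`, `HYPOTHETICAL_PATTERN`, and `is_regret_post` are assumptions made for illustration, not the project's actual pattern list.

```python
import re

# Regret phrases quoted in the text; the real pattern list is probably longer.
REGRET_PATTERN = re.compile(
    r"\b(i regret|i wish i had|my biggest mistake)\b", re.IGNORECASE
)

# Assumed cues for hypothetical or future-oriented regret (illustrative only).
HYPOTHETICAL_PATTERN = re.compile(
    r"\b(will i regret|would i regret|going to regret|might regret|"
    r"afraid (that )?i('ll| will) regret|don't want to regret)\b",
    re.IGNORECASE,
)


def is_regret_post(text: str) -> bool:
    """Return True if the text has a regret phrase and no hypothetical cue."""
    if HYPOTHETICAL_PATTERN.search(text):
        return False
    return REGRET_PATTERN.search(text) is not None


print(is_regret_post("I regret leaving my job last year."))
print(is_regret_post("Will I regret moving abroad?"))
```

This sketch rejects a whole post as soon as one hypothetical cue appears, which is a deliberately conservative choice; a sentence-level screen would keep more posts.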

Property	Value
Raw posts collected	~21,800 across nine subreddits
Filtered regret posts	3,604 (structured analysis subset)
Domains	Career (1,998) - Immigration (297) - Relationships (1,309)
Subreddits	r/cscareerquestions, r/careerguidance, r/jobs, r/careeradvice, r/USCIS, r/IWantOut, r/immigration, r/relationship_advice, r/relationships
Engineered features	vader_compound, vader_neg, urgency_score, reversal, time_to_regret_days, topic, emotion, event_type, agency_score, hedging_score, social_embed_score, causal_reasoning_score, future_orient_score, comment_count_log, score_log, sentence embeddings (PCA-10)

The broader collection pipeline scraped ~21.8k raw posts. After regret filtering, deduplication, and structured feature extraction, the preliminary analysis uses 3,604 confirmed regret posts enriched with VADER sentiment and improved temporal extraction (a time-to-regret value is now recovered for 47.2% of posts, up from 10.7%).
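As a rough illustration of regex-based temporal extraction, the sketch below converts a simple duration phrase into days. It covers only plain "number + unit" phrases; the names `UNIT_DAYS`, `DURATION`, and `time_to_regret_days` and the 30-day month and 365-day year conversions are assumptions for this example, not the project's extraction rules.

```python
import re
from typing import Optional

# Approximate unit lengths in days (assumed conversions).
UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}

# Matches phrases such as "2 years" or "6 months".
DURATION = re.compile(r"\b(\d+)\s*(day|week|month|year)s?\b", re.IGNORECASE)


def time_to_regret_days(text: str) -> Optional[int]:
    """Return the first duration found in the text as days, or None."""
    match = DURATION.search(text)
    if match is None:
        return None
    return int(match.group(1)) * UNIT_DAYS[match.group(2).lower()]


print(time_to_regret_days("I quit 2 years ago and I regret it."))
print(time_to_regret_days("I regret it."))
```

A binary indicator such as has_time could then be derived by checking whether the returned value is None.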

Methods

Collection & Filtering: Reddit API scraping followed by regex-based regret extraction with exclusion of hypothetical language.
Feature Engineering: VADER sentiment analysis on regret sentences; urgency scores from lexical cues; reversal labels from action verbs; time-to-regret from enhanced temporal extraction (regex + dateparser patterns); sentence embeddings (all-MiniLM-L6-v2, PCA-50); 7-class emotion classification (GoEmotions distilRoBERTa); triggering event taxonomy (per-domain keyword-based event type labeling); psycholinguistic features (agency, hedging, social embeddedness, causal reasoning, future orientation); community engagement metrics (log-comments, log-score, upvote ratio).
Causal Inference: Propensity score matching (1:1 nearest-neighbor, career vs immigration); cohort/temporal analysis with logistic regression (year x domain interaction); Kaplan-Meier by era.
Label Validation: Comment-thread scraping (n=700) for reversal confirmation; keyword vs comment-based label agreement (Cohen's kappa); LLM-based annotation pipeline (script ready, requires API key).
Modeling: Cross-domain reversal classification (LR/RF/GBM, stratified 5-fold CV, bootstrap CIs); leave-one-domain-out generalization; NMF topic modeling; Cox PH regression with 13 covariates; nested logistic models for subreddit confounding.
Statistical Testing: Chi-square tests; Mann-Whitney U with effect sizes; Kruskal-Wallis H; log-rank tests; likelihood ratio tests for model comparison; log-odds ratio linguistic analysis (Monroe et al. 2008); bootstrap CIs on all key metrics.
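The label-validation step compares keyword-based and comment-based reversal labels using Cohen's kappa. A minimal sketch of that statistic follows; the function name `cohens_kappa` and the two label lists are hypothetical examples, not project data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label sets, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both sources agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two sources labeled independently.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    if expected == 1.0:
        return 1.0  # both sources use a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = reversal, 0 = no reversal.
keyword_labels = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
comment_labels = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(round(cohens_kappa(keyword_labels, comment_labels), 3))
```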



Preliminary Results

Result 1

Reversal rates differ significantly across life domains

Question: Do regret narratives lead to decision reversal at different rates across career, immigration, and relationships?

Finding: Relationships show the highest reversal rate (39.5%), followed closely by career (38.5%). Immigration is notably lower at 27.6%, despite strong regret language. This difference is statistically significant (chi-square = 14.57, p < 0.001).

Key insight

Structural constraints (e.g., immigration systems) limit the ability to act on regret, even when emotional intensity is high. Immigration combines strong regret expression with the lowest reversal rate, a pattern consistent with barriers to undoing a move or visa path.
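The domain comparison can be sketched as a chi-square test of independence on a 3 x 2 table of domain by reversal. The counts below are back-calculated from the reported domain sizes and rounded reversal rates, so they are approximations and the resulting statistic may not reproduce the reported 14.57 exactly. The function name `chi_square_independence` is a hypothetical helper.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# (domain size, reported reversal rate); reversal counts are approximations.
domains = {
    "career": (1998, 0.385),
    "immigration": (297, 0.276),
    "relationships": (1309, 0.395),
}
table = []
for n, rate in domains.values():
    reversed_count = round(n * rate)
    table.append([reversed_count, n - reversed_count])

stat, dof = chi_square_independence(table)
# With exactly 2 degrees of freedom the chi-square tail probability
# has the closed form exp(-x / 2).
p_value = math.exp(-stat / 2) if dof == 2 else None
print(stat, dof, p_value)
```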

Domain	Reversal rate	95% CI
Career	38.5%	[36.4%, 40.6%]
Immigration	27.6%	[22.6%, 32.9%]
Relationships	39.5%	[37.0%, 42.3%]
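The Methods section mentions bootstrap confidence intervals on key metrics. Assuming the intervals for the reversal rates are percentile bootstrap intervals, a minimal sketch looks like this. The function name `bootstrap_rate_ci`, the number of resamples, the seed, and the immigration count of 82 reversals (back-calculated from 27.6% of 297) are all assumptions for illustration.

```python
import random


def bootstrap_rate_ci(successes, n, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a proportion of `successes` out of `n`."""
    rng = random.Random(seed)
    p = successes / n
    rates = []
    for _ in range(n_boot):
        # Resampling n binary outcomes with replacement is equivalent to
        # drawing n independent Bernoulli(p) values.
        hits = sum(rng.random() < p for _ in range(n))
        rates.append(hits / n)
    rates.sort()
    lower = rates[int((alpha / 2) * n_boot)]
    upper = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Approximate immigration counts: about 82 reversals among 297 posts.
print(bootstrap_rate_ci(82, 297))
```

The immigration interval is the widest of the three because that domain has by far the fewest posts.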

Evidence

Figure: Reversal Rate by Domain

Figure: Regret Posts by Domain (n = 3,604)

Model Performance

Model	CV AUC (5-fold)	Test AUC	95% CI
Logistic Regression	0.657 +/- 0.017	0.641	[0.598, 0.680]
Random Forest	0.675 +/- 0.024	0.652	[0.610, 0.691]
Gradient Boosting	0.653 +/- 0.018	0.631	[0.589, 0.671]

The best-performing model (Random Forest) achieves a test AUC of 0.652 (95% CI [0.610, 0.691]), indicating that text and domain features capture moderate but limited signal for predicting reversal. Top features include has_time, left, vader_compound, and vader_neg.
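To make the AUC values easier to interpret: AUC is the probability that a randomly chosen reversal post receives a higher model score than a randomly chosen non-reversal post, so 0.5 is chance and 1.0 is perfect ranking. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy labels and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = reversal, 0 = no reversal.
labels = [1, 0, 1, 0, 1, 0]
model_scores = [0.9, 0.8, 0.7, 0.3, 0.4, 0.6]
print(roc_auc(labels, model_scores))
```

Under this reading, an AUC near 0.65 means the model ranks a reversal post above a non-reversal post roughly two times out of three.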

Figure: Precision-Recall Curve (Random Forest)

Figure: Top 20 Feature Importances

The feature importance plot reveals that whether a post mentions a specific time frame (has_time) is the strongest single predictor of reversal, followed by action words like "left" and sentiment features (vader_compound, vader_neg). Domain membership also contributes, with immigration as a distinct category.

Next steps

Planned follow-ups

Expand the immigration corpus (r/greencard, r/h1b) to increase statistical power for immigration sub-analyses
Develop a fine-grained immigration sub-classifier, since USCIS (bureaucratic), IWantOut (voluntary), and asylum posts likely have different reversal dynamics
Formally test whether event type moderates or mediates the domain-reversal relationship

Result 2

Sentiment, not urgency keywords, captures the emotional signal in regret

Question: Does the emotional tone of regret language predict whether regret leads to reversal?

Finding: Simple urgency keyword counts (mean 0.17) show almost no correlation with reversal (r = 0.05). VADER sentiment, by contrast, differs significantly between groups: reversal posts are more negative (mean compound = -0.127) than non-reversal posts (mean compound = -0.085; Mann-Whitney U, p = 0.003).
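The group comparison above relies on a Mann-Whitney U test, and the Methods section mentions effect sizes. A minimal sketch follows, using the normal approximation without tie or continuity correction and reporting the rank-biserial correlation as the effect size. The function name `mann_whitney_u` and the two small samples of compound scores are hypothetical, not project data.

```python
import math


def mann_whitney_u(a, b):
    """Return U for sample a, a two-sided p-value, and rank-biserial r.

    Uses the normal approximation with no tie or continuity correction,
    so it is only a rough guide for small or heavily tied samples.
    """
    n1, n2 = len(a), len(b)
    # U counts the pairs in which the value from a exceeds the value from b.
    u = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    rank_biserial = 2 * u / (n1 * n2) - 1
    return u, p_value, rank_biserial


# Hypothetical VADER compound scores for the two groups.
reversal_scores = [-0.6, -0.4, -0.3, -0.2, 0.1]
non_reversal_scores = [-0.5, -0.1, 0.0, 0.2, 0.3]
print(mann_whitney_u(reversal_scores, non_reversal_scores))
```

A negative rank-biserial value here means the first group (reversal posts) tends to have lower, more negative scores than the second.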