AI / Data Engineer Community Insights Dataset (Reddit + Forums)

Upwork

Remote

1 month ago

No application

About

We’re building a dataset and lightweight model for community discussion signals across global fashion and footwear spaces. The aim: extract real-world patterns around fit, comfort, material quality, purchase behavior, and authenticity chatter from open online discussions, then prepare this data for downstream modeling and insights. We’re looking for an experienced AI/data engineer capable of turning unstructured community text into structured, model-ready intelligence, using whatever approach you choose.

Phase 1 Deliverables (3–4 Weeks)

1️⃣ Data Collection Pipeline
• Collect public fashion/footwear threads (Reddit + major forums) via official APIs or compliant archives.
• Deliver reproducible, rate-limited Python scripts with logging, retries, and a clear README.
• Output: JSON/NDJSON with post text, timestamp, thread title, anonymised ID, and metadata.

2️⃣ Data Cleaning & De-duplication
• Strip PII and spam, merge near-duplicates, normalize timestamps and encodings.
• Output: validated dataset with a consistent schema and unique IDs.

3️⃣ Annotation Schema + Mini Labeled Set
• Propose a concise taxonomy (e.g. comfort | sizing | material | authenticity | purchase context | sentiment).
• Label 2,000–5,000 examples, or design a labeling workflow with quality metrics (Cohen’s κ or α).

4️⃣ Semantic Clustering / Topic Modeling
• Use embeddings or LDA to surface top themes and frequent entities (brands, SKUs, models).
• Provide brief cluster summaries and keyword clouds.

5️⃣ Prototype Classifier (optional)
• Deliver a prompt-based or lightly fine-tuned transformer that maps posts to schema tags.
• Include a small evaluation set + metrics (precision/recall/F1).

6️⃣ Documentation & Handover
• Clean repo, labeling guide, and a 30–45 min walkthrough call.

⸻

Legal & Ethical Guardrails
• Public-only data (no private groups or paywalls).
• Remove or hash any identifiers.
• Summarise ToS compliance approach.
• NDA required before delivery.
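To give a sense of the “rate-limited scripts with logging and retries” asked for in Deliverable 1, here is a minimal, generic sketch. All names and parameters (`fetch_with_retries`, `min_interval`, etc.) are illustrative assumptions, not part of this brief; a real pipeline would wrap the official Reddit API client with its documented limits.

```python
import logging
import random
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("collector")

def fetch_with_retries(fetch, max_retries=4, base_delay=1.0, min_interval=1.0):
    """Call `fetch()` with client-side rate limiting and exponential backoff.

    Illustrative sketch: `fetch` is any zero-argument callable (e.g. one API
    page request). Waits at least `min_interval` seconds between attempts,
    retries failures with jittered exponential backoff, and logs each retry.
    """
    last_call = 0.0
    for attempt in range(max_retries + 1):
        # Simple throttle: enforce a minimum gap between attempts.
        wait = min_interval - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)
        last_call = time.monotonic()
        try:
            return fetch()
        except Exception as exc:
            if attempt == max_retries:
                raise  # exhausted retries; surface the error to the caller
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.5)
            log.warning("fetch failed (%s); retry %d in %.1fs",
                        exc, attempt + 1, delay)
            time.sleep(delay)
```

The same wrapper can be reused for every endpoint in the pipeline, which keeps retry behaviour consistent and auditable in the logs.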
⸻

🔧 Skills We Expect (Expert-Level)

Data Engineering
• Advanced Python (async requests, structured logging, fault-tolerant retry logic).
• Reddit API / Pushshift / compliant data interfaces.
• Modular ETL (Prefect / Dagster / Airflow) with reproducible outputs (Parquet / Arrow / NDJSON).
• Docker + Git for reproducible environments.

NLP / ML
• Text cleaning (spaCy / regex / transformers).
• Semantic embeddings + clustering (FAISS, HDBSCAN, BERTopic).
• Prompt engineering and few-shot classification (Hugging Face).
• Feature extraction and entity linking (brands, SKUs, materials).

Labeling & Quality
• Experience with Label Studio / Prodigy / Snorkel.
• Designing schemas + label agreement metrics.

Documentation & Delivery
• Clean README, well-commented code, and a short insight report.

Nice to Have
• Consumer or e-commerce data background.
• Streamlit / Metabase dashboards for signal visualization.
• Familiarity with privacy-preserving PII redaction workflows.

⸻

Evaluation Criteria

Please include:
1. Your Phase 1 plan (≤ 300 words) + ToS compliance notes.
2. Links to GitHub repos or relevant projects.
3. Timeline and milestone pricing.
4. Quick answers:
 • How would you responsibly collect Reddit data today?
 • One technique you use to keep labels consistent.
 • Can you sign an NDA and deliver within 3–4 weeks?

⸻

Why It’s Interesting
• You’ll work with real, high-signal community data, not synthetic samples.
• Fast, visible outcomes with no academic drag.
• Long-term path into product data ops or model deployment.

Apply with: plan + portfolio + availability + rate.
Subject line: RED: fashion-data
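For reference, the label agreement metric named in the brief (Cohen’s κ for two annotators) is small enough to compute in plain Python. This is a generic sketch, not project code; in practice a library routine such as scikit-learn’s `cohen_kappa_score` would do the same job.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e the agreement expected by chance from each annotator's
    label marginals. 1.0 = perfect agreement, 0.0 = chance-level.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from per-category marginal frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in set(labels_a) | set(labels_b))
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

Reporting κ per taxonomy category, not just overall, tends to show exactly which labels (e.g. sizing vs. comfort) annotators confuse.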