
Data Scientist, NLP — Text Homogeneity & Anomaly Detection
Upwork
Remote
•2 hours ago
About
Looking for a skilled data scientist with NLP experience to analyze a confidential CSV containing ~2,000 short text entries. The goal is to determine whether the entries show unusual patterns, such as excessive repetition, short or generic wording, or timing anomalies. In short, the task is to find unnatural records within what should be a natural set of records.

Responsibilities
- Clean and preprocess the dataset, handling Unicode and language noise.
- Run exploratory analysis, including length, lexical diversity, and sentiment proxies.
- Create embeddings and run clustering to surface homogeneous groups.
- Apply anomaly detection and temporal analysis to spot suspicious bursts.
- Produce visualizations that clearly explain findings to a non-technical audience.
- Deliver a reproducible Jupyter notebook or Python scripts, plus a short written summary.

Required skills
- Strong Python, pandas, and scikit-learn experience.
- Practical NLP experience with spaCy, HuggingFace, or similar.
- Familiarity with embeddings, clustering (DBSCAN, k-means), and anomaly detection (Isolation Forest, LOF).
- Experience creating clear charts and concise writeups.
- Good communicator, able to explain methods and limitations.

Nice to have
- Prior work on detecting repetitive or coordinated text.
- Stylometry or forensic linguistics exposure.
- Experience comparing datasets to public fake-review benchmarks.

Deliverables (suggested)
- Jupyter notebook or scripts with commented code.
- Visuals: rating distribution, text length boxplot, cluster map, timeline of suspicious activity.
- Short report summarizing methodology, key signals, and a reasoned likelihood estimate of manipulation.
- Brief recommendations for next steps.

Timeline
- Estimated scope: 15–25 hours.
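
For applicants who want a concrete picture of the signals mentioned above (repetition, lexical diversity, timing bursts), here is a minimal sketch. It is illustrative only, not the expected solution: it uses only the Python standard library, swaps token-set Jaccard similarity in for embeddings and clustering, and swaps a simple hourly count in for proper temporal anomaly detection. The sample rows, function names, and thresholds are all hypothetical; the real CSV schema is not described in this posting.

```python
import re
import unicodedata
from collections import Counter
from datetime import datetime


def tokens(text):
    """NFKC-normalize, lowercase, and split into word tokens."""
    return re.findall(r"\w+", unicodedata.normalize("NFKC", text).lower())


def type_token_ratio(text):
    """Lexical diversity: distinct tokens divided by total tokens."""
    toks = tokens(text)
    return len(set(toks)) / len(toks) if toks else 0.0


def jaccard(a, b):
    """Jaccard similarity between the token sets of two texts."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def near_duplicate_pairs(texts, threshold=0.8):
    """Index pairs whose similarity meets the (assumed) threshold.

    Compares every pair, so cost grows quadratically with row count.
    """
    return [
        (i, j)
        for i in range(len(texts))
        for j in range(i + 1, len(texts))
        if jaccard(texts[i], texts[j]) >= threshold
    ]


def burst_hours(timestamps, min_count=3):
    """Clock hours containing at least min_count records."""
    counts = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return sorted(hour for hour, n in counts.items() if n >= min_count)


# Hypothetical sample rows: (ISO timestamp, text).
rows = [
    ("2024-01-01T10:05:00", "Great product, works well!"),
    ("2024-01-01T10:07:00", "Great product works well"),
    ("2024-01-01T10:09:00", "great product, works well!!"),
    ("2024-01-03T18:30:00",
     "Battery lasted two days on a camping trip, which surprised me."),
]
stamps = [datetime.fromisoformat(ts) for ts, _ in rows]
texts = [text for _, text in rows]

pairs = near_duplicate_pairs(texts)
bursts = burst_hours(stamps)
diversity = [round(type_token_ratio(t), 2) for t in texts]
```

On these sample rows, the first three entries pair up as near-duplicates and fall in the same hour, while the fourth stands apart. A full analysis would likely replace the Jaccard step with sentence embeddings plus DBSCAN or k-means, and the hourly count with Isolation Forest or LOF, as listed under the required skills.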