Create a clear, ordered plan to clean and preprocess a CSV dataset.
Context to use:
- Columns and data types:
<csv-columns>
CSV Columns
</csv-columns>
- Goal/use-case: Data Goal
- Known issues:
<data-issues>
Data Issues
</data-issues>
• Use only the provided columns; do not invent fields.
• If details are missing, state an assumption before proceeding.
• No code; provide actions and criteria only.
• Keep steps atomic, actionable, and tailored to the goal.
Output format (use numbered steps):
1) Objective (one sentence)
2) Assumptions (bulleted)
3) Column-specific plan: for each column, list
- expected type and validation (ranges/sets/regex),
- parsing/casting rules (e.g., date formats, locale),
- missing-value policy, default/fill logic,
- normalization/standardization if needed,
- special handling (e.g., currency, IDs, PII masking).
4) Dataset-wide steps (ordered):
- schema validation and type enforcement,
- duplicate detection/removal logic,
- whitespace/case trimming and value canonicalization,
- categorical level unification and rare-level handling,
- outlier detection and treatment strategy,
- date/time normalization (timezone, invalid dates),
- consistency checks across fields (e.g., totals, keys),
- leakage protection and feature-target separation if modeling,
- train/validation/test split rules if applicable,
- final export rules (sorting, column order, formats).
5) Validation checklist (bulleted): concrete checks and thresholds to confirm cleaning worked.
6) Final deliverables: what artifacts/datasets will exist after cleaning.
<example>
Columns and types example:
- id: integer (unique key)
- order_date: date (YYYY-MM-DD)
- email: string
- country: categorical {US,CA,GB,...}
- amount_usd: float (>= 0; refunds recorded separately)
</example>
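For reference only (the plan the prompt requests stays code-free, per the constraint above): the validations implied by the example schema might later be implemented downstream along these lines. This is a minimal sketch assuming pandas; the helper name `validate_example_schema`, the country set, and the email regex are illustrative, not prescribed by the template.

```python
import pandas as pd

# Illustrative rule parameters; extend per the real categorical level set.
ALLOWED_COUNTRIES = {"US", "CA", "GB"}
# Simple sanity-check pattern, not an RFC-complete email validator.
EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

def validate_example_schema(df: pd.DataFrame) -> dict:
    """Return rule name -> count of violating rows (hypothetical helper)."""
    # Strict YYYY-MM-DD parse; unparseable dates become NaT and are counted.
    dates = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
    return {
        "id_duplicated": int(df["id"].duplicated().sum()),
        "order_date_invalid": int(dates.isna().sum()),
        "email_invalid": int((~df["email"].fillna("").str.match(EMAIL_PATTERN)).sum()),
        "country_unknown": int((~df["country"].isin(ALLOWED_COUNTRIES)).sum()),
        "amount_negative": int((df["amount_usd"] < 0).sum()),
    }

# Tiny demo frame with one deliberate violation per rule.
df = pd.DataFrame({
    "id": [1, 2, 2],
    "order_date": ["2024-01-05", "2024-13-01", "2024-02-10"],
    "email": ["a@example.com", "not-an-email", "b@example.org"],
    "country": ["US", "CA", "XX"],
    "amount_usd": [19.99, -5.00, 0.0],
})
report = validate_example_schema(df)
```

Each count maps directly to an item in the validation checklist (step 5), so a fully clean dataset would report zero for every rule.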