Create a clear, ordered plan to clean and preprocess a CSV dataset.
Context to use:
- Columns and data types:
<csv-columns>
CSV Columns
</csv-columns>
- Goal/use-case: Data Goal
- Known issues:
<data-issues>
Data Issues
</data-issues>
• Use only the provided columns; do not invent fields.
• If details are missing, state an assumption before proceeding.
• No code; provide actions and criteria only.
• Keep steps atomic, actionable, and tailored to the goal.
Output format (use numbered steps):
1) Objective (one sentence)
2) Assumptions (bulleted)
3) Column-specific plan: for each column, list
- expected type and validation (ranges/sets/regex),
- parsing/casting rules (e.g., date formats, locale),
- missing-value policy, default/fill logic,
- normalization/standardization if needed,
- special handling (e.g., currency, IDs, PII masking).
4) Dataset-wide steps (ordered):
- schema validation and type enforcement,
- duplicate detection/removal logic,
- whitespace/case trimming and value canonicalization,
- categorical level unification and rare-level handling,
- outlier detection and treatment strategy,
- date/time normalization (timezone, invalid dates),
- consistency checks across fields (e.g., totals, keys),
- leakage protection and feature-target separation if modeling,
- train/validation/test split rules if applicable,
- final export rules (sorting, column order, formats).
5) Validation checklist (bulleted): concrete checks and thresholds to confirm cleaning worked.
6) Final deliverables: what artifacts/datasets will exist after cleaning.
<example>
Columns and types example:
- id: integer (unique key)
- order_date: date (YYYY-MM-DD)
- email: string
- country: categorical {US,CA,GB,...}
- amount_usd: float (>= 0; refunds recorded separately)
</example>
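For reference only (the plan the prompt requests stays code-free, per the constraint above): the validations implied by the example schema might later be implemented downstream along these lines. This is a minimal sketch assuming pandas; the helper name `validate_example_schema`, the country set, and the email regex are illustrative, not prescribed by the template.

```python
import pandas as pd

# Illustrative rule parameters; extend per the real categorical level set.
ALLOWED_COUNTRIES = {"US", "CA", "GB"}
# Simple sanity-check pattern, not an RFC-complete email validator.
EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

def validate_example_schema(df: pd.DataFrame) -> dict:
    """Return rule name -> count of violating rows (hypothetical helper)."""
    # Strict YYYY-MM-DD parse; unparseable dates become NaT and are counted.
    dates = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
    return {
        "id_duplicated": int(df["id"].duplicated().sum()),
        "order_date_invalid": int(dates.isna().sum()),
        "email_invalid": int((~df["email"].fillna("").str.match(EMAIL_PATTERN)).sum()),
        "country_unknown": int((~df["country"].isin(ALLOWED_COUNTRIES)).sum()),
        "amount_negative": int((df["amount_usd"] < 0).sum()),
    }

# Tiny demo frame with one deliberate violation per rule.
df = pd.DataFrame({
    "id": [1, 2, 2],
    "order_date": ["2024-01-05", "2024-13-01", "2024-02-10"],
    "email": ["a@example.com", "not-an-email", "b@example.org"],
    "country": ["US", "CA", "XX"],
    "amount_usd": [19.99, -5.00, 0.0],
})
report = validate_example_schema(df)
```

Each count maps directly to an item in the validation checklist (step 5), so a fully clean dataset would report zero for every rule.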