
Hari
Hari is a Python library designed to establish a standardized pattern for developing PySpark applications, leveraging the concept of "data contracts" to ensure consistency, reliability, and maintainability in data engineering workflows.
All application operations are based on the hari command. This command has a subcommand for each action the application can perform, such as create and contract.
How to Install
To install the CLI, it is recommended to use pipx:
pipx install hari-data
Although this is just a recommendation! You can also install the project using your preferred package manager, such as pip:
pip install hari-data
How to create a new Hari project?
You can create a new project via the command line. For example:
hari create project_name
This command create a directory project_name and print this message:
Directories and Files Created
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Type ┃ Name ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Directory │ project_name/configs │
│ Directory │ project_name/utils │
│ File │ project_name/configs/configs.yaml │
│ File │ project_name/utils/helpers.py │
│ File │ project_name/utils/validators.py │
│ File │ project_name/job.py │
│ File │ project_name/README.md │
└───────────┴───────────────────────────────────┘
Project project_name created successfully!
Happy coding! 🚀
How to create a new Data Contract?
Data contracts are one of the main features of the Hari library. To create a new contract, you must be inside a Hari project directory.
Run the following command:
hari contract new contract_name
You will be prompted for the following information:
- Description (optional): A description for your contract.
- Owner email (optional): The email of the contract owner.
- Output table name: Name of the output table (e.g., table_name, file_name).
- Output table format: Format of the output table (e.g., parquet, csv).
- Output table path: Path or URI for the output table.
- Columns: You will be able to add one or more columns, specifying name, type, nullability, uniqueness, and optionally size/precision.
- Partitions: Optionally, select columns to use as partitions.
- SLA: Optionally, add SLA details such as update frequency and tolerance.
Example session:
$ hari contract new sales_contract
Description of the contract (optional): Sales data contract
Email of the contract owner (optional): user@email.com
Name of the output table (e.g., table_name, file_name): sales.daily
Format of the output table (e.g., parquet, csv): parquet
Path of the output table (e.g., uri_catalog, local_path): /data/sales
Do you want to add a column? [Y/n]: Y
Column name: sale_id
Column type (e.g., string, int, double): int
Can the column be null? [Y/n]: n
Are the column values unique? [y/N]: y
# ...repeat for more columns...
Do you want to add partition columns? [y/N]: y
Choose a column for partitioning (or press Enter to finish): sale_id
Do you want to add another partition column? [y/N]: n
Do you want to add SLA details? [y/N]: y
Frequency of updates (e.g., daily, weekly, monthly): daily
Tolerance for SLA (e.g., 1 hour, 30 minutes): 1 hour
Proceeding to create the contract...
Saving contract data to YAML file...
Contract sales_contract created successfully!
Happy coding! 🚀
Example session with interactive prompts:
$ hari contract new sales_contract
Description of the contract (optional): Sales data contract
Email of the contract owner (optional): user@email.com
Name of the output table (e.g., table_name, file_name): sales.daily
Format of the output table (e.g., parquet, csv): parquet
Path of the output table (e.g., uri_catalog, local_path): /data/sales
Do you want to add a column? [Y/n]: Y
Column name: sale_id
Column type (e.g., string, int, double): int
Can the column be null? [Y/n]: n
Are the column values unique? [y/N]: y
Do you want to add a column? [y/N]: y
Column name: sale_date
Column type (e.g., string, int, double): date
Can the column be null? [Y/n]: n
Are the column values unique? [y/N]: n
Do you want to add partition columns? [y/N]: y
Choose a column for partitioning (or press Enter to finish): sale_date
Do you want to add another partition column? [y/N]: n
Do you want to add SLA details? [y/N]: y
Frequency of updates (e.g., daily, weekly, monthly): daily
Tolerance for SLA (e.g., 1 hour, 30 minutes): 1 hour
Proceeding to create the contract...
Saving contract data to YAML file...
Contract sales_contract created successfully!
Happy coding! 🚀
You can also use command-line options to provide values directly, skipping the interactive prompts. For example:
hari contract new sales_contract \
--description "Sales data contract" \
--owner-email "user@email.com" \
--output-table-name "sales.daily" \
--output-table-format parquet \
--output-table-path "/data/sales"
If you provide all required options, the command will not prompt for those values interactively. You will still be prompted for columns, partitions, and SLA unless you provide those through additional options (if supported).
The contract will be saved as a YAML file in the contracts directory of your project.
Example of a generated contract YAML file:
version: 1.0.0
creation_date: '2025-08-03'
name: sales_contract
description: Sales data contract
owner_email: user@email.com
output_table:
name: sales.daily
format: parquet
path: /data/sales
partitioned_by:
- sale_date
columns:
- name: sale_id
type: int
is_nullable: false
is_unique: true
- name: sale_date
type: date
is_nullable: false
is_unique: false
sla:
frequency: daily
tolerance: 1 hour
Is it possible to have more than one data contract per project?
Yes. The idea is that you create one data contract for each output your project will generate.
Is it possible to have data contracts for inputs?
Yes. However, I recommend evaluating whether it is really worth creating contracts for inputs. Prefer to create them in situations of great complexity where an unexpected change in the inputs would be detrimental to the process.
Can I add more parameters to the contract after it is created?
Yes. Unfortunately, this has not yet been implemented via CLI. But you can edit the file content manually. To avoid incompatibility with new features that will be released, I recommend keeping at least the standard parameters, but feel free to add whatever you find necessary.
More information about Hari
To discover other options, you can use the --help flag
hari --help
Usage: hari [OPTIONS] COMMAND [ARGS]...
╭─ Commands ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ create Create a new project. │
│ contract Manage data contracts. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯