Contribute
Thank you for your interest in contributing to the hari project . This document lists the most common operations you may need to contribute.
How does the project work?
Project Structure
flowchart
. --> docs
. --> hari
. --> tests
The project is divided into three directories: docs, hari, and tests. Each directory has its specific function.
hari
flowchart
. --> hari --> cli
hari --> cli
hari --> session
cli --> commands
cli --> templates
templates --> templates.py
commands --> project.py
commands --> contract.py
commands --> cli.py
session --> hari_spark_session_manager.py
session --> hari --> hari_spark_session_templates.py
The CLI code and the library are in hari. The API documentation is also being written in hari, using mkdocstrings and following the Google docstrings standard. So, if you change anything in the code, remember to update the docstrings as well.
Some examples used in the docstrings are also used for tests. If you change the output format, remember to update the docstrings.
The templates directory contains base files for creating the project structure pattern. The commands directory contains the actual code that builds the CLI.
The sessions directory contains the code for managing Spark sessions. If you want to add new functionalities to the Spark session manager, you can do so in the hari_spark_session_manager.py file.
The session/hari directory contains templates for creating Spark sessions. If you want to add new templates, you can do so in the hari_spark_session_templates.py file e. g., if you want to add a new template for a Spark session with Delta Lake support, you can add it there.
About the library
At this moment, the library uses pure Python, with no external dependencies. This is intentional, as the code is quite simple. Function responses are standardized to always return a Python dictionary, since someone may want to expand this to a graphical interface or use it in a REST API. Having a serializable standard can help a lot.
About data contracts
The CLI
The CLI was built using the Typer library. You can check its documentation for more details if you want to expand CLI functionalities.
For rich outputs in the application, the Rich library is used. If you want to change anything related to the tables generated in the output, you can go directly to the documentation page for tables.
The only convention being followed regarding the CLI is that a Console object from Rich and a Typer app have already been defined. It would be best to continue using these objects.
from rich.console import Console
from typer import Argument, Typer
...
console = Console()
app = Typer()
tests
For testing, we use pytest. Its configuration can be found in the pyproject.toml file at the root of the project.
Important things to know about the tests: not all tests are only in the hari_data/tests directory. The addopts = "--doctest-modules" flag is used. So, if you modify something, be aware that docstrings also run tests and are the basis for API documentation, so be careful with changes.
If you want to skip a test, just use the flag # doctest: +SKIP.
Test coverage is automatically generated with pytest-cov and is displayed when the test task is executed:
task tests
Linters are also required for these tests.
Documentation
All documentation is based on mkdocs with the mkdocs-material theme.
flowchart
. --> docs
. --> mkdocs.yml
docs --> files.md
docs --> api
docs --> assets
docs --> templates
docs --> stylesheets
All configuration can be found in the mkdocs.yml file at the root of the repository.
Various tools are also used to complement the documentation, such as jinja templates where instructions may repeat. If you find blocks like:
{ % % }
You will know it's a template.
Templates are defined in the /docs/templates directory. In some cases, however, they may be called by variables like command.run, which appears in almost every documentation file. These macros are made with mkdocs-macros and are defined in the mkdocs configuration file:
extra:
commands:
run: poetry run hari
API Documentation
API documentation is written inside the code modules. That's why files in the docs/api directory have a tag:
::: module
This means the code contained in the docstrings will be used in this block. The mkdocstrings plugin is used for this.
Documentation in the modules follows the Google docstrings format, which is the library standard.
Tools
This project basically uses two main tools for all control:
- Poetry: For environment management and library installation
- Taskipy: For automating routine tasks, such as running tests, linters, documentation, etc.
So, make sure you have poetry installed for your contribution:
pipx install poetry
Steps to run specific tasks
Here are commands you can use to perform routine tasks, such as cloning the repository, installing dependencies, running tests, etc.
How to clone the repository
git clone https://github.com/julioszeferino/hari.git
How to install dependencies
poetry install
How to run the CLI
poetry run hari [subcommand]
How to run code checks
task lint
How to run tests
task test
How to run the documentation
task docs
Tasks you can contribute to
Tasks for Contribution
These are tasks we know need to be done for general system improvements
- Implement Spark session structuring mechanism: reference
For tasks not mapped here, you can check the project issues
Didn't find what you need here?
If you didn't find what you need, you can open an issue in the project describing what you can't do or what needs better documentation.
Continuous improvement
This document can be improved by anyone interested in making it better. So, feel free to provide more tips for people who want to contribute as well
Acknowledgements
A huge thank you to dear Eduardo Mendes(@dunossauro) for taking the time to share with the community how to develop a Python library. His video inspired the development of this project.