Tags: instructlab/training

v0.12.1


Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
fix(torchrun): Omit empty arguments and correct nproc_per_node type (#661)

* fix(torchrun): Omit empty arguments and correct nproc_per_node type

The command generation logic is updated to dynamically
build the torchrun command, excluding arguments that
are empty or None. This prevents them from overriding
environment variables, ensuring that torchrun can
correctly inherit its configuration. An exception is
made for integer arguments where 0 is a valid value.
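A minimal sketch of the command-building behavior described above, not the project's actual code. The helper name `build_torchrun_command`, the dictionary input, and the `--flag=value` formatting are assumptions made for illustration; the underscore-to-hyphen conversion follows a later item in this same commit message.

```python
from typing import Any


def build_torchrun_command(args: dict[str, Any], script: str) -> list[str]:
    """Hypothetical helper: build a torchrun argv, omitting unset options."""
    command = ["torchrun"]
    for name, value in args.items():
        # Skip None and empty strings so torchrun can fall back to its
        # environment variables instead of receiving an empty override.
        if value is None or value == "":
            continue
        # Integers are never skipped here, so 0 (e.g. node_rank=0) survives.
        flag = "--" + name.replace("_", "-")
        command.append(f"{flag}={value}")
    command.append(script)
    return command


if __name__ == "__main__":
    print(build_torchrun_command(
        {"nnodes": 1, "node_rank": 0, "rdzv_endpoint": "", "master_addr": None},
        "train.py",
    ))
```

The explicit `None`/empty-string check is used instead of a plain truthiness test because `if not value` would also drop a legitimate `0`.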

Additionally, the nproc_per_node argument type has been
changed from int to str to support special values
accepted by PyTorch, such as 'auto', 'gpu', and 'cpu'.

Reference: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L77-L88
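As an illustration of accepting either an integer or one of the special strings named above, here is a hedged sketch in plain Python. The function name `parse_nproc_per_node` and the `allowed` parameter are hypothetical; the real project validates this field in its own config model, and a later item in this commit message narrows the accepted strings to `gpu` only.

```python
from typing import Union

# Special values mentioned in the commit message; the exact accepted set
# in the final code may differ (a later commit narrows it).
SPECIAL_VALUES = frozenset({"auto", "gpu", "cpu"})


def parse_nproc_per_node(
    value: Union[int, str],
    allowed: frozenset = SPECIAL_VALUES,
) -> Union[int, str]:
    """Hypothetical validator: return a positive int or an allowed keyword."""
    if isinstance(value, bool):
        # bool is a subclass of int; reject it explicitly.
        raise ValueError(f"invalid nproc_per_node: {value!r}")
    if isinstance(value, int):
        if value < 1:
            raise ValueError(f"nproc_per_node must be positive, got {value}")
        return value
    text = value.strip().lower()
    if text in allowed:
        return text
    if text.isdigit() and int(text) > 0:
        return int(text)
    raise ValueError(f"invalid nproc_per_node: {value!r}")
```

Passing `allowed=frozenset({"gpu"})` would model the narrower rule that only `gpu` or an integer is accepted.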

Signed-off-by: Saad Zaher <szaher@redhat.com>

* only dynamically add torchrun args & change rdzv_id type to str

Signed-off-by: Saad Zaher <szaher@redhat.com>

* fix smoke tests

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Enable both dtypes str, int for nproc_per_node, rdzv_id

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Use Python 3.11 style for pydantic model

Signed-off-by: Saad Zaher <szaher@redhat.com>

* add all torchrun args and validate them

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Remove non-required dependencies

Signed-off-by: Saad Zaher <szaher@redhat.com>

* update datatypes only

Signed-off-by: Saad Zaher <szaher@redhat.com>

* replace _ with - when passing torchrun args

Signed-off-by: Saad Zaher <szaher@redhat.com>

* make nproc_per_node accept only gpu or int

Signed-off-by: Saad Zaher <szaher@redhat.com>

* add master_{addr, port} validate args

Signed-off-by: Saad Zaher <szaher@redhat.com>

* check for not set or empty rdzv endpoint

Signed-off-by: Saad Zaher <szaher@redhat.com>

* fix formatting error

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Update src/instructlab/training/config.py

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Update tests/smoke/test_train.py

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Update src/instructlab/training/main_ds.py

Signed-off-by: Saad Zaher <szaher@redhat.com>

* fixes indentation

Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>

* formatting

* add standalone as the fallback when neither master_addr nor rdzv_endpoint is provided

Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>

* clarify rdzv-backend arg
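The rendezvous-related items in the list above (checking for an unset or empty rdzv endpoint, validating master_addr and master_port, and falling back to standalone) could be sketched roughly as follows. The function name `rendezvous_args` and the precedence order (endpoint first, then master address) are assumptions for illustration, not confirmed by the commit message.

```python
from typing import Optional


def rendezvous_args(
    master_addr: Optional[str] = None,
    master_port: Optional[int] = None,
    rdzv_endpoint: Optional[str] = None,
) -> list[str]:
    """Hypothetical helper: choose torchrun rendezvous flags."""
    # Treat None and "" the same way: the endpoint is not set.
    if rdzv_endpoint:
        return [f"--rdzv-endpoint={rdzv_endpoint}"]
    if master_addr:
        args = [f"--master-addr={master_addr}"]
        if master_port is not None:
            args.append(f"--master-port={master_port}")
        return args
    # Neither option was provided: fall back to single-node standalone mode.
    return ["--standalone"]
```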

---------

Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
Co-authored-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>

v0.12.0


Verified

Add kernels>0.9.0 to CUDA requirements (#658)

Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>

v0.11.1

Fix isort errors

v0.10.4


Verified

Merge pull request #634 from instructlab/mergify/bp/release-v0.10/pr-628

uncap accelerate in `requirements-cuda.txt` (backport #628)

v0.10.3


Verified

Merge pull request #546 from instructlab/mergify/bp/release-v0.10/pr-455

moves deepspeed requirements into their own file; add deepspeed extras (backport #455)

v0.11


Verified

Merge pull request #528 from fynnsu/pylint-unused-argument

Enable pylint 'unused-argument' check

v0.10.2


Verified

Merge pull request #518 from instructlab/mergify/bp/release-v0.10/pr-517

deps: Remove caps on ROCm dependencies (backport #517)

v0.10.1


Verified

Merge pull request #489 from instructlab/mergify/bp/release-v0.10/pr-488

Change default internal sharding strategy to HYBRID_SHARD (backport #488)

v0.10.0


Verified

Merge pull request #451 from instructlab/dependabot/github_actions/aws-actions/configure-aws-credentials-4.1.0

build(deps): Bump aws-actions/configure-aws-credentials from 4.0.2 to 4.1.0

v0.9.0


Verified

Merge pull request #453 from JamesKunstle/rename-testing-dirs

change pytest targets. `test-unit` and `test-smoke` to `unit` and `smoke`