Distributed training with Kubernetes #17

@jlewi

Opening this issue to start a discussion about whether it is worth investing in making it easy to run TensorFlow Agents on Kubernetes (K8s).

For some inspiration, you can look at the TfJob CRD.

Some questions:

  1. Is there a need to be able to distribute the environments across multiple machines?
  2. What is the communication pattern between the simulations and the TensorFlow job?
    * Is data fetched from all simulations simultaneously?
    * Does each simulation need to be individually addressable?
