Tags: liuzurang/pai
Tags
Initial HivedScheduler (microsoft#3283) Authors: @zhypku Scheduling Algorithm Core @mzmssg Scheduling Algorithm Config Parser @yqwang-ms User Experience/Interface, Scheduling Framework/Architecture, K8S Integration, and others
Release v0.13.0
New Features
------------
* OpenPAI protocol:
- Introduce [OpenPAI protocol](./docs/pai-job-protocol.yaml) and job submission v2
([microsoft#2260](microsoft#2260))
- Add new job submission v2 plugin
([microsoft#2461](microsoft#2461))
* Web portal:
- Add login page for guests
([microsoft#2544](microsoft#2544))
- Add user home page
([microsoft#2614](microsoft#2614))
- Job Status
- My virtual clusters
- Available GPU nodes (whole cluster)
- My recent jobs

- Add new user management page
([microsoft#2726](microsoft#2726), [microsoft#2796](microsoft#2796))
- User Management UX refactoring with new layout and themes
([microsoft#2726](microsoft#2726), [microsoft#2796](microsoft#2796))
Improvements
------------
* OpenPAI protocol:
- Update example jobs in marketplace v2 for OpenPAI protocol
([microsoft#2827](microsoft#2827))
* Web portal:
- Refine styles in job pages
([microsoft#2829](microsoft#2829), [microsoft#2856](microsoft#2856),
[microsoft#2858](microsoft#2858), [microsoft#2862](microsoft#2862))
- Refine alert message in job pages
([microsoft#2698](microsoft#2698))
- Reduce the build bundle size to improve webportal performance
([microsoft#2715](microsoft#2715))
* Rest server:
- Add job v1 config to v2 converter
([microsoft#2756](microsoft#2756))
- Check default runtime before starting Docker
([microsoft#2754](microsoft#2754))
* Framework launcher:
- Upgrade to Hadoop 2.9.0
([microsoft#2704](microsoft#2704))
* Job exporter:
- Change triggering rule for exporter hangs
([microsoft#2766](microsoft#2766))
- Add GPU temperature detection
([microsoft#2757](microsoft#2757))
* Watchdog:
- Use `/api/v1/pods` to get all pods
([microsoft#2750](microsoft#2750))
* Deployement:
- Allow user to use <kbd>Backspace</kbd> in `paictl` input
([microsoft#2769](microsoft#2769))
- Disable InfiniBand driver installation by default
([microsoft#2595](microsoft#2595))
Documentation
-------------
* Refine document of VS Code extension
([microsoft#2707](microsoft#2707))
* Add document for PAI storage
([microsoft#2822](microsoft#2822))
* OpenPAI protocol specification document
([microsoft#2260](microsoft#2260))
* Job submission v2 plugin document
([microsoft#2820](microsoft#2820))
* Update RESTful API document for API v2
([microsoft#2816](microsoft#2816))
* Fix typos in document
([microsoft#2818](microsoft#2818))
Bug Fixes
---------
* Web portal:
- Fix text broken when create or edit user
([microsoft#2849](microsoft#2849))
- Fix token authentication bug
([microsoft#2843](microsoft#2843))
- Fix retry count's margin-top
([microsoft#2845](microsoft#2845))
- Fix job clone bug
([microsoft#2836](microsoft#2836))
- Fix home page's responsive layout
([microsoft#2805](microsoft#2805))
- Fix job list page filter bug
([microsoft#2787](microsoft#2787))
- Fix home page failed to load virtual cluster list bug
([microsoft#2774](microsoft#2774))
* Rest server:
- Check duplicate job in submission v2
([microsoft#2837](microsoft#2837))
* Hadoop:
- Increase YARN kill container timeout
([microsoft#2778](microsoft#2778))
- Remove cross origin in resource manager
([microsoft#2758](microsoft#2758))
- Fix Haddoop AI matching nvidia-smi regex
([microsoft#2681](microsoft#2681))
Known Issues
------------
* Deployments issues on NVIDIA DGX2
([microsoft#2742](microsoft#2742))
Release v0.12.0
New Features
------------
* Web portal:
- Display error message in job detail page
[microsoft#2456](microsoft#2456)
- Import users from CSV file directly and show the final results
[microsoft#2495](microsoft#2495)
- Add TotalGpuCount and TotalTaskCount into job list
[microsoft#2499](microsoft#2499)
* Deployment
- Add cluster version info
[microsoft#2528](microsoft#2528)
- Check if the nodes are ubuntu 16.04
[microsoft#2520](microsoft#2520)
- Check duplicate hostname
[microsoft#2403](microsoft#2403)
Improvements
------------
* Web portal:
- Replace the suffix if a cloned job is resubmited
[microsoft#2451](microsoft#2451)
- Refine view full log
[microsoft#2431](microsoft#2431)
- Job list: optimize filter
[microsoft#2444](microsoft#2444)
- Replace the url module with the querystring module
[microsoft#1825](microsoft#1825)
* REST server:
- Follow REST protocol in job create controller
[microsoft#2481](microsoft#2481)
- Add task state; Add job's retry details; Refine job config
[microsoft#2306](microsoft#2306)
- Remove error message
[microsoft#2464](microsoft#2464)
* Framework Launcher:
- Add more info into SummarizedFrameworkInfo
[microsoft#2435](microsoft#2435)
* Alert manager:
- Send resolved email and make user can config repeat interval
[microsoft#2438](microsoft#2438)
- Monitor process memory consumption and alert for `omiagent` and
`omsagent` [microsoft#2419](microsoft#2419)
Documentation
-------------
- Doc refactoring and update hello-world sample
[microsoft#2445](microsoft#2445)
- Add Chinese translation
[microsoft#2344](microsoft#2344)
Bug Fixes
---------
* Web portal:
- Add validation when submitting job by json
[microsoft#2375](microsoft#2375)
- Job List-filter UI fix
[microsoft#2479](microsoft#2479)
- Fix job detail "jobConfig is null" bug
[microsoft#2500](microsoft#2500)
- Fix job detail page's "retry link"
[microsoft#2478](microsoft#2478)
- Fix job v2 detail page rendering error
[microsoft#2480](microsoft#2480)
* REST server:
- code_dir_size report incorrect error message
[microsoft#2388](microsoft#2388)
- fix script entrypoint
[microsoft#2522](microsoft#2522)
- Fixed jq invocation errors with numeric taskRoles
[microsoft#2405](microsoft#2405)
* Hadoop:
- Remove duplicate diagnostics
[microsoft#2527](microsoft#2527)
* Alart manager:
- Fix alert label error
[microsoft#2521](microsoft#2521)
* Drivers:
- Add an optional configuration to skip ib drivers installation.
[microsoft#2514](microsoft#2514)
- Fix delete script of rollback nvidia runtime
[microsoft#2370](microsoft#2370)
- Fix driver parse [microsoft#2458](microsoft#2458)
* Storage plugin
- Add environment and handle corner cases
[microsoft#2525](microsoft#2525)
Known Issues
------------
N/A
Upgrading from Earlier Release
------------------------------
Please follow the [Upgrading to
v0.12](./docs/upgrade/upgrade_to_v0.12.md) for detailed instructions.
[WIP] 0.11.0 release notes (microsoft#2410) * add release note * refine layout * refine layout * refine layout * replace images * Revert "replace images" This reverts commit 361afac. * replace images * typo * Update RELEASE_NOTES.md * Update RELEASE_NOTES.md * Update RELEASE_NOTES.md * Update RELEASE_NOTES.md * fix * remove tail # * remove trailing space * add know issue * add retry link missing * add nfs wiki link
WIP: Draft a v0.10.1 release note (microsoft#2156) * Draft a v0.10.0 release note * version * fix H1 * Contrib packages * Reverse list in order to merge time * Fix * Update RELEASE_NOTES.md * Update RELEASE_NOTES.md * Update RELEASE_NOTES.md * Update RELEASE_NOTES.md * Update RELEASE_NOTES.md * Update RELEASE_NOTES.md * Update RELEASE_NOTES.md * Update RELEASE_NOTES.md * Update RELEASE_NOTES.md * Update RELEASE_NOTES.md * Update RELEASE_NOTES.md * Update RELEASE_NOTES.md * Move upgrading instructions to separately doc
v0.9.1: Feb. 2019 Release Bug Fixes: * REST Server: Fix admin permission, Closes microsoft#2172
[VS Code] Readme update (microsoft#2108) * readme update * update screenshot picture
PreviousNext