
Tags: fusionshen/pai

Tags

hived-v0.1.0

Initial HivedScheduler (microsoft#3283)

Authors:
@zhypku Scheduling Algorithm Core
@mzmssg Scheduling Algorithm Config Parser
@yqwang-ms User Experience/Interface, Scheduling Framework/Architecture, K8S Integration, and others

v0.14.0

[Web Portal] fix port list bug (microsoft#3240)

v0.13.0

Release v0.13.0

New Features
------------

* OpenPAI protocol:
  - Introduce [OpenPAI protocol](./docs/pai-job-protocol.yaml) and job submission v2
    ([microsoft#2260](microsoft#2260))
  - Add new job submission v2 plugin
    ([microsoft#2461](microsoft#2461))

* Web portal:
  - Add login page for guests
    ([microsoft#2544](microsoft#2544))
  - Add user home page
    ([microsoft#2614](microsoft#2614))
    - Job Status
    - My virtual clusters
    - Available GPU nodes (whole cluster)
    - My recent jobs
    ![home](docs/images/home.png)
  - Add new user management page
    ([microsoft#2726](microsoft#2726), [microsoft#2796](microsoft#2796))
  - User Management UX refactoring with new layout and themes
    ([microsoft#2726](microsoft#2726), [microsoft#2796](microsoft#2796))

Improvements
------------

* OpenPAI protocol:
  - Update example jobs in marketplace v2 for OpenPAI protocol
    ([microsoft#2827](microsoft#2827))

* Web portal:
  - Refine styles in job pages
    ([microsoft#2829](microsoft#2829), [microsoft#2856](microsoft#2856),
    [microsoft#2858](microsoft#2858), [microsoft#2862](microsoft#2862))
  - Refine alert message in job pages
    ([microsoft#2698](microsoft#2698))
  - Reduce the build bundle size to improve webportal performance
    ([microsoft#2715](microsoft#2715))

* Rest server:
  - Add job v1 config to v2 converter
    ([microsoft#2756](microsoft#2756))
  - Check default runtime before starting Docker
    ([microsoft#2754](microsoft#2754))

* Framework launcher:
  - Upgrade to Hadoop 2.9.0
    ([microsoft#2704](microsoft#2704))

* Job exporter:
  - Change triggering rule for exporter hangs
    ([microsoft#2766](microsoft#2766))
  - Add GPU temperature detection
    ([microsoft#2757](microsoft#2757))

* Watchdog:
  - Use `/api/v1/pods` to get all pods
    ([microsoft#2750](microsoft#2750))

* Deployment:
  - Allow user to use <kbd>Backspace</kbd> in `paictl` input
    ([microsoft#2769](microsoft#2769))
  - Disable InfiniBand driver installation by default
    ([microsoft#2595](microsoft#2595))

Documentation
-------------

* Refine document of VS Code extension
  ([microsoft#2707](microsoft#2707))
* Add document for PAI storage
  ([microsoft#2822](microsoft#2822))
* OpenPAI protocol specification document
  ([microsoft#2260](microsoft#2260))
* Job submission v2 plugin document
  ([microsoft#2820](microsoft#2820))
* Update RESTful API document for API v2
  ([microsoft#2816](microsoft#2816))
* Fix typos in document
  ([microsoft#2818](microsoft#2818))

Bug Fixes
---------

* Web portal:
  - Fix broken text when creating or editing a user
    ([microsoft#2849](microsoft#2849))
  - Fix token authentication bug
    ([microsoft#2843](microsoft#2843))
  - Fix retry count's margin-top
    ([microsoft#2845](microsoft#2845))
  - Fix job clone bug
    ([microsoft#2836](microsoft#2836))
  - Fix home page's responsive layout
    ([microsoft#2805](microsoft#2805))
  - Fix job list page filter bug
    ([microsoft#2787](microsoft#2787))
  - Fix home page failing to load the virtual cluster list
    ([microsoft#2774](microsoft#2774))

* Rest server:
  - Check duplicate job in submission v2
    ([microsoft#2837](microsoft#2837))

* Hadoop:
  - Increase YARN kill container timeout
    ([microsoft#2778](microsoft#2778))
  - Remove cross origin in resource manager
    ([microsoft#2758](microsoft#2758))
  - Fix Hadoop AI matching nvidia-smi regex
    ([microsoft#2681](microsoft#2681))

Known Issues
------------

* Deployment issues on NVIDIA DGX-2
  ([microsoft#2742](microsoft#2742))

v0.12.0

Release v0.12.0

New Features
------------

* Web portal:
  - Display error message in job detail page
    [microsoft#2456](microsoft#2456)
  - Import users from CSV file directly and show the final results
    [microsoft#2495](microsoft#2495)
  - Add TotalGpuCount and TotalTaskCount into job list
    [microsoft#2499](microsoft#2499)
* Deployment:
  - Add cluster version info
    [microsoft#2528](microsoft#2528)
  - Check that the nodes run Ubuntu 16.04
    [microsoft#2520](microsoft#2520)
  - Check duplicate hostname
    [microsoft#2403](microsoft#2403)

Improvements
------------

* Web portal:
  - Replace the suffix if a cloned job is resubmitted
    [microsoft#2451](microsoft#2451)
  - Refine view full log
    [microsoft#2431](microsoft#2431)
  - Job list: optimize filter
    [microsoft#2444](microsoft#2444)
  - Replace the url module with the querystring module
    [microsoft#1825](microsoft#1825)
* REST server:
  - Follow REST protocol in job create controller
    [microsoft#2481](microsoft#2481)
  - Add task state; Add job's retry details; Refine job config
    [microsoft#2306](microsoft#2306)
  - Remove error message
    [microsoft#2464](microsoft#2464)
* Framework Launcher:
  - Add more info into SummarizedFrameworkInfo
    [microsoft#2435](microsoft#2435)
* Alert manager:
  - Send resolved email and allow users to configure the repeat interval
    [microsoft#2438](microsoft#2438)
  - Monitor process memory consumption and alert for `omiagent` and
    `omsagent` [microsoft#2419](microsoft#2419)

Documentation
-------------

- Doc refactoring and update hello-world sample
  [microsoft#2445](microsoft#2445)
- Add Chinese translation
  [microsoft#2344](microsoft#2344)

Bug Fixes
---------

* Web portal:
  - Add validation when submitting a job by JSON
    [microsoft#2375](microsoft#2375)
  - Job List-filter UI fix
    [microsoft#2479](microsoft#2479)
  - Fix job detail "jobConfig is null" bug
    [microsoft#2500](microsoft#2500)
  - Fix job detail page's "retry link"
    [microsoft#2478](microsoft#2478)
  - Fix job v2 detail page rendering error
    [microsoft#2480](microsoft#2480)
* REST server:
  - Fix incorrect error message reported by code_dir_size
    [microsoft#2388](microsoft#2388)
  - Fix script entrypoint
    [microsoft#2522](microsoft#2522)
  - Fix jq invocation errors with numeric taskRoles
    [microsoft#2405](microsoft#2405)
* Hadoop:
  - Remove duplicate diagnostics
    [microsoft#2527](microsoft#2527)
* Alert manager:
  - Fix alert label error
    [microsoft#2521](microsoft#2521)
* Drivers:
  - Add an optional configuration to skip ib drivers installation.
    [microsoft#2514](microsoft#2514)
  - Fix delete script of rollback nvidia runtime
    [microsoft#2370](microsoft#2370)
  - Fix driver parse [microsoft#2458](microsoft#2458)
* Storage plugin:
  - Add environment and handle corner cases
    [microsoft#2525](microsoft#2525)

Known Issues
------------

N/A

Upgrading from Earlier Release
------------------------------

Please follow [Upgrading to
v0.12](./docs/upgrade/upgrade_to_v0.12.md) for detailed instructions.

v0.11.0

[WIP] 0.11.0 release notes (microsoft#2410)

* add release note

* refine layout

* refine layout

* refine layout

* replace images

* Revert "replace images"

This reverts commit 361afac.

* replace images

* typo

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* fix

* remove tail #

* remove trailing space

* add known issue

* add retry link missing

* add nfs wiki link

v0.10.1

WIP: Draft a v0.10.1 release note (microsoft#2156)

* Draft a v0.10.0 release note

* version

* fix H1

* Contrib packages

* Reverse list in order to merge time

* Fix

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Update RELEASE_NOTES.md

* Move upgrading instructions to separate doc

v0.10.0

v0.10.0: Feb. 2019 Internal Release

v0.9.1

v0.9.1: Feb. 2019 Release

Bug Fixes:

* REST Server: Fix admin permission, Closes microsoft#2172

v0.9.0

delete job-exporter known issue

pai-vscode-0.1.0

[VS Code] Readme update (microsoft#2108)

* readme update

* update screenshot picture