[Q] How to download a lot of histories? · Issue #7778 · wandb/wandb · GitHub


Open
mbacvanski opened this issue Jun 9, 2024 · 4 comments
Labels
a:sdk (Area: sdk related issues)
c:sdk:public-api (Component: All the issues that relate to wandb.Api with the exception of the public api of Artifacts)

Comments

mbacvanski commented:

I have several thousand runs in a project, and I'd like to download all their histories together. Manually looping over all the runs and querying their history with run.history(...) takes a very long time (hours), and it looks like the implementation of runs.histories(...) does the same thing.

If I query the runs in parallel, I get a requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.wandb.ai/graphql. Any suggestions on what to do?


Jason Davenport commented:
When dealing with a large number of runs, running into rate limits (HTTP 429 errors) is a common issue. Here are some strategies to handle this more efficiently:

  1. Batch Requests: Instead of querying histories sequentially, use batching to minimize the number of API calls.
  2. Retry Logic with Exponential Backoff: Implement a retry mechanism that waits for progressively longer periods before retrying a request.
  3. Throttle Requests: Implement a throttle mechanism to ensure you stay within the API rate limits.

Here’s an example implementation:

    import time

    import pandas as pd
    import wandb
    from requests.exceptions import HTTPError

    # Initialize the W&B API
    api = wandb.Api()


    # Fetch the history of a single run, with retries and exponential backoff
    def fetch_run_history(run, max_retries=5, backoff_factor=1):
        for attempt in range(max_retries):
            try:
                return run.history()
            except HTTPError as e:
                if e.response is not None and e.response.status_code == 429:
                    # Too many requests: wait before retrying
                    wait = backoff_factor * (2 ** attempt)
                    print(f"Rate limit exceeded. Retrying in {wait} seconds…")
                    time.sleep(wait)
                else:
                    raise
        raise RuntimeError("Max retries exceeded")


    # Fetch the histories of all runs in a project
    def fetch_all_histories(project_name, max_retries=5, backoff_factor=1, batch_size=100):
        # Materialize the paginated result so it can be measured and sliced
        runs = list(api.runs(project_name))
        histories = []
        for i in range(0, len(runs), batch_size):
            batch = runs[i:i + batch_size]
            for run in batch:
                try:
                    history = fetch_run_history(run, max_retries, backoff_factor)
                    histories.append((run.name, history))
                except Exception as e:
                    print(f"Failed to fetch history for run {run.name}: {e}")
        return histories


    # Fetch all histories for the project
    project_name = "your_project_name"
    histories = fetch_all_histories(project_name)

    # Combine the histories into a single DataFrame
    combined_histories = []
    for run_name, history in histories:
        history['run_name'] = run_name
        combined_histories.append(history)
    df_combined = pd.concat(combined_histories, ignore_index=True)

    # Save to CSV or handle as needed
    df_combined.to_csv("combined_histories.csv", index=False)

The fetch_run_history function includes a retry mechanism with exponential backoff: if a rate limit error (HTTP 429) occurs, it waits backoff_factor * 2^attempt seconds (1, 2, 4, 8, 16 with the defaults) before retrying. The fetch_all_histories function walks the runs in batches of batch_size and fetches each history sequentially, so only one request is in flight at a time. After fetching, each history is tagged with its run name and all of them are concatenated into a single DataFrame.
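The example above covers retries but not strategy 3 (throttling). Below is a minimal sketch of a client-side throttle that spaces calls at least `min_interval` seconds apart, even when several threads share it. `Throttle`, `min_interval`, and the injectable `clock`/`sleep` arguments are hypothetical names chosen for illustration and are not part of wandb. The right interval depends on W&B's actual rate limits, which this thread does not state.

```python
import threading
import time


class Throttle:
    """Space calls at least `min_interval` seconds apart across threads."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        """Block until this caller's slot, and return the delay that was applied."""
        with self._lock:
            now = self._clock()
            delay = max(0.0, self._next_allowed - now)
            # Reserve the following slot before releasing the lock, so that
            # concurrent callers queue up behind each other.
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay > 0:
            self._sleep(delay)
        return delay
```

One way to use it would be to create a single `Throttle` and call `throttle.wait()` immediately before each `run.history()` call.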

This approach should help you download run histories more efficiently without excessively hitting API rate limits.
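If a purely sequential loop is still too slow, a small thread pool combined with backoff is a possible middle ground between a serial loop and unbounded parallelism. The sketch below is generic: `fetch_one` is any callable (for example `lambda run: run.history()`), and `RateLimitError` is a hypothetical stand-in for whatever exception signals HTTP 429 in your setup. Whether a given `max_workers` stays under the server's limit is something to verify empirically.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 (Too Many Requests) error."""


def with_backoff(fn, arg, max_retries=5, backoff_factor=1.0, sleep=time.sleep):
    """Call fn(arg), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn(arg)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            sleep(backoff_factor * (2 ** attempt))


def fetch_all(items, fetch_one, max_workers=4, **retry_kwargs):
    """Fetch every item with at most `max_workers` calls in flight.

    Results are returned in the same order as `items`.
    """
    def task(item):
        return with_backoff(fetch_one, item, **retry_kwargs)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, items))
```

For instance, `fetch_all(runs, lambda run: run.history(), max_workers=4)` would keep at most four history requests open at once, assuming the 429 error has been mapped to `RateLimitError` inside the callable.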


Jason Davenport commented:
Hi there, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.


Jason Davenport commented:
Hi Internal, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!

kptkin added the a:sdk and c:sdk:public-api labels on Jun 27, 2024
DavidEnriqueNieves commented Nov 4, 2024

On a related note, is there a way to access multiple histories through the GraphQL API? Is that API even working at this time?

-David
