Centered instance model scales input image (not cropped image) leading to error #872
Comments
Can we just use the output stride of the centroid model to do a limited version of input scaling on the centered instance model (finite set of feasible stride values)? -> This would couple the centroid and centered instance models, though, which we might not want.

Problem Analysis

It seems that the…

Relevant Code
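The "finite set of feasible stride values" idea above can be sketched as a small filter: a candidate input scale is only usable if the scaled crop size remains divisible by the model's maximum stride. This is an illustrative sketch, not SLEAP's actual API; `feasible_scales` and its candidate list are hypothetical.

```python
# Hypothetical sketch (not SLEAP's API): enumerate the limited set of input
# scales that keep a crop's scaled dimensions divisible by the model's max
# stride, which is what ties the centered instance model to the centroid model.

def feasible_scales(crop_size: int, max_stride: int,
                    candidates=(1.0, 0.75, 0.5, 0.25)):
    """Return candidate scales whose scaled crop size is divisible by max_stride."""
    out = []
    for s in candidates:
        scaled = int(round(crop_size * s))
        if scaled % max_stride == 0:
            out.append(s)
    return out

print(feasible_scales(crop_size=192, max_stride=32))  # [1.0, 0.5]
```

With a 192 px crop and a max stride of 32, only scales of 1.0 and 0.5 survive (0.75 gives 144 px and 0.25 gives 48 px, neither divisible by 32), which illustrates why this approach would restrict, and couple, the two models' configurations.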
Follow-up problems
Traceback
I think I am now experiencing this issue too; however, I am not sure why it is coming up for me now during training when I have been training the same model for weeks. I have tried reinstalling SLEAP v1.2.9 and paying for Google Colab's higher compute capability (per discussion #871). Below is the dialogue, which appears similar to what @talmo posted, but the TF error comes up at the
...until I force-stopped the process.
Hi @amblypatty, Originally, we thought this error might be caused by plotting the visualizations (confidence maps overlaid on instances) during training; however, after tracking down the error, we found that the real problem is that our pipeline for the top-down model is not set up to handle input scaling on the second model (the centered instance model). It seems your… Unless I overlooked something, the logs seem to indicate that training has completed the 2nd epoch and is about to head into the 3rd epoch? Some clarifying questions: Are the logs truncated? What behavior are you experiencing? Thanks,
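For readers new to the thread, the "second model" above refers to the second stage of the top-down pipeline: the centroid model finds animals on a (possibly scaled-down) full frame, crops are taken from the full-resolution frame around each centroid, and the centered instance model then locates body parts within each crop. A toy sketch of that cropping step (all names here are illustrative stand-ins, not SLEAP's API):

```python
import numpy as np

# Toy sketch of the top-down pipeline's crop step: after the centroid model
# finds an animal and its centroid is mapped back to full-resolution
# coordinates, a fixed-size crop of the FULL frame is handed to the centered
# instance model. `crop_around` is a hypothetical helper for illustration.

def crop_around(frame: np.ndarray, center, size: int) -> np.ndarray:
    """Take a size x size crop of the full-resolution frame around a centroid."""
    y, x = int(center[0]), int(center[1])
    half = size // 2
    return frame[y - half:y + half, x - half:x + half]

frame = np.zeros((1024, 1280), dtype=np.uint8)   # full-resolution frame
centroid = (512, 640)                            # centroid in full-res coords
crop = crop_around(frame, centroid, 192)
print(crop.shape)  # (192, 192)
```

Because the crop is already small, the centered instance model normally expects an input scaling of 1.0 on it, which is why scaling that second stage is the problematic case in this issue.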
Hi @roomrys, Indeed, I terminated the process after seeing the
The previous result after training the top-down model with this error in each epoch (though the warning shows up earlier) was a predictions.pkg.slp file with 'mean scores' but no instances on the suggested frames when I run:
Where I get a complete prediction (with PredictCost() errors):
...and then merge the predictions in the SLEAP GUI. Additionally, there are no metrics for the centered_instance model. The image above shows, in the background, a suggested frame (313) that has a mean score but no predicted instance on the frame. The foreground shows the evaluation metrics window, where the most recent centered_instance model has empty cells for the evaluation metrics, while the previous centered_instance model shows the metrics (as expected). Thanks for your help,
I am still experiencing this issue, even in the newest 1.3.0a0 release. I have tried redoing this with a few different hyperparameters to try to recover the previously expected behavior, but I am still getting an error in the PredictCost() function. I am afraid I don't really know what it means or how to get around it. I would really appreciate some help on this one. Here is the latest output from my top-down training, first from the centroid model and then the centered instance model:
You will notice that there is still a metrics evaluation, but with PredictCost() errors. I then predict on the suggested frames:
But the problem is that the prediction file is empty, even though it has the same file size (57 kB) as previous prediction files that worked. When I merge the prediction file into my current SLEAP project, nothing happens. When I open the prediction file by itself, nothing shows up either, but that could be because there isn't a video file attached to it. Additionally, as in my previous comment, I am still unable to see a model metric evaluation. Please let me know if there is anything else I can provide to help solve this issue.
Hi @amblypatty, Could you share everything needed to do the training/inference (video, slp, models) and the 230312_144956_predicted_suggestions.slp to lmaree@salk.edu? Sorry, GitHub doesn't notify for reactions, but thanks for bumping this again - it had gotten buried... Let's get you unstuck. Thanks,
One of our labmates also seems to be experiencing this issue. I can send you an example if you want, Liezl, but they're currently running an older SLEAP version.
I think I am experiencing a similar issue. I am very new to this, but reading through this thread, it seems very similar to what happens for me. I have tried optimising the training parameters for my top-down multi-animal model, and when I tweak the input scaling (and the max stride) settings, in some cases I receive an error message in the GUI saying that the training failed. For my centroid model, keeping the input scaling at 0.5 and the max stride at 32 works, but when I increase the input scaling to 1.0 and the max stride to 64, I start seeing this issue. I will keep an eye on this issue; I just thought I would mention that I am experiencing it. Thank you also for an amazing tool - I really like SLEAP.
Hello, I am getting this issue as well, but at an input scaling of 0.5. I need to use 0.5 to get the model to run on my 8GB GPU with 1280x1024 video; by changing that, reducing filters from 64 to 48, and reducing the rate from 2 to 1.5, I was finally able to get the model to run. Error attached below. Is there anything I can do? Thanks for the support.

Input 0 of layer "model" is incompatible with the layer: expected shape=(None, 592, 592, 3), found shape=(1, 296, 296, 3)
Hi @smasri09, I know you said that you needed an input scaling of 0.5 to get the model to run on your 8GB GPU, but is there any way you can keep the top-down-id model at an input scaling of 1 and just adjust the centroid model's input scaling? Maybe even lowering it below 0.5? Similar to the centered instance model, the top-down-id model does not support adjusting the input scaling - it relies on the centroid model taking crops of the full image to save on memory, but then keeps full resolution in the crop to accurately locate smaller body parts. Thanks,
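A quick sanity check on the error messages in this thread: the "expected" and "found" shapes differ by exactly the input-scaling factor. Using the numbers from the error above (these specific values are just for illustration):

```python
# The shape mismatch in these errors is exactly the input-scaling factor:
# the model was built for the unscaled crop size, but the crops it receives
# have been resized by the configured input scaling.
crop_size = 592      # size in "expected shape=(None, 592, 592, 3)"
input_scaling = 0.5  # the input scaling set on the model

fed_size = int(crop_size * input_scaling)
print(fed_size)  # 296, matching "found shape=(1, 296, 296, 3)"
```

The same arithmetic explains the later report of `expected shape=(None, 32, 32, 1), found shape=(1, 8, 8, 1)` with input scaling 0.25, so seeing this factor-of-scale mismatch is a reliable symptom of this issue.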
Hi, I ran into the same issue training a LEAP-backbone top-down centered instance model with input scaling at 0.25 (it was just there by default, or maybe from an earlier run). Google brought me here, and switching input scaling to 1 fixed the problem. Here's the log in case it is helpful:

Output Log:
Using already trained model for centroid: /home/ammon/Documents/Scripts/FishTrack/sleap/models/240222_132820.centroid.n=20/training_config.json
Input 0 of layer "model" is incompatible with the layer: expected shape=(None, 32, 32, 1), found shape=(1, 8, 8, 1)
Call arguments received:
Hi! Just want to add that I'm running into this issue as well, with updated SLEAP from conda. I assume it is being worked on, but in the meantime I was curious what other params (other than batch size) to tweak to make centered instance training smaller for our GPU limits. Thanks! Luke

Traceback (most recent call last):
  File "/home/lmeyers/anaconda3/envs/sleap/bin/sleap-train", line 33, in <module>
    sys.exit(load_entry_point('sleap==1.2.8', 'console_scripts', 'sleap-train')())
  File "/home/lmeyers/anaconda3/envs/sleap/lib/python3.7/site-packages/sleap/nn/training.py", line 1981, in main
    trainer.train()
  File "/home/lmeyers/anaconda3/envs/sleap/lib/python3.7/site-packages/sleap/nn/training.py", line 927, in train
    verbose=2,
  File "/home/lmeyers/anaconda3/envs/sleap/lib/python3.7/site-packages/keras/engine/training.py", line 1230, in fit
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/lmeyers/anaconda3/envs/sleap/lib/python3.7/site-packages/keras/callbacks.py", line 413, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/home/lmeyers/anaconda3/envs/sleap/lib/python3.7/site-packages/sleap/nn/callbacks.py", line 280, in on_epoch_end
    figure = self.plot_fn()
  File "/home/lmeyers/anaconda3/envs/sleap/lib/python3.7/site-packages/sleap/nn/training.py", line 1332, in <lambda>
    viz_fn=lambda: visualize_example(next(training_viz_ds_iter)),
  File "/home/lmeyers/anaconda3/envs/sleap/lib/python3.7/site-packages/sleap/nn/training.py", line 1312, in visualize_example
    preds = find_peaks(tf.expand_dims(example["instance_image"], axis=0))
  File "/home/lmeyers/anaconda3/envs/sleap/lib/python3.7/site-packages/keras/engine/base_layer.py", line 1037, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/lmeyers/anaconda3/envs/sleap/lib/python3.7/site-packages/sleap/nn/inference.py", line 1868, in call
    out = self.keras_model(crops)
  File "/home/lmeyers/anaconda3/envs/sleap/lib/python3.7/site-packages/keras/engine/base_layer.py", line 1020, in __call__
    input_spec.assert_input_compatibility(self.input_spec, inputs, self.name)
  File "/home/lmeyers/anaconda3/envs/sleap/lib/python3.7/site-packages/keras/engine/input_spec.py", line 269, in assert_input_compatibility
    ', found shape=' + display_shape(x.shape))
ValueError: Input 0 is incompatible with layer model: expected shape=(None, 768, 768, 3), found shape=(1, 192, 192, 3)
terminate called without an active exception
train-script.sh: line 2: 32871 Aborted (core dumped) sleap-train centered_instance.json labels.v001.pkg.slp
Not sure, but I think I might be having an issue similar to this: #872 (comment). A single-animal set of training params leads to GPU memory errors if I enable visualizations but doesn't if I don't... Happy to post json files or errors, if desirable.
I think the problem is that we generally expect an input scaling of 1.0 for centered instance models since they're crops already. The training does handle this appropriately, but not the visualization for some reason (it's probably missing the input scaling transformer/preprocessing).
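The missing step described above (the visualization path skipping the input-scaling transformer that the training path applies) can be sketched as a small resize applied to the crop before it is fed to the model. This is a hedged illustration only; `resize_crop` is a hypothetical helper, and the real pipeline uses its own preprocessing transformers.

```python
import numpy as np

# Hypothetical sketch of the input-scaling preprocessing that the visualization
# path appears to skip: resize each (H, W, C) instance crop by the configured
# input scale so its shape matches what the Keras model was built for.
# Nearest-neighbor resampling via index lookup keeps this dependency-free.

def resize_crop(crop: np.ndarray, input_scaling: float) -> np.ndarray:
    """Nearest-neighbor resize of an (H, W, C) crop by a scale factor."""
    h, w = crop.shape[:2]
    new_h = int(round(h * input_scaling))
    new_w = int(round(w * input_scaling))
    rows = (np.arange(new_h) / input_scaling).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / input_scaling).astype(int).clip(0, w - 1)
    return crop[rows][:, cols]

crop = np.zeros((592, 592, 3), dtype=np.uint8)
print(resize_crop(crop, 0.5).shape)  # (296, 296, 3)
```

The point is only that the crop shape and the model's expected input shape must agree before `keras_model(crops)` is called; whichever side is resized, skipping the transform on one path but not the other produces exactly the `expected shape= ... found shape=` errors reported in this thread.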
In general, I think we can solve this by switching to using the InferenceModel classes to generate visualizations so that we're not doing custom inference routines inside of Trainer classes.

Here's the relevant error:
See issue below for more.
Discussed in #871
Originally posted by Shifulai July 29, 2022
Thanks for your attention.
When I try to train the top-down centered instance model, training does not work when the input scaling is not 1.0. Training stays at epoch 1, but the runtime keeps increasing.
Bug report below