This is a patch release containing the following changes to v3.8:
- Fixed correctness issue in reorder primitive with non-trivial strides on Intel CPUs (a762d32)
- Fixed runtime error in convolution weight gradient on Xe2 architecture-based Intel GPUs (a8fac73, c409ef9)
- Fixed performance regression in `bf16` convolution on Intel Datacenter GPU Max Series (98170d0, c6bae4a, c5edd53, bb1a591)
- Improved performance of `fp16` matmul with `fp8` compressed weights on Intel GPUs (58f3ec1, abff176, ffd7dd3, 3b1e855, 2e140de, 3429f79)
- Fixed runtime error in `fp16` pooling primitive on Xe2 architecture-based Intel GPUs (c0f6b6d)
- Improved performance of `fp16` matmul with `int4` weights and `32 < m <= 64` on Intel GPUs (2fa7072)
- Fixed correctness issues in `bf16` matmul with 3 or more dimensional tensors on processors with Intel AMX support (dd20965, ea1b4a1)
- Fixed performance regression in `fp16` or `bf16` matmul with transposed source and weight tensors on Intel Datacenter GPU Max Series (e45e1aa)
- Improved performance of `bf16` matmul with `int4` weights on Intel GPUs (7a15c23)
- Fixed runtime error in `fp16` SDPA subgraph with head size `512` on Intel Core Ultra (Series 2) processor integrated GPU (bde6985)