bitnet

bitnet is based on Microsoft's BitNet b1.58 2B4T, a LLaMA 3-style LLM with ternary weights (trained with a straight-through estimator), per-token 8-bit abs-max activation quantization, SubLN, a ReLU² FFN, RoPE / GQA attention, and no bias terms. It has 2.4B parameters and was trained on four trillion tokens.

tl;dr: no more full-precision weights in the linear layers. Just weights in {-1, 0, +1}.
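
Why "1.58-bit": each ternary weight can take one of three values, so it carries at most $\log_2 3 \approx 1.585$ bits of information, hence the name.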

Setup

chmod +x setup.sh
./setup.sh
source venv/bin/activate

Papers

Notes

Notes from HF model card

  • Parameters: 2,412,820,480 (2.4B)
  • Context Length: 4096 tokens
  • Weights: 1.58-bit with 8-bit activations (W1.58A8)
  • Model: Based on the LLaMA architecture
    • Modified with BitLinear layers
    • Uses Rotary Position Embeddings (RoPE).
    • Uses squared ReLU (ReLU²) activation in FFN layers
    • Employs sub-layer normalization (SubLN)
    • No bias terms in linear or normalization layers
  • Binarization acts as a form of regularization: by reducing precision, the model generalizes better
  • Tokenizer: LLaMA 3 Tokenizer (vocab size: 128,256)
  • STE: straight-through estimator used to approximate gradients through non-differentiable operations like round() and clip()
  • Quantization function: scale the weight matrix by its average absolute value, then round each value to the nearest integer in {-1, 0, +1} (see the sketch after this list)
  • Binarized LLMs' training loss curves follow an S shape
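
A minimal sketch of that quantization scheme (assuming PyTorch; the names below are illustrative, not the checkpoint's actual modeling code): absmean ternary weight quantization, per-token abs-max 8-bit activation quantization, and the straight-through estimator implemented with the usual detach trick.

import torch
import torch.nn.functional as F

def weight_quant(w: torch.Tensor) -> torch.Tensor:
    # Absmean quantization: scale by the mean absolute value, then round
    # each entry to the nearest value in {-1, 0, +1}.
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    return (w * scale).round().clamp(-1, 1) / scale

def activation_quant(x: torch.Tensor) -> torch.Tensor:
    # Per-token abs-max quantization to the signed 8-bit range.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127) / scale

class BitLinear(torch.nn.Linear):
    # Drop-in replacement for nn.Linear (construct with bias=False to match the
    # model card). The detach trick is the straight-through estimator: the forward
    # pass uses quantized values, the backward pass treats quantization as identity.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_q = x + (activation_quant(x) - x).detach()
        w_q = self.weight + (weight_quant(self.weight) - self.weight).detach()
        return F.linear(x_q, w_q)

e.g. BitLinear(2560, 6912, bias=False) for a gate/up projection. The attn_sub_norm / ffn_sub_norm weights in the dump below suggest SubLN is applied to the activations just before the o_proj and down_proj BitLinear layers; this sketch omits that.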

Model Architecture

config.json:

{
  "architectures": [
    "BitNetForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_bitnet.BitNetConfig",
    "AutoModelForCausalLM": "modeling_bitnet.BitNetForCausalLM"
  },
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "relu2",
  "hidden_size": 2560,
  "initializer_range": 0.02,
  "intermediate_size": 6912,
  "max_position_embeddings": 4096,
  "model_type": "bitnet",
  "rms_norm_eps": 1e-05,
  "num_attention_heads": 20,
  "num_hidden_layers": 30,
  "num_key_value_heads": 5,
  "rope_theta": 500000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "use_cache": true,
  "vocab_size": 128256,
  "quantization_config": {
    "quant_method": "bitnet",
    "linear_class": "autobitlinear",
    "quantization_mode": "online"
  }
}
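
The projection shapes in the layer dump below follow from this config: head_dim = hidden_size / num_attention_heads = 128, and with 5 KV heads the GQA key/value projections have 5 × 128 = 640 output rows. A quick sanity check (plain Python, constants copied from config.json above) that reproduces the 2,412,820,480-parameter count:

hidden_size = 2560
intermediate_size = 6912
num_attention_heads = 20
num_key_value_heads = 5
num_hidden_layers = 30
vocab_size = 128256

head_dim = hidden_size // num_attention_heads        # 128
kv_dim = num_key_value_heads * head_dim              # 640 -> k_proj/v_proj shapes are (640, 2560)

attn = 2 * hidden_size * hidden_size + 2 * kv_dim * hidden_size   # q_proj, o_proj + k_proj, v_proj
mlp = 3 * intermediate_size * hidden_size                          # gate_proj, up_proj, down_proj
norms = 3 * hidden_size + intermediate_size                        # input/post-attn LN + attn/ffn sub-norms

per_layer = attn + mlp + norms
total = vocab_size * hidden_size + num_hidden_layers * per_layer + hidden_size  # embeddings (tied lm_head) + blocks + final norm
print(f"{total:,}")  # 2,412,820,480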

Layer Info (2,412,820,480 parameters)

[Layer name]                                    [Weight shape]             [#Params] [Sample weights]
model.embed_tokens.weight                       torch.Size([128256, 2560]) 328335360 [-0.45703125, 0.90625, 0.69140625, 0.73046875, -0.171875]
model.layers.0.input_layernorm.weight           torch.Size([2560])         2560      [0.0174560546875, 0.0179443359375, 0.019287109375, 0.0274658203125, 0.01300048828125]
model.layers.0.mlp.down_proj.weight             torch.Size([2560, 6912])   17694720  [-1.1328125, -0.46484375, 6.40625, -1.5703125, 0.77734375]
model.layers.0.mlp.ffn_sub_norm.weight          torch.Size([6912])         6912      [1.1875, 1.1953125, 1.3046875, 0.69140625, 3.234375]
model.layers.0.mlp.gate_proj.weight             torch.Size([6912, 2560])   17694720  [0.7734375, 1.84375, 1.15625, -0.6640625, 0.77734375]
model.layers.0.mlp.up_proj.weight               torch.Size([6912, 2560])   17694720  [0.58984375, 2.546875, -1.625, -0.8984375, -5.1875]
model.layers.0.post_attention_layernorm.weight  torch.Size([2560])         2560      [1.34375, 1.3359375, 1.3203125, 1.5703125, 1.2265625]
model.layers.0.self_attn.attn_sub_norm.weight   torch.Size([2560])         2560      [0.0128173828125, 0.0166015625, 0.0152587890625, 0.01513671875, 0.01495361328125]
model.layers.0.self_attn.k_proj.weight          torch.Size([640, 2560])    1638400   [-0.90625, -0.890625, 2.953125, -4.8125, 0.89453125]
model.layers.0.self_attn.o_proj.weight          torch.Size([2560, 2560])   6553600   [-0.458984375, 0.482421875, -4.25, -3.015625, -2.671875]
model.layers.0.self_attn.q_proj.weight          torch.Size([2560, 2560])   6553600   [0.59765625, -0.1904296875, 0.45703125, -2.6875, -0.60546875]
model.layers.0.self_attn.v_proj.weight          torch.Size([640, 2560])    1638400   [7.15625, 1.171875, -0.54296875, 1.1640625, 0.95703125]
model.layers.1.input_layernorm.weight           torch.Size([2560])         2560      [0.016845703125, 0.01531982421875, 0.0172119140625, 0.01409912109375, 0.01611328125]
model.layers.1.mlp.down_proj.weight             torch.Size([2560, 6912])   17694720  [-1.1886598875514971e-34, -2.3773197751029943e-34, 0.63671875, 0.57421875, -4.152786442584977e-34]
model.layers.1.mlp.ffn_sub_norm.weight          torch.Size([6912])         6912      [1.0159280456642669e-32, 1.0785207688568521e-32, 2.28125, 0.4453125, 1.0592614694129797e-32]
model.layers.1.mlp.gate_proj.weight             torch.Size([6912, 2560])   17694720  [-6.229179663877466e-34, -3.2048677980818847e-34, -3.445609041130289e-34, -5.657419211637505e-34, 7.342607912976337e-34]
model.layers.1.mlp.up_proj.weight               torch.Size([6912, 2560])   17694720  [2.3321807920314184e-34, 7.748858760620519e-35, -9.930576275746685e-35, -4.739593222515463e-35, -3.385423730368188e-34]
model.layers.1.post_attention_layernorm.weight  torch.Size([2560])         2560      [1.3203125, 1.328125, 1.203125, 1.234375, 1.1875]
model.layers.1.self_attn.attn_sub_norm.weight   torch.Size([2560])         2560      [0.2060546875, 0.330078125, 0.318359375, 0.2890625, 0.291015625]
model.layers.1.self_attn.k_proj.weight          torch.Size([640, 2560])    1638400   [0.66796875, -4.90625, -0.67578125, -0.0157470703125, 0.6875]
model.layers.1.self_attn.o_proj.weight          torch.Size([2560, 2560])   6553600   [-0.796875, -0.328125, -4.0625, 0.5078125, 3.734375]
model.layers.1.self_attn.q_proj.weight          torch.Size([2560, 2560])   6553600   [-0.1669921875, -0.416015625, -0.1689453125, 0.4140625, 0.40625]
model.layers.1.self_attn.v_proj.weight          torch.Size([640, 2560])    1638400   [1.3046875, -0.006378173828125, 0.076171875, 1.125, 1.125]
model.layers.2.input_layernorm.weight           torch.Size([2560])         2560      [0.0205078125, 0.0184326171875, 0.0166015625, 0.01904296875, 0.0185546875]
model.layers.2.mlp.down_proj.weight             torch.Size([2560, 6912])   17694720  [2.421875, 9.103028252767794e-35, 0.2392578125, -3.325238419606087e-34, 4.78125]
model.layers.2.mlp.ffn_sub_norm.weight          torch.Size([6912])         6912      [0.75390625, 1.1651876163542777e-32, 0.3828125, 1.1700024412152458e-32, 0.8828125]
model.layers.2.mlp.gate_proj.weight             torch.Size([6912, 2560])   17694720  [2.421875, 0.56640625, -0.640625, 0.5546875, -0.255859375]
model.layers.2.mlp.up_proj.weight               torch.Size([6912, 2560])   17694720  [2.1875, 0.51171875, -0.82421875, -0.470703125, 0.50390625]
model.layers.2.post_attention_layernorm.weight  torch.Size([2560])         2560      [1.2890625, 1.296875, 1.140625, 1.2734375, 1.1796875]
model.layers.2.self_attn.attn_sub_norm.weight   torch.Size([2560])         2560      [0.45703125, 0.4921875, 0.45703125, 0.419921875, 0.5]
model.layers.2.self_attn.k_proj.weight          torch.Size([640, 2560])    1638400   [-0.625, -0.189453125, -0.75390625, 2.78125, -2.234375]
model.layers.2.self_attn.o_proj.weight          torch.Size([2560, 2560])   6553600   [2.609375, -4.0, -0.7734375, -0.96484375, 2.25]
model.layers.2.self_attn.q_proj.weight          torch.Size([2560, 2560])   6553600   [0.5, 0.50390625, 0.63671875, 0.423828125, -0.578125]
model.layers.2.self_attn.v_proj.weight          torch.Size([640, 2560])    1638400   [1.0234375, 1.6875, -0.94921875, -0.76953125, -6.5]
model.layers.3.input_layernorm.weight           torch.Size([2560])         2560      [0.021484375, 0.0194091796875, 0.0205078125, 0.0181884765625, 0.018798828125]
model.layers.3.mlp.down_proj.weight             torch.Size([2560, 6912])   17694720  [1.0078125, -3.46875, -0.77734375, 5.34375, 5.4072740137825225e-37]
model.layers.3.mlp.ffn_sub_norm.weight          torch.Size([6912])         6912      [0.65625, 0.890625, 0.921875, 0.921875, 1.1459283169104053e-32]
model.layers.3.mlp.gate_proj.weight             torch.Size([6912, 2560])   17694720  [4.5, -7.9375, 0.875, 4.46875, 0.921875]
model.layers.3.mlp.up_proj.weight               torch.Size([6912, 2560])   17694720  [1.8671875, -0.98046875, -1.6953125, 2.328125, 1.296875]
model.layers.3.post_attention_layernorm.weight  torch.Size([2560])         2560      [1.3046875, 1.359375, 1.1796875, 1.3125, 1.21875]
model.layers.3.self_attn.attn_sub_norm.weight   torch.Size([2560])         2560      [0.4140625, 0.31640625, 0.39453125, 0.38671875, 0.419921875]
model.layers.3.self_attn.k_proj.weight          torch.Size([640, 2560])    1638400   [0.59765625, 0.002410888671875, 0.1875, 0.765625, 0.546875]
model.layers.3.self_attn.o_proj.weight          torch.Size([2560, 2560])   6553600   [-1.265625, 0.765625, -0.9765625, 3.34375, -5.5]
model.layers.3.self_attn.q_proj.weight          torch.Size([2560, 2560])   6553600   [0.7265625, 0.515625, -5.5, -0.4765625, 0.486328125]
model.layers.3.self_attn.v_proj.weight          torch.Size([640, 2560])    1638400   [1.390625, 4.8125, -1.25, 1.3515625, -5.34375]
model.layers.4.input_layernorm.weight           torch.Size([2560])         2560      [0.0186767578125, 0.0185546875, 0.0177001953125, 0.019775390625, 0.0162353515625]
model.layers.4.mlp.down_proj.weight             torch.Size([2560, 6912])   17694720  [1.0703125, -1.078125, 2.90625, -0.84765625, -0.9453125]
model.layers.4.mlp.ffn_sub_norm.weight          torch.Size([6912])         6912      [0.7421875, 0.2314453125, 0.5390625, 0.8984375, 1.0390625]
model.layers.4.mlp.gate_proj.weight             torch.Size([6912, 2560])   17694720  [-0.1650390625, 1.046875, -2.90625, -1.0546875, -0.353515625]
model.layers.4.mlp.up_proj.weight               torch.Size([6912, 2560])   17694720  [2.484375, 0.75, -0.9765625, -0.294921875, -4.25]
model.layers.4.post_attention_layernorm.weight  torch.Size([2560])         2560      [1.3125, 1.3671875, 1.2109375, 1.3046875, 1.2109375]
model.layers.4.self_attn.attn_sub_norm.weight   torch.Size([2560])         2560      [0.322265625, 0.302734375, 0.357421875, 0.3984375, 0.26953125]
model.layers.4.self_attn.k_proj.weight          torch.Size([640, 2560])    1638400   [0.90625, 1.0390625, 0.7421875, 0.5703125, -1.6953125]
model.layers.4.self_attn.o_proj.weight          torch.Size([2560, 2560])   6553600   [-2.5625, 1.4140625, 1.0625, -1.0703125, -1.265625]
model.layers.4.self_attn.q_proj.weight          torch.Size([2560, 2560])   6553600   [-0.5, -0.1416015625, -0.01458740234375, 0.46484375, 0.47265625]
model.layers.4.self_attn.v_proj.weight          torch.Size([640, 2560])    1638400   [1.515625, -1.53125, -2.0, 1.6171875, -1.8046875]
model.layers.5.input_layernorm.weight           torch.Size([2560])         2560      [0.0155029296875, 0.015869140625, 0.01611328125, 0.0145263671875, 0.01507568359375]
model.layers.5.mlp.down_proj.weight             torch.Size([2560, 6912])   17694720  [0.057373046875, -7.28125, 1.921875, 3.765625, -0.8125]
model.layers.5.mlp.ffn_sub_norm.weight          torch.Size([6912])         6912      [1.84375, 0.94921875, 0.70703125, 1.046875, 1.078125]
model.layers.5.mlp.gate_proj.weight             torch.Size([6912, 2560])   17694720  [0.90625, 1.6171875, 3.546875, -3.640625, 1.140625]
model.layers.5.mlp.up_proj.weight               torch.Size([6912, 2560])   17694720  [2.96875, 1.0078125, -0.11767578125, -0.67578125, 3.875]
model.layers.5.post_attention_layernorm.weight  torch.Size([2560])         2560      [1.34375, 1.3984375, 1.2265625, 1.34375, 1.2421875]
model.layers.5.self_attn.attn_sub_norm.weight   torch.Size([2560])         2560      [0.625, 0.5546875, 0.546875, 0.64453125, 0.5546875]
model.layers.5.self_attn.k_proj.weight          torch.Size([640, 2560])    1638400   [-0.97265625, -6.75, -0.80859375, -0.88671875, 0.97265625]
model.layers.5.self_attn.o_proj.weight          torch.Size([2560, 2560])   6553600   [-2.515625, -1.046875, -4.34375, -1.0859375, 1.0625]
model.layers.5.self_attn.q_proj.weight          torch.Size([2560, 2560])   6553600   [0.72265625, 0.6328125, -0.4609375, -0.54296875, -0.6484375]
model.layers.5.self_attn.v_proj.weight          torch.Size([640, 2560])    1638400   [0.1767578125, 1.3046875, -7.375, 4.46875, -4.28125]
model.layers.6.input_layernorm.weight           torch.Size([2560])         2560      [0.017822265625, 0.0159912109375, 0.0184326171875, 0.0179443359375, 0.016357421875]
model.layers.6.mlp.down_proj.weight             torch.Size([2560, 6912])   17694720  [-1.1640625, 0.025634765625, 1.140625, -3.015625, 0.8359375]
model.layers.6.mlp.ffn_sub_norm.weight          torch.Size([6912])         6912      [0.5625, 0.271484375, 1.640625, 0.1826171875, 0.53125]
model.layers.6.mlp.gate_proj.weight             torch.Size([6912, 2560])   17694720  [-4.28125, 1.0390625, -0.765625, 1.3984375, -6.78125]
model.layers.6.mlp.up_proj.weight               torch.Size([6912, 2560])   17694720  [-0.4296875, -0.91015625, -0.3046875, 0.5859375, 0.267578125]
model.layers.6.post_attention_layernorm.weight  torch.Size([2560])         2560      [1.3515625, 1.421875, 1.296875, 1.359375, 1.2421875]
model.layers.6.self_attn.attn_sub_norm.weight   torch.Size([2560])         2560      [0.64453125, 0.6484375, 0.58203125, 0.64453125, 0.60546875]
model.layers.6.self_attn.k_proj.weight          torch.Size([640, 2560])    1638400   [-1.09375, 0.478515625, -1.0625, 0.283203125, -1.078125]
model.layers.6.self_attn.o_proj.weight          torch.Size([2560, 2560])   6553600   [-1.3515625, 0.51171875, 1.171875, 0.65625, 1.1796875]
model.layers.6.self_attn.q_proj.weight          torch.Size([2560, 2560])   6553600   [0.5546875, 2.0625, 0.67578125, 0.80859375, 0.671875]
model.layers.6.self_attn.v_proj.weight          torch.Size([640, 2560])    1638400   [1.734375, 1.234375, -1.71875, -0.470703125, -1.7421875]
model.layers.7.input_layernorm.weight           torch.Size([2560])         2560      [0.0166015625, 0.0157470703125, 0.0150146484375, 0.015869140625, 0.01495361328125]
model.layers.7.mlp.down_proj.weight             torch.Size([2560, 6912])   17694720  [1.1953125, 1.1875, -3.109375, 0.2421875, -0.138671875]
model.layers.7.mlp.ffn_sub_norm.weight          torch.Size([6912])         6912      [0.734375, 1.6015625, 1.4609375, 0.98046875, 1.0390625]
model.layers.7.mlp.gate_proj.weight             torch.Size([6912, 2560])   17694720  [-2.046875, 0.98046875, 1.015625, 0.9609375, 0.11669921875]
model.layers.7.mlp.up_proj.weight               torch.Size([6912, 2560])   17694720  [0.765625, 0.875, 1.0703125, -1.296875, -2.5]
model.layers.7.post_attention_layernorm.weight  torch.Size([2560])         2560      [1.359375, 1.3984375, 1.3203125, 1.359375, 1.25]
model.layers.7.self_attn.attn_sub_norm.weight   torch.Size([2560])         2560      [0.7109375, 0.77734375, 0.8359375, 0.80078125, 0.828125]
model.layers.7.self_attn.k_proj.weight          torch.Size([640, 2560])    1638400   [-0.412109375, -1.125, -1.140625, -0.86328125, 0.5546875]
model.layers.7.self_attn.o_proj.weight          torch.Size([2560, 2560])   6553600   [-3.796875, -0.85546875, -12.4375, -1.125, 2.953125]
model.layers.7.self_attn.q_proj.weight          torch.Size([2560, 2560])   6553600   [-0.80859375, 0.396484375, -0.703125, -0.671875, -0.265625]
model.layers.7.self_attn.v_proj.weight          torch.Size([640, 2560])    1638400   [-2.921875, -0.9609375, 1.6171875, 0.59375, -1.6015625]
model.layers.8.input_layernorm.weight           torch.Size([2560])         2560      [0.0169677734375, 0.01708984375, 0.0166015625, 0.0167236328125, 0.01513671875]
model.layers.8.mlp.down_proj.weight             torch.Size([2560, 6912])   17694720  [6.8125, 1.2734375, -1.171875, 5.0, -1.3125]
model.layers.8.mlp.ffn_sub_norm.weight          torch.Size([6912])         6912      [1.0859375, 0.51171875, 0.90234375, 0.5078125, 0.95703125]
model.layers.8.mlp.gate_proj.weight             torch.Size([6912, 2560])   17694720  [-2.140625, 1.1328125, -0.65625, -0.1025390625, 0.6875]
model.layers.8.mlp.up_proj.weight               torch.Size([6912, 2560])   17694720  [0.9453125, -3.890625, 0.84765625, -0.94921875, -3.1875]
model.layers.8.post_attention_layernorm.weight  torch.Size([2560])         2560      [1.359375, 1.375, 1.3046875, 1.359375, 1.234375]
model.layers.8.self_attn.attn_sub_norm.weight   torch.Size([2560])         2560      [0.90625, 0.84375, 0.9296875, 0.87890625, 0.89453125]
model.layers.8.self_attn.k_proj.weight          torch.Size([640, 2560])    1638400   [-1.140625, 1.0703125, -0.11865234375, 1.7265625, 1.140625]
model.layers.8.self_attn.o_proj.weight          torch.Size([2560, 2560])   6553600   [-1.09375, -1.3203125, 0.439453125, -1.3125, -3.703125]
model.layers.8.self_attn.q_proj.weight          torch.Size([2560, 2560])   6553600   [0.50390625, 0.78515625, 0.671875, 0.57421875, 0.7265625]
model.layers.8.self_attn.v_proj.weight          torch.Size([640, 2560])    1638400   [2.5, 7.75, 13.125, 7.3125, -8.375]
model.layers.9.input_layernorm.weight           torch.Size([2560])         2560      [0.01300048828125, 0.01470947265625, 0.01263427734375, 0.0152587890625, 0.0123291015625]
model.layers.9.mlp.down_proj.weight             torch.Size([2560, 6912])   17694720  [1.203125, -3.46875, -1.3125, -1.6796875, -1.3125]
model.layers.9.mlp.ffn_sub_norm.weight          torch.Size([6912])         6912      [1.046875, 3.8125, 2.546875, 0.83984375, 1.9609375]
model.layers.9.mlp.gate_proj.weight             torch.Size([6912, 2560])   17694720  [-0.69921875, 1.09375, 8.0, 0.92578125, -2.0]
model.layers.9.mlp.up_proj.weight               torch.Size([6912, 2560])   17694720  [6.8125, 0.95703125, -1.6328125, 2.25, 1.078125]
model.layers.9.post_attention_layernorm.weight  torch.Size([2560])         2560      [1.296875, 1.2578125, 1.28125, 1.3203125, 1.1875]
model.layers.9.self_attn.attn_sub_norm.weight   torch.Size([2560])         2560      [0.8125, 0.83203125, 0.94140625, 0.84375, 0.8125]
model.layers.9.self_attn.k_proj.weight          torch.Size([640, 2560])    1638400   [-0.1396484375, -1.0234375, -1.1640625, -1.171875, 1.1015625]
model.layers.9.self_attn.o_proj.weight          torch.Size([2560, 2560])   6553600   [0.96484375, 2.375, -6.375, -0.93359375, 10.25]
model.layers.9.self_attn.q_proj.weight          torch.Size([2560, 2560])   6553600   [-0.7421875, 4.46875, 0.66015625, 2.53125, -0.5625]
model.layers.9.self_attn.v_proj.weight          torch.Size([640, 2560])    1638400   [10.125, 1.8125, 1.8125, 6.90625, 9.25]
model.layers.10.input_layernorm.weight          torch.Size([2560])         2560      [0.0172119140625, 0.01556396484375, 0.013916015625, 0.015869140625, 0.013427734375]
model.layers.10.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [-3.9375, 10.875, -0.31640625, -0.89453125, -1.1328125]
model.layers.10.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [3.265625, 1.3828125, 0.7578125, 1.3515625, 1.171875]
model.layers.10.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-0.73828125, -0.78515625, -0.283203125, -6.09375, 1.3125]
model.layers.10.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [-5.34375, 0.7421875, 0.91015625, -2.25, 0.98046875]
model.layers.10.post_attention_layernorm.weight torch.Size([2560])         2560      [1.296875, 1.265625, 1.28125, 1.328125, 1.21875]
model.layers.10.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [0.91015625, 0.92578125, 0.921875, 0.890625, 0.875]
model.layers.10.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [0.625, -0.609375, 1.25, 0.0791015625, 1.265625]
model.layers.10.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [-0.369140625, 1.3125, -6.78125, -1.28125, 7.8125]
model.layers.10.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [0.396484375, -0.76953125, 0.1005859375, 0.35546875, 0.78125]
model.layers.10.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [1.90625, -1.9140625, -1.9140625, 1.921875, -3.84375]
model.layers.11.input_layernorm.weight          torch.Size([2560])         2560      [0.01397705078125, 0.0157470703125, 0.0152587890625, 0.0172119140625, 0.0130615234375]
model.layers.11.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [-0.8515625, -5.875, 1.2421875, 1.234375, -1.1484375]
model.layers.11.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [1.5234375, 0.94140625, 1.71875, 0.66015625, 1.609375]
model.layers.11.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-1.03125, -3.796875, -1.0078125, -4.0, -1.3359375]
model.layers.11.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [3.421875, 6.40625, -1.015625, -1.1875, -1.0390625]
model.layers.11.post_attention_layernorm.weight torch.Size([2560])         2560      [1.3125, 1.3359375, 1.3125, 1.390625, 1.2421875]
model.layers.11.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [2.03125, 2.46875, 2.171875, 2.3125, 2.265625]
model.layers.11.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [-1.1640625, -1.265625, -1.2265625, -1.265625, 1.296875]
model.layers.11.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [6.5, 1.5625, -1.359375, -1.375, -1.5078125]
model.layers.11.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [0.5234375, 0.259765625, 0.75390625, -0.6796875, -0.61328125]
model.layers.11.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [2.53125, -0.0927734375, 0.482421875, -3.890625, -1.9921875]
model.layers.12.input_layernorm.weight          torch.Size([2560])         2560      [0.01373291015625, 0.01373291015625, 0.01422119140625, 0.0137939453125, 0.01220703125]
model.layers.12.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [0.875, -0.5, 1.296875, 9.375, -2.46875]
model.layers.12.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [2.265625, 1.6796875, 1.34375, 1.8359375, 0.74609375]
model.layers.12.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-0.94140625, -2.1875, 2.34375, -1.0390625, 3.46875]
model.layers.12.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [0.75, -0.96875, 1.28125, -0.80078125, -1.015625]
model.layers.12.post_attention_layernorm.weight torch.Size([2560])         2560      [1.3125, 1.34375, 1.28125, 1.40625, 1.203125]
model.layers.12.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [1.2734375, 1.375, 1.3984375, 1.3125, 1.3515625]
model.layers.12.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [1.0546875, -0.84765625, 0.408203125, -1.3828125, -1.1953125]
model.layers.12.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [-1.765625, 7.0, 0.87109375, 1.5703125, 8.75]
model.layers.12.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [-0.66015625, -0.828125, -0.6328125, 0.95703125, -0.91015625]
model.layers.12.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [-0.1083984375, 0.51171875, -1.9453125, -2.734375, -2.21875]
model.layers.13.input_layernorm.weight          torch.Size([2560])         2560      [0.01336669921875, 0.0133056640625, 0.01318359375, 0.0133056640625, 0.01214599609375]
model.layers.13.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [-5.9375, 0.98046875, -1.453125, 4.375, -1.21875]
model.layers.13.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [2.671875, 2.21875, 2.390625, 1.203125, 2.734375]
model.layers.13.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-1.421875, -1.046875, -1.1328125, 3.515625, -3.03125]
model.layers.13.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [0.734375, -2.921875, 0.96875, -1.3515625, 1.03125]
model.layers.13.post_attention_layernorm.weight torch.Size([2560])         2560      [1.234375, 1.2421875, 1.28125, 1.3515625, 1.171875]
model.layers.13.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [1.53125, 1.484375, 1.515625, 1.3828125, 1.5234375]
model.layers.13.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [0.059326171875, 1.265625, -1.25, 1.2421875, -0.39453125]
model.layers.13.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [0.462890625, 1.6875, 16.25, -1.75, -4.4375]
model.layers.13.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [-0.298828125, 0.8125, 0.49609375, 0.76953125, -0.8359375]
model.layers.13.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [-0.578125, 2.078125, -1.9296875, 6.09375, 2.09375]
model.layers.14.input_layernorm.weight          torch.Size([2560])         2560      [0.01336669921875, 0.0135498046875, 0.01422119140625, 0.01458740234375, 0.01324462890625]
model.layers.14.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [1.1328125, 1.25, 1.09375, 10.75, 0.32421875]
model.layers.14.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [2.890625, 2.125, 1.6015625, 2.8125, 2.390625]
model.layers.14.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-0.1875, 1.2734375, 0.71484375, 0.96875, -1.140625]
model.layers.14.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [3.625, 1.203125, 3.34375, -0.76171875, -1.515625]
model.layers.14.post_attention_layernorm.weight torch.Size([2560])         2560      [1.2578125, 1.265625, 1.2578125, 1.3828125, 1.1484375]
model.layers.14.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [2.453125, 2.21875, 2.171875, 2.25, 2.375]
model.layers.14.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [-0.1650390625, 1.5, -1.203125, 0.30078125, 1.4140625]
model.layers.14.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [1.296875, 1.25, 8.9375, -4.875, -5.25]
model.layers.14.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [0.9140625, -0.09228515625, -0.6015625, -0.42578125, 0.400390625]
model.layers.14.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [-2.09375, -3.875, -7.25, 4.28125, -18.0]
model.layers.15.input_layernorm.weight          torch.Size([2560])         2560      [0.01214599609375, 0.0157470703125, 0.01214599609375, 0.012939453125, 0.01153564453125]
model.layers.15.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [-3.28125, -1.3046875, -1.4921875, 2.15625, 4.34375]
model.layers.15.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [3.46875, 1.6015625, 1.4921875, 3.140625, 1.4609375]
model.layers.15.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-1.078125, -1.078125, -7.21875, 9.1875, -0.31640625]
model.layers.15.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [-1.046875, 1.0703125, 7.4375, 1.03125, 0.62109375]
model.layers.15.post_attention_layernorm.weight torch.Size([2560])         2560      [1.359375, 1.421875, 1.3828125, 1.484375, 1.296875]
model.layers.15.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [1.765625, 1.9375, 1.609375, 2.0625, 2.046875]
model.layers.15.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [1.2109375, -0.7578125, -1.359375, 1.3671875, -1.171875]
model.layers.15.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [1.421875, 3.640625, 3.625, -1.4140625, -1.3984375]
model.layers.15.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [-0.83203125, 0.1923828125, -0.83984375, -0.5390625, -0.84765625]
model.layers.15.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [-5.65625, -0.79296875, 8.375, -2.25, -2.25]
model.layers.16.input_layernorm.weight          torch.Size([2560])         2560      [0.0120849609375, 0.01190185546875, 0.01080322265625, 0.0128173828125, 0.010009765625]
model.layers.16.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [-1.3046875, 12.0, 1.3203125, -3.5625, 5.34375]
model.layers.16.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [2.34375, 3.109375, 1.9921875, 1.90625, 4.8125]
model.layers.16.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [5.65625, -1.109375, 0.62109375, -0.80859375, -5.3125]
model.layers.16.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [-6.90625, -1.03125, 7.1875, -0.90234375, 0.7890625]
model.layers.16.post_attention_layernorm.weight torch.Size([2560])         2560      [1.3671875, 1.4453125, 1.3671875, 1.453125, 1.296875]
model.layers.16.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [1.921875, 1.9296875, 1.9453125, 1.9453125, 2.0]
model.layers.16.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [-1.5390625, -1.2578125, 1.5625, 1.515625, -0.4765625]
model.layers.16.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [1.390625, 5.3125, -1.40625, -3.296875, -1.21875]
model.layers.16.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [-2.40625, -1.0078125, -0.921875, -0.455078125, -1.0234375]
model.layers.16.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [3.484375, -2.140625, 3.328125, 10.4375, -4.5]
model.layers.17.input_layernorm.weight          torch.Size([2560])         2560      [0.01214599609375, 0.0126953125, 0.01275634765625, 0.0125732421875, 0.01263427734375]
model.layers.17.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [-0.8359375, -12.875, -1.9296875, 6.34375, 1.34375]
model.layers.17.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [3.234375, 3.140625, 2.671875, 1.8515625, 2.171875]
model.layers.17.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-1.21875, 0.50390625, 0.8671875, -1.109375, 1.203125]
model.layers.17.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [2.484375, -1.6875, -1.0546875, -0.69140625, -3.578125]
model.layers.17.post_attention_layernorm.weight torch.Size([2560])         2560      [1.3828125, 1.4765625, 1.3984375, 1.484375, 1.3046875]
model.layers.17.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [2.625, 2.625, 2.46875, 2.59375, 2.65625]
model.layers.17.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [1.40625, -7.96875, -1.703125, -1.421875, 1.2109375]
model.layers.17.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [-2.234375, 1.609375, 4.3125, -3.484375, -1.5]
model.layers.17.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [0.6640625, -0.9765625, 0.76953125, 0.890625, 0.9765625]
model.layers.17.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [-6.375, 2.140625, 2.296875, -14.125, 3.375]
model.layers.18.input_layernorm.weight          torch.Size([2560])         2560      [0.01336669921875, 0.01531982421875, 0.01226806640625, 0.0147705078125, 0.0135498046875]
model.layers.18.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [0.4453125, -12.75, -3.640625, 1.578125, 3.640625]
model.layers.18.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [1.7578125, 4.84375, 4.40625, 3.890625, 3.71875]
model.layers.18.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-1.3515625, -2.296875, -1.296875, 1.546875, 1.1484375]
model.layers.18.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [1.375, 0.9375, 2.328125, 0.89453125, 1.09375]
model.layers.18.post_attention_layernorm.weight torch.Size([2560])         2560      [1.4140625, 1.5234375, 1.4609375, 1.546875, 1.375]
model.layers.18.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [5.03125, 4.96875, 5.15625, 5.28125, 4.0625]
model.layers.18.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [-0.031982421875, -3.390625, 1.109375, -1.2578125, -1.28125]
model.layers.18.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [1.5234375, -5.75, 1.0234375, -1.3203125, 1.3125]
model.layers.18.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [0.328125, -2.296875, -0.291015625, -0.0400390625, -0.71484375]
model.layers.18.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [-7.46875, 0.52734375, 2.140625, -3.21875, 1.484375]
model.layers.19.input_layernorm.weight          torch.Size([2560])         2560      [0.013916015625, 0.01263427734375, 0.0146484375, 0.015380859375, 0.01409912109375]
model.layers.19.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [0.21875, -1.1328125, -9.75, -8.625, 3.671875]
model.layers.19.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [2.90625, 3.109375, 4.65625, 6.25, 5.375]
model.layers.19.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-2.609375, 1.296875, 1.1875, 7.0, -10.0]
model.layers.19.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [0.98046875, -1.03125, -6.875, 1.2578125, -7.8125]
model.layers.19.post_attention_layernorm.weight torch.Size([2560])         2560      [1.421875, 1.53125, 1.4296875, 1.5390625, 1.3515625]
model.layers.19.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [3.234375, 3.296875, 3.125, 3.34375, 3.3125]
model.layers.19.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [1.1875, -2.0625, -1.265625, -1.15625, 1.1953125]
model.layers.19.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [-3.078125, 3.75, -1.375, 0.5390625, -4.84375]
model.layers.19.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [-0.8046875, -0.9609375, -0.82421875, -0.462890625, 0.8125]
model.layers.19.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [-2.21875, -0.078125, 6.53125, -9.5625, 14.5]
model.layers.20.input_layernorm.weight          torch.Size([2560])         2560      [0.01287841796875, 0.01409912109375, 0.013916015625, 0.01422119140625, 0.01226806640625]
model.layers.20.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [-1.2109375, -1.09375, -1.71875, -0.93359375, 1.25]
model.layers.20.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [3.9375, 5.4375, 6.75, 1.9375, 4.96875]
model.layers.20.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-0.94140625, 1.1640625, 1.15625, -1.15625, -1.265625]
model.layers.20.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [-1.3359375, 2.65625, -0.65625, 1.59375, -2.0625]
model.layers.20.post_attention_layernorm.weight torch.Size([2560])         2560      [1.4140625, 1.484375, 1.4609375, 1.546875, 1.3671875]
model.layers.20.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [4.84375, 4.53125, 4.625, 4.34375, 4.34375]
model.layers.20.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [-1.6640625, -1.4609375, 0.63671875, -1.4921875, 1.609375]
model.layers.20.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [-1.0078125, -4.09375, 2.734375, 6.6875, -1.234375]
model.layers.20.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [-0.9140625, -0.62890625, -0.91796875, -0.8359375, -0.97265625]
model.layers.20.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [9.6875, -4.09375, 2.109375, -3.640625, -1.9765625]
model.layers.21.input_layernorm.weight          torch.Size([2560])         2560      [0.012939453125, 0.01312255859375, 0.01312255859375, 0.0137939453125, 0.01312255859375]
model.layers.21.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [7.03125, -13.4375, -1.4140625, -2.21875, -3.234375]
model.layers.21.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [3.78125, 8.625, 3.703125, 5.21875, 6.96875]
model.layers.21.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-0.85546875, -2.375, -0.296875, 4.65625, -1.203125]
model.layers.21.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [1.7265625, -1.9140625, 7.4375, -1.46875, -0.7890625]
model.layers.21.post_attention_layernorm.weight torch.Size([2560])         2560      [1.421875, 1.484375, 1.4453125, 1.5234375, 1.359375]
model.layers.21.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [5.53125, 5.28125, 5.5, 5.65625, 5.5]
model.layers.21.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [-1.5390625, -8.4375, 0.46875, -1.390625, -1.1796875]
model.layers.21.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [1.140625, 1.4375, 1.296875, 1.234375, 1.1484375]
model.layers.21.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [0.447265625, 0.82421875, -0.42578125, 1.09375, 0.062255859375]
model.layers.21.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [5.96875, 1.9140625, -1.203125, -1.90625, -1.9140625]
model.layers.22.input_layernorm.weight          torch.Size([2560])         2560      [0.01483154296875, 0.01373291015625, 0.01513671875, 0.01458740234375, 0.01556396484375]
model.layers.22.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [2.0, 6.34375, 4.09375, -5.46875, 1.4375]
model.layers.22.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [4.59375, 4.625, 10.4375, 3.03125, 4.875]
model.layers.22.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [5.3125, 3.6875, 2.515625, -2.796875, 1.203125]
model.layers.22.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [-6.125, -4.875, -1.5859375, 1.5, 1.1328125]
model.layers.22.post_attention_layernorm.weight torch.Size([2560])         2560      [1.4296875, 1.515625, 1.4375, 1.5546875, 1.40625]
model.layers.22.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [4.65625, 4.25, 4.46875, 2.65625, 4.15625]
model.layers.22.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [1.765625, 0.9140625, -0.1728515625, 1.1875, -2.03125]
model.layers.22.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [1.2265625, -3.921875, -1.2578125, -1.8515625, -1.28125]
model.layers.22.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [-0.546875, -0.5390625, -3.375, 0.75390625, -0.03955078125]
model.layers.22.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [-0.416015625, -1.1875, 10.3125, 1.890625, -4.5625]
model.layers.23.input_layernorm.weight          torch.Size([2560])         2560      [0.01324462890625, 0.01300048828125, 0.0128173828125, 0.01416015625, 0.01470947265625]
model.layers.23.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [-1.0, 1.4609375, 0.003875732421875, -0.77734375, -13.4375]
model.layers.23.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [4.59375, 5.78125, 7.71875, 8.625, 10.5625]
model.layers.23.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [6.875, -1.1953125, -1.203125, -1.5703125, -1.4140625]
model.layers.23.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [-12.5, 2.09375, -1.125, 4.125, 0.7578125]
model.layers.23.post_attention_layernorm.weight torch.Size([2560])         2560      [1.484375, 1.5390625, 1.4609375, 1.5859375, 1.4375]
model.layers.23.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [6.5625, 6.09375, 6.3125, 6.28125, 6.65625]
model.layers.23.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [-1.3125, 1.3359375, -1.3984375, -1.3046875, 0.81640625]
model.layers.23.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [5.3125, -1.359375, 11.0625, -0.9375, 1.40625]
model.layers.23.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [0.8359375, 0.44140625, 0.48046875, -2.421875, -2.15625]
model.layers.23.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [-8.75, 1.828125, -7.15625, 1.953125, -1.8515625]
model.layers.24.input_layernorm.weight          torch.Size([2560])         2560      [0.0130615234375, 0.01190185546875, 0.01422119140625, 0.013671875, 0.01470947265625]
model.layers.24.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [1.71875, -1.453125, 5.25, -1.4609375, 10.875]
model.layers.24.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [5.0625, 5.59375, 7.3125, 8.0625, 8.3125]
model.layers.24.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [5.15625, 5.0, -3.265625, 1.1484375, 1.890625]
model.layers.24.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [1.09375, 1.109375, -1.4296875, 0.049072265625, 1.8828125]
model.layers.24.post_attention_layernorm.weight torch.Size([2560])         2560      [1.4921875, 1.5078125, 1.4921875, 1.5390625, 1.4453125]
model.layers.24.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [8.0, 8.4375, 7.8125, 7.90625, 7.34375]
model.layers.24.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [0.71875, -0.85546875, 1.6640625, -1.5625, -0.2412109375]
model.layers.24.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [-1.3984375, 1.390625, 1.3828125, -6.40625, 9.0625]
model.layers.24.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [0.023193359375, -0.80859375, -0.302734375, -0.67578125, -0.953125]
model.layers.24.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [-3.90625, -7.78125, -13.125, 9.0625, 1.859375]
model.layers.25.input_layernorm.weight          torch.Size([2560])         2560      [0.0172119140625, 0.01513671875, 0.0157470703125, 0.01953125, 0.017333984375]
model.layers.25.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [1.2890625, 0.72265625, 0.443359375, -11.3125, 1.46875]
model.layers.25.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [7.34375, 4.03125, 3.921875, 5.90625, 7.5625]
model.layers.25.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-1.3125, 0.703125, 1.703125, -2.34375, -1.3828125]
model.layers.25.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [-3.453125, 0.8984375, -4.375, -4.84375, -9.8125]
model.layers.25.post_attention_layernorm.weight torch.Size([2560])         2560      [1.4609375, 1.5078125, 1.4296875, 1.53125, 1.390625]
model.layers.25.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [6.6875, 5.71875, 7.28125, 7.21875, 8.5625]
model.layers.25.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [-1.15625, -1.171875, -3.75, 1.328125, 1.1796875]
model.layers.25.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [-1.0390625, 1.4140625, 1.359375, -2.40625, 1.0390625]
model.layers.25.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [-1.1015625, -1.59375, 0.75390625, 0.64453125, -0.12890625]
model.layers.25.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [-1.671875, -1.6875, -4.15625, -3.09375, -1.6796875]
model.layers.26.input_layernorm.weight          torch.Size([2560])         2560      [0.0150146484375, 0.013916015625, 0.01544189453125, 0.015625, 0.01556396484375]
model.layers.26.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [-3.65625, 1.3671875, 0.76953125, -2.234375, 1.2265625]
model.layers.26.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [5.46875, 11.3125, 9.125, 6.78125, 7.0]
model.layers.26.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [1.109375, -0.55078125, 3.875, -1.203125, 4.125]
model.layers.26.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [2.453125, -4.65625, 0.185546875, 1.1875, 0.056396484375]
model.layers.26.post_attention_layernorm.weight torch.Size([2560])         2560      [1.453125, 1.4453125, 1.453125, 1.546875, 1.453125]
model.layers.26.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [9.6875, 9.0, 9.0, 9.125, 9.5]
model.layers.26.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [0.5234375, -1.265625, -1.0859375, 1.390625, -1.21875]
model.layers.26.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [0.88671875, 8.375, -1.421875, 3.5625, -4.875]
model.layers.26.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [0.37890625, -0.8203125, -0.7890625, 0.66015625, 1.21875]
model.layers.26.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [9.625, 1.625, 17.875, 1.7421875, -1.4921875]
model.layers.27.input_layernorm.weight          torch.Size([2560])         2560      [0.015869140625, 0.01556396484375, 0.0169677734375, 0.017578125, 0.0167236328125]
model.layers.27.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [13.625, -0.0115966796875, 0.349609375, -1.40625, -1.2109375]
model.layers.27.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [8.875, 7.375, 8.375, 2.765625, 3.78125]
model.layers.27.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-1.2578125, 1.265625, -0.78125, -1.234375, 1.640625]
model.layers.27.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [2.109375, 3.375, 1.09375, 3.25, -6.09375]
model.layers.27.post_attention_layernorm.weight torch.Size([2560])         2560      [1.5390625, 1.5390625, 1.5234375, 1.609375, 1.5078125]
model.layers.27.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [9.9375, 9.8125, 10.375, 10.1875, 10.125]
model.layers.27.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [-1.2734375, -1.296875, -1.2890625, 3.71875, -0.9921875]
model.layers.27.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [-0.74609375, 5.46875, 1.328125, -3.65625, -0.90234375]
model.layers.27.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [-0.7890625, 0.203125, 0.205078125, 0.55078125, 0.76953125]
model.layers.27.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [1.8515625, 1.859375, -1.953125, 4.25, 1.28125]
model.layers.28.input_layernorm.weight          torch.Size([2560])         2560      [0.021240234375, 0.01556396484375, 0.0181884765625, 0.0206298828125, 0.0194091796875]
model.layers.28.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [2.015625, -1.4140625, 5.84375, 1.2890625, -0.455078125]
model.layers.28.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [4.0, 12.25, 12.0625, 10.4375, 4.4375]
model.layers.28.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-1.3359375, -9.8125, -0.94921875, 1.6015625, -0.88671875]
model.layers.28.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [-5.375, -10.8125, -4.15625, 5.4375, -1.9140625]
model.layers.28.post_attention_layernorm.weight torch.Size([2560])         2560      [1.5390625, 1.515625, 1.484375, 1.5234375, 1.484375]
model.layers.28.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [11.6875, 7.625, 13.0, 11.375, 11.4375]
model.layers.28.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [0.3359375, 1.03125, -0.57421875, -0.765625, 1.265625]
model.layers.28.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [2.140625, 3.1875, 0.9296875, -0.92578125, 0.6953125]
model.layers.28.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [-0.470703125, 0.6171875, 0.609375, 2.546875, -0.376953125]
model.layers.28.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [4.5, 1.5078125, -4.21875, 5.21875, -2.8125]
model.layers.29.input_layernorm.weight          torch.Size([2560])         2560      [0.0233154296875, 0.02490234375, 0.0216064453125, 0.0186767578125, 0.021728515625]
model.layers.29.mlp.down_proj.weight            torch.Size([2560, 6912])   17694720  [8.8125, -1.140625, 1.015625, -1.3984375, -2.96875]
model.layers.29.mlp.ffn_sub_norm.weight         torch.Size([6912])         6912      [13.875, 12.25, 4.5625, 6.84375, 17.25]
model.layers.29.mlp.gate_proj.weight            torch.Size([6912, 2560])   17694720  [-1.1171875, -0.92578125, 2.90625, 1.3359375, 1.2109375]
model.layers.29.mlp.up_proj.weight              torch.Size([6912, 2560])   17694720  [1.4609375, 7.75, 0.357421875, -1.3203125, -0.99609375]
model.layers.29.post_attention_layernorm.weight torch.Size([2560])         2560      [1.265625, 1.3125, 1.2578125, 1.1015625, 1.28125]
model.layers.29.self_attn.attn_sub_norm.weight  torch.Size([2560])         2560      [-14.0, 13.0, 11.0625, 12.1875, -7.869675755500793e-08]
model.layers.29.self_attn.k_proj.weight         torch.Size([640, 2560])    1638400   [0.384765625, -0.470703125, -4.125, 1.0625, -0.359375]
model.layers.29.self_attn.o_proj.weight         torch.Size([2560, 2560])   6553600   [0.49609375, -1.6796875, -1.59375, -0.173828125, 5.401670932769775e-07]
model.layers.29.self_attn.q_proj.weight         torch.Size([2560, 2560])   6553600   [0.57421875, -1.125, 0.5234375, -0.5703125, 0.74609375]
model.layers.29.self_attn.v_proj.weight         torch.Size([640, 2560])    1638400   [-1.1484375, -1.15625, 4.25, 0.416015625, -1.28125]
model.norm.weight                               torch.Size([2560])         2560      [0.10302734375, 0.1005859375, 0.10205078125, 0.16015625, 0.09228515625]
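
The table above can be regenerated by iterating over the checkpoint's parameters. A sketch (assumes the transformers library; the model id and its availability are assumptions, adjust to whichever BitNet b1.58 2B4T checkpoint you downloaded):

import torch
from transformers import AutoModelForCausalLM

# trust_remote_code pulls in the custom configuration_bitnet / modeling_bitnet
# files referenced by auto_map in config.json.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/bitnet-b1.58-2B-4T",   # assumed model id
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

total = 0
for name, param in model.named_parameters():
    total += param.numel()
    sample = param.flatten()[:5].float().tolist()
    print(f"{name:<47} {str(tuple(param.shape)):<26} {param.numel():<9} {sample}")
print(f"total: {total:,} parameters")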

Todo
