This script (`collect_data.py`) processes supplier price lists and matches them with products in the store. It extracts and normalizes the data, uses `SentenceTransformer` to find the most similar products, and then generates an Excel table.
Section | Description |
---|---|
Functionality | Detailed breakdown of the script's capabilities and processing steps |
Technologies Used | List of technologies utilized in the project |
Installation and Execution | Step-by-step guide on setting up and running the script |
How Does SentenceTransformer Work? | Explanation of how SentenceTransformer is used for matching products based on semantic similarity |
Project Structure | Folder structure explanation |
Usage Presentation | Demonstration of the script in use |
Trained Model | Link to the fine-tuned ML model |
Contacts | Developer contact information |
- Data Loading
  - Loads supplier price lists from `Прайсы с телеграма 28.01.xlsx`.
  - Loads store products from `Товары магазина.xlsx`.
- Data Cleaning and Normalization
  - Extracts color (`get_color()`).
  - Extracts RAM and ROM (`get_ram()`, `get_rom()`).
  - Extracts prices (`get_price()`).
  - Identifies manufacturer (`get_manufacturer()`).
- Matching Supplier Products with Store Products
  - Encodes product names using `SentenceTransformer`.
  - Computes cosine similarity between products.
  - Keeps only matches with similarity above 90%.
- Generating an Excel Table (`output_prices.xlsx`)
  - Groups products by name.
  - Lists prices and suppliers for each product.
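
The normalization helpers listed above could look roughly like the sketch below. It is an illustration only: the function names come from the script, but the regular expressions, the assumed `RAM/ROM` notation (`8/256`), and the currency suffixes are assumptions, and the real implementations in `collect_data.py` may differ.

```python
import re
from typing import Optional

# Hypothetical re-implementations for illustration only; the actual
# get_ram() / get_rom() / get_price() in collect_data.py may differ.

# Assumed notation: "8/256" means 8 GB RAM and 256 GB storage.
_RAM_ROM = re.compile(r"(\d+)\s*/\s*(\d+)")
# Assumed notation: a number followed by a GB/TB unit (Latin or Cyrillic).
_STORAGE = re.compile(r"(\d+)\s*(GB|TB|ГБ|ТБ)", re.IGNORECASE)
# Assumed notation: the price is the trailing number on the line,
# optionally with space or dot thousands separators and a ruble suffix.
_PRICE = re.compile(r"(\d{1,3}(?:[ .]\d{3})+|\d{4,})\s*(?:₽|руб\.?|р\.?)?\s*$")


def get_ram(name: str) -> Optional[int]:
    """Return RAM in GB from a pair such as '8/256', or None if absent."""
    m = _RAM_ROM.search(name)
    return int(m.group(1)) if m else None


def get_rom(name: str) -> Optional[int]:
    """Return storage in GB from '8/256', '512GB' or '1TB', or None."""
    m = _RAM_ROM.search(name)
    if m:
        return int(m.group(2))
    m = _STORAGE.search(name)
    if not m:
        return None
    size = int(m.group(1))
    unit = m.group(2).upper()
    # Convert terabytes to gigabytes so all values share one unit.
    return size * 1024 if unit in ("TB", "ТБ") else size


def get_price(line: str) -> Optional[int]:
    """Return the trailing price on the line as an integer, or None."""
    m = _PRICE.search(line.strip())
    if not m:
        return None
    # Drop thousands separators before converting.
    return int(re.sub(r"[ .]", "", m.group(1)))
```

Under these assumptions, `get_rom("iPhone 15 Pro 1TB")` returns `1024` and `get_price("Galaxy S24 8/256 - 74 990 руб")` returns `74990`. Real price lists tend to be messier, so patterns like these usually need tuning against the actual supplier data.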
git clone https://github.com/arielen/test_Hatiko_tech2.git
cd test_Hatiko_tech2
pip install -r requirements.txt
Place `Прайсы с телеграма 28.01.xlsx` and `Товары магазина.xlsx` in the same directory as `collect_data.py`.
python collect_data.py
After execution, the result will be saved as `output_prices.xlsx`.
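
To give a feel for the shape of the output, here is a minimal sketch of grouping matched offers by product name and listing each supplier's price. The column names (`product`, `supplier`, `price`), the helper `group_offers`, and the sample offers are all hypothetical; the actual layout of `output_prices.xlsx` is defined in `collect_data.py`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple, Union

Offer = Tuple[str, str, int]  # (product name, supplier, price)


def group_offers(offers: List[Offer]) -> List[Dict[str, Union[str, int]]]:
    """Group offers by product name and order each group by price.

    Hypothetical helper: returns one row per (product, supplier) pair.
    """
    grouped = defaultdict(list)
    for product, supplier, price in offers:
        grouped[product].append((price, supplier))

    rows = []
    for product in sorted(grouped):
        # Cheapest supplier first within each product.
        for price, supplier in sorted(grouped[product]):
            rows.append({"product": product, "supplier": supplier, "price": price})
    return rows


# Made-up sample data, for illustration only.
sample = [
    ("iPhone 15 128GB", "Supplier B", 71000),
    ("iPhone 15 128GB", "Supplier A", 70500),
    ("Galaxy S24 8/256", "Supplier A", 74990),
]
rows = group_offers(sample)

# With pandas installed, rows like these could be written out with:
#   pandas.DataFrame(rows).to_excel("output_prices.xlsx", index=False)
```

Sorting each group by price is a design choice of this sketch, not something the script is known to do; it simply makes the cheapest supplier easy to spot.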
The script uses the fine-tuned `fine_tuned_mpnet_v3` model to measure similarity between product names:
- Encodes product names into vector representations.
- Compares supplier products with store products.
- Selects products with similarity above 90%.
✅ Example:
🔍 Query: iPhone 15 Pro Max 512GB
✅ Found: Apple iPhone 15 Pro 512GB (Similarity: 0.92)
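
The comparison and filtering steps above can be sketched in plain Python. This is a simplified illustration, not the script's code: the helper names `cosine_similarity` and `best_match` are hypothetical, and tiny hand-written vectors stand in for real embeddings. In the actual pipeline the vectors would presumably come from something like `SentenceTransformer("fine_tuned_mpnet_v3").encode(names)`.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(
    query_vec: Sequence[float],
    store_vecs: List[Sequence[float]],
    threshold: float = 0.90,
) -> Optional[Tuple[int, float]]:
    """Return (index, score) of the most similar store vector.

    Returns None when no candidate scores above the threshold.
    """
    best_idx, best_score = None, -1.0
    for idx, vec in enumerate(store_vecs):
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_idx, best_score = idx, score
    if best_idx is None or best_score <= threshold:
        return None
    return best_idx, best_score


# Toy 2-D vectors standing in for sentence embeddings (assumption).
supplier_vec = [1.0, 0.0]
store_vecs = [[0.9, 0.1], [0.0, 1.0]]
match = best_match(supplier_vec, store_vecs)
```

Here the first store vector is the best candidate and clears the 0.90 threshold, so `match` holds its index and score; a supplier item with no sufficiently similar store product yields `None` and would be left unmatched.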
📂 data-parsing/
├── 📜 collect_data.ipynb # Jupyter notebook for data collection
├── 📝 collect_data.py # Python script for data processing
├── 📂 fine_tuned_mpnet_v3 # Fine-tuned SentenceTransformer model
├── 📜 fine_tune.csv # CSV file for fine-tuning the model
├── 📜 learning.ipynb # Jupyter notebook for model training
├── 📝 learning.py # Python script for model fine-tuning
├── 📊 output_prices.xlsx # Generated Excel file with matched prices
├── 📜 README.md # Project documentation
├── 📜 requirements.txt # Dependencies file
├── 📜 Прайсы с телеграма.xlsx # Supplier price list
└── 📜 Товары магазина.xlsx # Store product list
The fine-tuned ML model used in this project is available on Hugging Face: Fine-Tuned MPNet v3
This script automates the collection and processing of supplier price lists and generates a structured Excel table with matched products.
💻 Developer: arielen
📧 Email: pavlov_zv@mail.ru
📧 TG: 1 0