This repository analyzes articles from note, a service provided by note inc.
It processes XML files that contain articles exported from the note service.
- Set up the virtual environment and install dependencies:

      uv sync
      uv run python -m ensurepip --upgrade
      uv run python -m spacy download ja_core_news_sm
      uv run python -m spacy download ja_core_news_md

- Migrate the database:

      # migration
      uv run alembic upgrade head

- Export articles from the note service as XML files.
- Put the XML files in the `datasets` directory.
- Save the articles to the database by running the following command:

      uv run python -m src.load_xml --xml datasets/*.xml

- article table
Column Name | Type | Description |
---|---|---|
id | text | Article URL (primary key) |
creator | text | Username of the author |
pub_date | text | Date and time the article was published |
post_date | text | Date and time the article was posted |
title | text | Article title |
body_md | text | Article body in Markdown format |
body_html | text | Article body in HTML format |
post_id | int | Article ID |
post_type | text | Article type |
status | text | Article status |
- pos_ngram_similarity table
Column Name | Type | Description |
---|---|---|
id | int | ID (primary key) |
article_id_a | int | Article ID |
article_id_b | int | Article ID |
model | text | Name of the model used |
ngram_size | int | n-gram size |
embedding_method | text | POS embedding method |
ngram_similarity | text | POS n-gram similarity |
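
To make the two tables concrete, here is a minimal sketch that recreates them in an in-memory SQLite database and reads stored similarities back with article titles. Several things here are assumptions rather than facts about this repository: the use of SQLite (the real schema is whatever the Alembic migrations create), the guess that `article_id_a` and `article_id_b` refer to `article.post_id`, and the helper names `build_demo_db` and `fetch_similarities`, which are hypothetical. The sample rows are made up for illustration.

```python
import sqlite3

# Assumption: column names and types follow the tables above; the real DDL
# lives in the project's Alembic migrations and may differ.
SCHEMA = """
CREATE TABLE article (
    id TEXT PRIMARY KEY,
    creator TEXT,
    pub_date TEXT,
    post_date TEXT,
    title TEXT,
    body_md TEXT,
    body_html TEXT,
    post_id INTEGER,
    post_type TEXT,
    status TEXT
);
CREATE TABLE pos_ngram_similarity (
    id INTEGER PRIMARY KEY,
    article_id_a INTEGER,
    article_id_b INTEGER,
    model TEXT,
    ngram_size INTEGER,
    embedding_method TEXT,
    ngram_similarity TEXT
);
"""


def build_demo_db():
    """Create an in-memory database with two made-up articles and one pair."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO article (id, creator, title, post_id) VALUES (?, ?, ?, ?)",
        [
            ("https://example.com/a", "demo_user", "First post", 1),
            ("https://example.com/b", "demo_user", "Second post", 2),
        ],
    )
    conn.execute(
        "INSERT INTO pos_ngram_similarity "
        "(article_id_a, article_id_b, model, ngram_size, embedding_method, ngram_similarity) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (1, 2, "ja_core_news_md", 2, "bow", "0.5"),
    )
    return conn


def fetch_similarities(conn, model, ngram_size):
    """Return (title_a, title_b, similarity) for one model and n-gram size."""
    rows = conn.execute(
        """
        SELECT a.title, b.title, s.ngram_similarity
        FROM pos_ngram_similarity AS s
        JOIN article AS a ON a.post_id = s.article_id_a  -- assumed link
        JOIN article AS b ON b.post_id = s.article_id_b  -- assumed link
        WHERE s.model = ? AND s.ngram_size = ?
        ORDER BY s.id
        """,
        (model, ngram_size),
    ).fetchall()
    # The similarity column is typed as text, so convert it for numeric use.
    return [(title_a, title_b, float(sim)) for title_a, title_b, sim in rows]
```

Calling `fetch_similarities(build_demo_db(), "ja_core_news_md", 2)` returns the single demo pair with its similarity as a float.
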
To analyze the similarity of writing styles across articles, use the `pos_ngram_similarity` module.
It computes the similarity between articles based on their part-of-speech (POS) n-grams.
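
As an illustration of the idea, the sketch below compares two POS tag sequences by counting their n-grams and taking the cosine similarity of the count vectors. This is one plausible reading of a bag-of-words (`bow`) comparison, not a description of what the module actually does; the function names are hypothetical and the tag sequences are hand-written rather than produced by a model.

```python
from collections import Counter
from math import sqrt


def pos_ngrams(tags, n):
    """Return all contiguous n-grams of a POS tag sequence as tuples."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # A text too short to yield any n-gram is treated as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def pos_ngram_similarity(tags_a, tags_b, n=2):
    """Assumed bag-of-words comparison: count POS n-grams, then take cosine."""
    return cosine_similarity(
        Counter(pos_ngrams(tags_a, n)),
        Counter(pos_ngrams(tags_b, n)),
    )
```

With spaCy, a tag sequence for a text can be obtained as `[token.pos_ for token in nlp(text)]`, where `nlp` is a pipeline loaded with `spacy.load("ja_core_news_md")`.
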
The script automatically:
- Skips similarity calculations for article pairs that already exist in the database
- Calculates similarities only for new article combinations
- Creates visualizations (heatmap and distribution plots) after processing
uv run python -m src.pos_ngram_similarity \
--xml datasets/*.xml \
--model ja_core_news_md \
--n 2 \
--embedding_type bow \
--creator <creator_name>
Results are saved in the `pos_ngram_similarity` table, and visualization files are generated in the output directory.
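
The behaviour of skipping pairs that are already stored can be sketched as follows. This is a simplified illustration, not the script's code: the function name `pending_pairs` is hypothetical, and it assumes a pair is unordered, so `(2, 1)` and `(1, 2)` count as the same pair.

```python
from itertools import combinations


def pending_pairs(article_ids, existing_pairs):
    """Return the article pairs that still need a similarity score.

    article_ids: iterable of article IDs to compare.
    existing_pairs: iterable of (id_a, id_b) pairs already in the database.
    """
    # Normalize stored pairs so that order within a pair does not matter.
    done = {tuple(sorted(pair)) for pair in existing_pairs}
    # combinations() over sorted unique IDs yields each unordered pair once.
    candidates = combinations(sorted(set(article_ids)), 2)
    return [pair for pair in candidates if pair not in done]
```

For three articles with IDs 1, 2 and 3 and one stored pair `(2, 1)`, only `(1, 3)` and `(2, 3)` remain to be computed.
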