A collection of useful Natural Language Processing tools, exposed through an HTTP API and packaged as a Docker container.
Currently includes:
- Language identification: using the fastText package with its pre-trained model for recognizing 176 languages.
- Sentence segmentation: based on tokenizers from NLTK.
- Extraction of the main content from HTML documents: using the Trafilatura package.
- Removal of fuzzy duplicate sentences: using Levenshtein distance and the Levenshtein package.
To use, install the latest version of Docker Engine.
Clone the repo:
git clone https://github.com/kinamoroll/NLPUtilCollection.git
Or download the repository as a zip archive (don't forget to unzip it).
And start the container:
cd NLPUtilCollection
docker compose up --build -d
To stop the container, navigate to the folder where NLPUtilCollection is cloned and run:
docker compose stop
Once the container is running, you can check it by sending requests from the same machine (curl must be installed):
# checking language identification:
curl -v -XPOST -d 'text=some+useful+info' http://127.0.0.1:9090/detect
# checking tokenization:
curl -v -XPOST -d 'text=Test+sent%3F+Don%27t+or+ms.+Not%21+Yes%2C+of+course.+Maybe+mr.Jeck+and+band.&lang=en' http://127.0.0.1:9090/tokenize
# extraction of text from an html document:
curl -v -XPOST -d 'html=%3Chtml%3E%3Cbody%3E%3Ch1%3Etest%3C%2Fh1%3E%3Cp%3Ethis%20is%20test%3C%2Fp%3E%3C%2Fbody%3E%3C%2Fhtml%3E' http://127.0.0.1:9090/extract
# deletion of sentence duplicates:
curl -v -XPOST -d '{"sentences": ["1 sentence", "2 sentence", "Another sentence"], "threshold": 0.8}' http://127.0.0.1:9090/deduplicate
To check from another machine, replace 127.0.0.1 with the server's IP address and make sure port 9090 is not blocked by the firewall.
All endpoints only process requests with the POST HTTP method.
API Endpoint: /detect
Supports the following input parameters:
- text: a string with the text whose language you want to identify;
- count: the number of results to return. Default: 3.
The response is a JSON array of objects:
[
{
"confidence": 0.5937589406967163,
"code": "en",
"name": "English",
"family": "Indo-European",
"endonym": "English",
"iso639-1": "en",
"iso639-2/T": "eng",
"iso639-2/B": "eng",
"iso639-3": "eng"
}
]
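The same call can be made from Python. The sketch below is a minimal illustration using only the standard library. It assumes the container listens on 127.0.0.1:9090 and that `text` and `count` are sent as form fields, as in the curl example; the helper names `build_detect_request` and `detect` are hypothetical and not part of the project.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:9090"  # assumption: container runs locally on port 9090


def build_detect_request(text, count=3, base_url=BASE_URL):
    """Build (but do not send) a form-encoded POST request for /detect."""
    body = urllib.parse.urlencode({"text": text, "count": count}).encode("utf-8")
    return urllib.request.Request(f"{base_url}/detect", data=body, method="POST")


def detect(text, count=3, base_url=BASE_URL):
    """Send the request and return the parsed JSON array of language candidates."""
    with urllib.request.urlopen(build_detect_request(text, count, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Inspect the request without touching the network.
req = build_detect_request("some useful info", count=1)
print(req.get_method(), req.full_url)
print(req.data)
```

With the container running, `detect("some useful info")` should return a list shaped like the JSON above, so the top candidate's code would be available as `result[0]["code"]`.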
API Endpoint: /tokenize
Supports the following input parameters:
- text: a string with the text that needs to be split into sentences;
- lang: the language code of the text. Default: en.
Supported languages for tokenization:
{
"en": "english",
"ru": "russian",
"cs": "czech",
"da": "danish",
"nl": "dutch",
"et": "estonian",
"fi": "finnish",
"fr": "french",
"de": "german",
"el": "greek",
"it": "italian",
"ml": "malayalam",
"no": "norwegian",
"pl": "polish",
"pt": "portuguese",
"sl": "slovene",
"es": "spanish",
"sv": "swedish",
"tr": "turkish"
}
The response is a JSON array of strings:
[
"Test sent?",
"Don't or ms. Not!",
"Yes, of course.",
"Maybe mr.Jeck and band."
]
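A client may want to reject unsupported language codes before calling the service. The following is a small sketch under that assumption: the set of codes is copied from the supported-languages list above, while the helper name `build_tokenize_request` and the local address are illustrative only.

```python
import urllib.parse
import urllib.request

# Language codes copied from the supported-languages list in this README.
SUPPORTED_LANGS = {
    "en", "ru", "cs", "da", "nl", "et", "fi", "fr", "de", "el",
    "it", "ml", "no", "pl", "pt", "sl", "es", "sv", "tr",
}


def build_tokenize_request(text, lang="en", base_url="http://127.0.0.1:9090"):
    """Build (but do not send) a form-encoded POST request for /tokenize."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    body = urllib.parse.urlencode({"text": text, "lang": lang}).encode("utf-8")
    return urllib.request.Request(f"{base_url}/tokenize", data=body, method="POST")


req = build_tokenize_request("Test sent? Yes, of course.")
print(req.data)
```

Passing the request to `urllib.request.urlopen` and decoding the body with `json.loads` should yield the array of sentences.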
API Endpoint: /extract
This supports only one input parameter:
- html: the content of an HTML page, encoded using a urlencode function (the page needs to be downloaded independently).
It's very important to URL-encode the transmitted page: without encoding, the parser will process only part of the page (up to the first & symbol)!
The response is the main content of the page, without HTML tags.
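The truncation at the first & can be reproduced locally. In this sketch, Python's `parse_qs` stands in for the server's form parser (an assumption; the service's actual parser may differ in details), and the sample page is made up for illustration.

```python
from urllib.parse import parse_qs, urlencode

# Made-up sample page containing an ampersand.
html = "<html><body><p>Tom &amp; Jerry</p></body></html>"

# Unencoded body: "&" is read as a field separator, so the value is cut off.
raw_body = "html=" + html
truncated = parse_qs(raw_body)["html"][0]

# URL-encoded body: the whole page survives the round trip.
encoded_body = urlencode({"html": html})
restored = parse_qs(encoded_body)["html"][0]

print(repr(truncated))
print(restored == html)
```

`urlencode` here plays the same role as the urlencode function mentioned above; any equivalent percent-encoding routine should work.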
API Endpoint: /deduplicate
Please note that only JSON is acceptable.
The JSON body uses the following keys:
- sentences: an array of strings (sentences) from which duplicates are to be deleted;
- threshold: a threshold value in the interval [0.0, 1.0] at which a sentence is considered a fuzzy duplicate (optional; default: 0.8).
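To give a feel for what the threshold means, here is a rough sketch of a normalized Levenshtein similarity. It is an illustration only: the service relies on the Levenshtein package, whose exact ratio formula and comparison rule may differ, and the functions `levenshtein`, `similarity`, and `is_fuzzy_duplicate` are hypothetical names introduced here.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map the distance to [0.0, 1.0], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_fuzzy_duplicate(a, b, threshold=0.8):
    # Assumption: a pair counts as duplicates when similarity reaches the threshold.
    return similarity(a, b) >= threshold


print(similarity("1 sentence", "2 sentence"))
print(is_fuzzy_duplicate("2 sentence", "Another sentence"))
```

Under this definition, "1 sentence" and "2 sentence" differ by one character out of ten, so they land above the default threshold, while "Another sentence" is far enough from both to be kept.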
The response is a JSON array of strings:
[
"2 sentence",
"Another sentence"
]
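Because this endpoint takes JSON rather than form fields, a Python client needs to serialize the body itself. The sketch below assumes the container is reachable at 127.0.0.1:9090 and sets a `Content-Type: application/json` header; whether the service requires that header is an assumption, and `build_deduplicate_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_deduplicate_request(sentences, threshold=0.8, base_url="http://127.0.0.1:9090"):
    """Build (but do not send) a JSON POST request for /deduplicate."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within [0.0, 1.0]")
    payload = json.dumps({"sentences": list(sentences), "threshold": threshold})
    return urllib.request.Request(
        f"{base_url}/deduplicate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption: header expected by the service
        method="POST",
    )


req = build_deduplicate_request(["1 sentence", "2 sentence", "Another sentence"])
print(req.data.decode("utf-8"))
```

Sending it with `urllib.request.urlopen(req)` and parsing the body with `json.loads` should give back the deduplicated array of strings.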
Kinamoroll
Blog: https://t.me/Kinamoroll