This project is a REST API service built with FastAPI that extracts and returns cleaned text from HTML content using the Goose3 library.
- Accepts HTML content via a POST request.
- Extracts and returns the main text content from the HTML.
- Simple and fast implementation using FastAPI and Goose3.
- Python 3.8+
- FastAPI
- Goose3
- Uvicorn
- Clone the repository:
  git clone https://github.com/rbehzadan/extract-text-api.git
  cd extract-text-api
- Create a virtual environment and activate it:
  python -m venv venv
  source venv/bin/activate
- Install the dependencies:
  pip install -r requirements.txt
Start the FastAPI application using Uvicorn:
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
Send a POST request to /extract-text with the HTML content in the request body:
POST /extract-text
Content-Type: application/json
{
  "content": "<html><body><h1>Sample Article</h1><p>This is a sample paragraph.</p></body></html>"
}
The API will return a JSON response with the cleaned text:
{
  "text": "Sample Article\nThis is a sample paragraph."
}
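The request and response shown above can be exercised from Python with only the standard library. The following is a minimal client sketch. The helper names `build_extract_request` and `fetch_extracted_text` are hypothetical, and the base URL `http://localhost:8000` assumes the server was started with the Uvicorn command given earlier.

```python
import json
import urllib.request


def build_extract_request(html, base_url="http://localhost:8000"):
    """Build the POST request for /extract-text (base_url is an assumption)."""
    payload = json.dumps({"content": html}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/extract-text",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_extracted_text(html, base_url="http://localhost:8000"):
    """Send the HTML to the API and return the "text" field of the reply."""
    request = build_extract_request(html, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["text"]


if __name__ == "__main__":
    sample = "<html><body><h1>Sample Article</h1><p>This is a sample paragraph.</p></body></html>"
    # Requires the API to be running locally.
    print(fetch_extracted_text(sample))
```

Separating request construction from the network call makes the payload easy to inspect without a running server.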
To run the application using Docker:
- Build the Docker image:
  docker build -t extract-text-api .
- Run the Docker container:
  docker run -d -p 8000:8080 extract-text-api
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! Please open an issue or submit a pull request for any changes.