GitHub - greshny-attic/diffbot: The ruby Diffbot API client
This repository was archived by the owner on May 14, 2023. It is now read-only.

greshny-attic/diffbot

Diffbot

This is a Ruby client for the Diffbot API.

Install

Get the latest version from RubyGems:

$ gem install diffbot

Global Options

You can pass some settings to Diffbot like this:

Diffbot.configure do |config|
  config.token = ENV["DIFFBOT_TOKEN"]
  config.instrumentor = ActiveSupport::Notifications
end

The list of supported settings is:

  • token: Your Diffbot API token. This will be used for all requests in which you don't specify it manually (see below).
  • instrumentor: An object that matches the ActiveSupport::Notifications API, which will be used to trace network events. None is used by default.
  • article_defaults: Pass a block to this method to configure the global request settings used for Diffbot::Article requests. See below for the supported options.
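
To illustrate the instrumentor setting, here is a minimal sketch of an object exposing the block-taking instrument(name, payload) method that ActiveSupport::Notifications provides. RecordingInstrumentor and the event name are hypothetical, and which methods the client actually calls on the instrumentor is an assumption this README does not confirm.

```ruby
# Hypothetical stand-in for ActiveSupport::Notifications: it records each
# event name, payload, and elapsed time instead of notifying subscribers.
class RecordingInstrumentor
  attr_reader :events

  def initialize
    @events = []
  end

  # Mirrors the shape of ActiveSupport::Notifications.instrument:
  # yields the payload to the block and returns the block's result.
  def instrument(name, payload = {})
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = block_given? ? yield(payload) : nil
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @events << { name: name, payload: payload, seconds: elapsed }
    result
  end
end

instrumentor = RecordingInstrumentor.new
value = instrumentor.instrument("example.request", { url: "http://example.com" }) { :done }
```

An object like this could then be assigned with config.instrumentor = instrumentor inside the Diffbot.configure block.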

Articles

In order to fetch an article, do this:

require "diffbot"

article = Diffbot::Article.fetch(article_url, diffbot_token)

# Now you can inspect the result:
article.title
article.author
article.date
article.text
# etc. See below for the full list of available response attributes.

This is a list of all the fields returned by the Diffbot::Article.fetch call:

  • url: The URL of the article.
  • title: The title of the article.
  • author: The author of the article.
  • date: The date on which this article was published.
  • media: A list of media items attached to this article.
  • text: The body of the article. This will be plain text unless you specify the HTML option in the request.
  • tags: A list of tags/keywords extracted from the article.
  • xpath: The XPath at which this article was found in the page.
  • human_language: Returns the (spoken/human) language of the submitted URL, using two-letter ISO 639-1 nomenclature.
  • num_pages: Number of pages automatically concatenated to form the text or html response.
  • images: Array of images, if present within the article body.
    • url: Direct (fully resolved) link to image.
    • pixel_height: Image height, in pixels.
    • pixel_width: Image width, in pixels.
    • caption: Diffbot-determined best caption for the image, if detected.
    • primary: Returns "true" if the image is identified as primary.
  • videos: Array of videos, if present within the article body.
    • url: Direct (fully resolved) link to the video content.
    • pixel_height: Video height, in pixels, if accessible.
    • pixel_width: Video width, in pixels, if accessible.
    • primary: Returns "true" if the video is identified as primary.
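
As a sketch of how the images fields above might be used, the helper below picks the image flagged as primary and falls back to the largest one by pixel area. SampleImage, SampleArticle, lead_image and truthy? are hypothetical names. The Struct stand-ins only mimic the documented field names so the example runs without the gem or a network call.

```ruby
# Stand-ins for the response objects described above; real values would
# come from Diffbot::Article.fetch.
SampleImage = Struct.new(:url, :pixel_height, :pixel_width, :caption, :primary)
SampleArticle = Struct.new(:title, :images)

# The field list describes `primary` as returning "true"; whether that is a
# String or a boolean is not stated, so this accepts both.
def truthy?(value)
  value == true || value.to_s == "true"
end

# Returns the image flagged as primary, else the largest image by area,
# else nil when the article has no images.
def lead_image(article)
  images = Array(article.images)
  images.find { |image| truthy?(image.primary) } ||
    images.max_by { |image| image.pixel_height.to_i * image.pixel_width.to_i }
end

article = SampleArticle.new("Example", [
  SampleImage.new("http://example.com/a.png", 100, 100, nil, nil),
  SampleImage.new("http://example.com/b.png", 50, 50, "Lead", "true")
])
lead = lead_image(article)
```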

Options

You can customize your request like this:

article = Diffbot::Article.fetch(article_url, diffbot_token) do |request|
  request.html = true # Return HTML instead of plain text.
  request.dont_strip_ads = true # Leave any inline ads within the article.
  request.tags = true # Generate tags for the article.
  request.comments = true # Extract the comments from the article as well.
  request.summary = true # Return a summary text instead of the full text.
  request.stats = true # Return performance, probabilistic scoring stats.
end
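
When the options come from a Hash (for example, application settings), a small helper can copy them onto the request. This is a minimal sketch: apply_article_options and FakeRequest are hypothetical, and it assumes the yielded request exposes a setter for each option shown above.

```ruby
# Option names taken from the request example above.
ARTICLE_OPTIONS = %i[html dont_strip_ads tags comments summary stats].freeze

# Copies a Hash of options onto a request object through its setters,
# rejecting names that are not in the list above.
def apply_article_options(request, options)
  options.each do |name, value|
    unless ARTICLE_OPTIONS.include?(name.to_sym)
      raise ArgumentError, "unknown article option: #{name}"
    end
    request.public_send("#{name}=", value)
  end
  request
end

# Hypothetical stand-in for the request object yielded by Article.fetch.
FakeRequest = Struct.new(*ARTICLE_OPTIONS)

request = apply_article_options(FakeRequest.new, { html: true, tags: true })
```

With the gem loaded, the helper could be called inside the block passed to Diffbot::Article.fetch, with the yielded request as its first argument.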

Frontpages

In order to fetch and analyze a front page, do this:

require "diffbot"

frontpage = Diffbot::Frontpage.fetch(url, diffbot_token)

# Results are available in the returned object:
frontpage.title
frontpage.icon
frontpage.items #=> An array of Diffbot::Item instances

The fields you can extract from a Frontpage are:

  • title: The title of the page.
  • icon: The favicon of the page.
  • source_type: What kind of page this is.
  • source_url: The URL of the page.
  • items: The list of Diffbot::Item instances representing each item on the page.

The instances of Diffbot::Item have the following fields:

  • id: Unique identifier for this item.
  • title: Title of the item.
  • link: Extracted permalink of the item (if applicable).
  • description: innerHTML content of the item.
  • summary: A plain-text summary of the item.
  • pub_date: Date when item was detected on page.
  • type: The type of item, according to Diffbot. One of: IMAGE, LINK, STORY, CHUNK.
  • img: The main image extracted from this item.
  • xroot: XPath of where the item was found on the page.
  • cluster: XPath of the cluster of items where this item was found.
  • stats: An object with the following attributes:
    • spam_score: A Float between 0.0 and 1.0 indicating the probability this item is spam/an advertisement.
    • static_rank: A Float between 1.0 and 5.0 indicating the quality score of the item.
    • fresh: The percentage of the item that has changed compared to the previous crawl.
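
The type and stats fields lend themselves to filtering and ranking. The sketch below keeps STORY items under a spam threshold and sorts them by static_rank. SampleItem, SampleStats and top_stories are hypothetical names and the 0.5 threshold is arbitrary. The Structs stand in for Diffbot::Item so the example runs without the gem.

```ruby
# Stand-ins that mimic the documented field names of Diffbot::Item.
SampleStats = Struct.new(:spam_score, :static_rank, :fresh)
SampleItem = Struct.new(:title, :type, :link, :stats)

# Keeps STORY items whose spam_score is below max_spam and returns them
# ordered by static_rank, highest first. Items without stats are skipped.
def top_stories(items, max_spam: 0.5)
  items
    .select { |item| item.type.to_s == "STORY" && item.stats && item.stats.spam_score < max_spam }
    .sort_by { |item| -item.stats.static_rank }
end

items = [
  SampleItem.new("a", "STORY", "http://example.com/a", SampleStats.new(0.1, 3.0, 0.0)),
  SampleItem.new("b", "STORY", "http://example.com/b", SampleStats.new(0.9, 5.0, 0.0)),
  SampleItem.new("c", "IMAGE", "http://example.com/c", SampleStats.new(0.0, 4.0, 0.0)),
  SampleItem.new("d", "STORY", "http://example.com/d", SampleStats.new(0.2, 4.5, 0.0))
]
stories = top_stories(items)
```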

Products

In order to fetch a product, do this:

require "diffbot"

product = Diffbot::Product.fetch(product_url, diffbot_token)

# Now you can inspect the result:
product.products
product.type
product.url
# etc. See below for the full list of available response attributes.

This is a list of all the fields returned by the Diffbot::Product.fetch call:

  • breadcrumb: an array of link URLs and text from page breadcrumbs
    • name: text
    • link: a URL
  • date_created: date the product was published
  • type: response type
  • products: array of products
    • title: name of the product
    • description: description, if available, of the product
    • offer_price: identified offer or actual/'final' price
    • product_id: the product's unique ID
    • availability: item's availability, either true or false
    • offer_price_details: price details
      • amount:
      • text:
      • symbol:
    • media: array of media items (images or videos) of the product.
      • primary: images only; true if the image is identified as primary
      • link: link to image or video content.
      • caption: caption for the image.
      • type: type of media identified (image or video).
      • height: image height, in pixels.
      • width: image width, in pixels.
      • xpath: full document Xpath to the media item.
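
As an illustration of walking the products array, the sketch below lists available products with the link of their primary image. SampleMedia, SampleProduct, primary_image_link and available_summaries are hypothetical names. It assumes availability and primary are booleans, as the field list suggests, and uses Structs in place of real responses.

```ruby
# Stand-ins that mimic the documented product and media field names.
SampleMedia = Struct.new(:primary, :link, :type)
SampleProduct = Struct.new(:title, :availability, :media)

# Returns the link of the media item that is an image flagged as primary,
# or nil if the product has none.
def primary_image_link(product)
  media = Array(product.media).find { |m| m.type.to_s == "image" && m.primary == true }
  media && media.link
end

# Builds one summary Hash per available product.
def available_summaries(products)
  products
    .select { |product| product.availability == true }
    .map { |product| { title: product.title, image: primary_image_link(product) } }
end

products = [
  SampleProduct.new("Lamp", true, [
    SampleMedia.new(false, "http://example.com/lamp.mp4", "video"),
    SampleMedia.new(true, "http://example.com/lamp.png", "image")
  ]),
  SampleProduct.new("Chair", false, []),
  SampleProduct.new("Desk", true, nil)
]
summaries = available_summaries(products)
```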

TODO

  • Implement the Follow API.
  • Add tests for Article and Frontpage requests.
  • Add a Frontpage.crawl method that, given the URL of a front page, fetches the article for each item on the page.
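
Until such a method exists, the crawl idea can be approximated in application code. The following is only a sketch and not the gem's implementation: crawl_frontpage, SamplePage and SampleLinkItem are hypothetical, and the fetcher is passed in as a callable so the example runs without network access.

```ruby
# Stand-ins for a fetched front page and its items.
SampleLinkItem = Struct.new(:title, :link)
SamplePage = Struct.new(:items)

# Calls fetch_article once for each distinct, non-nil item link and returns
# the results in page order. fetch_article is any callable taking a URL;
# with the gem loaded it could wrap Diffbot::Article.fetch(url, token).
def crawl_frontpage(frontpage, fetch_article)
  frontpage.items
    .map(&:link)
    .compact
    .uniq
    .map { |link| fetch_article.call(link) }
end

page = SamplePage.new([
  SampleLinkItem.new("One", "http://example.com/1"),
  SampleLinkItem.new("No link", nil),
  SampleLinkItem.new("One again", "http://example.com/1"),
  SampleLinkItem.new("Two", "http://example.com/2")
])

# A fake fetcher that just echoes the URL it was asked for.
fake_fetch = ->(url) { { url: url } }
articles = crawl_frontpage(page, fake_fetch)
```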

License

This is published under an MIT License, see LICENSE for further details.
