Smart AI Web Scraping with Nukitori and Rubium/Kimurai
Web scraping often means writing brittle XPath or CSS selectors that break whenever a website updates its layout. What if you could just describe the data you want, and let AI figure out how to extract it?
That’s exactly what Nukitori does. Combined with Rubium for browser automation, you get a powerful scraping setup that’s both flexible and maintainable.
How Nukitori Works
Nukitori uses an LLM to analyze HTML and generate XPath extraction rules. The clever part: it only calls the AI once per page type, then saves the generated schema for reuse. Subsequent extractions are pure XPath - fast and free.
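Conceptually, the same call behaves differently depending on whether the schema file already exists. A minimal sketch using the `Nukitori()` entry point covered below (here `html` and `new_html` are assumed to hold page source from your HTTP client or browser):

```ruby
# First run for a page type: the HTML and schema go to the LLM, which
# returns XPath rules that Nukitori caches in 'schema.json'.
data = Nukitori(html, 'schema.json') { string :title }

# Any later run with the same schema file skips the LLM and simply
# applies the cached XPath rules to fresh HTML.
data = Nukitori(new_html, 'schema.json') { string :title }
```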
Setup
Add the gems to your Gemfile:
```ruby
gem 'nukitori'
gem 'rubium'
```

Configure Nukitori with your LLM provider:

```ruby
require 'nukitori'
require 'rubium'
require 'json' # used later to write the results

Nukitori.configure do |config|
  config.default_model = 'gemini-3-flash-preview'
  config.gemini_api_key = '<GEMINI_API_KEY>'
end
```

Nukitori supports multiple providers: OpenAI, Anthropic, Gemini, DeepSeek, and any OpenAI-compatible API.
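Switching providers should only be a matter of configuration. As a sketch, an OpenAI setup might look like this - note that the config key and model name below are assumptions modeled on the Gemini keys above, so check Nukitori's README for the real names:

```ruby
# Hypothetical: `openai_api_key` and the model name are assumptions,
# not confirmed against Nukitori's actual configuration API.
Nukitori.configure do |config|
  config.default_model = 'gpt-4o-mini'
  config.openai_api_key = ENV['OPENAI_API_KEY']
end
```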
Scraping GitHub Search Results
Here’s a complete example that scrapes GitHub repository search results with automatic pagination - without writing a single CSS selector or XPath manually:
```ruby
browser = Rubium::Browser.new
next_page_url = 'https://github.com/search?q=kimurai&type=repositories'
repos = []

while next_page_url
  browser.visit(next_page_url)
  sleep 2

  data = Nukitori(browser.current_response, 'repos_schema.json') do
    string :next_page_url, description: 'Next page path url'
    array :repos do
      object do
        string :name
        string :description, description: 'Repository short description'
        string :url
        string :stars
        string :language
        array :tags, of: :string
      end
    end
  end

  repos.concat(data['repos'])
  next_page_url = data['next_page_url']
end

File.write('repos.json', JSON.pretty_generate(repos))
```

Understanding the Schema DSL
The block passed to `Nukitori()` defines the data structure you want to extract:
- `string :field_name` - extracts a text value
- `array :items do ... end` - extracts a list of objects
- `object do ... end` - defines a nested structure
- `description: '...'` - hints for the AI about what to look for
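To make the DSL concrete, here is a second, hypothetical schema built only from the constructs listed above. The field names and schema file name are illustrative, not from a real site, and `html` is assumed to hold page source:

```ruby
# Illustrative schema - the field names are made up; the DSL calls
# mirror the GitHub example in this post.
data = Nukitori(html, 'articles_schema.json') do
  string :page_title
  array :articles do
    object do
      string :headline
      string :author, description: 'Author display name'
      array :categories, of: :string
    end
  end
end
```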
On first run, Nukitori sends the HTML and schema to the LLM, which returns XPath rules. These rules are saved to `repos_schema.json` and reused for all subsequent pages.
How Pagination Works
Notice the `next_page_url` field in the schema. On the first page, Nukitori’s AI analyzes the HTML and extracts the XPath for the “Next” pagination link. This XPath gets saved to the schema file along with everything else.

On every subsequent page, the same XPath is applied to fresh HTML. As long as there’s a “Next” link on the page, `next_page_url` returns its href. When you reach the last page and there’s no “Next” element, the XPath returns `nil` - and the `while` loop exits.
Zero manual selectors. The AI figured out where the pagination link lives, and that knowledge is reused for the entire crawl.
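Concretely, the loop in the example above exits through a single assignment (the sample href in the comment is hypothetical):

```ruby
next_page_url = data['next_page_url']
# e.g. '/search?q=kimurai&type=repositories&p=2' on intermediate pages;
# nil on the last page, so `while next_page_url` falls through.
```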
Why This Approach
- Zero selectors - Describe what you want, not how to find it
- Fast after first run - Cached XPath rules mean no more API calls
- Self-handling pagination - Next page links are just another field in your schema
- Provider agnostic - Switch between OpenAI, Anthropic, or Gemini with one config change
The generated schema file is human-readable JSON. You can inspect it, tweak the XPaths manually if needed, or version control it alongside your scraper.
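Since the file is plain JSON, a quick way to review it without assuming anything about its internal field names:

```ruby
require 'json'

# Pretty-print the cached schema to review or hand-tune the XPaths.
schema = JSON.parse(File.read('repos_schema.json'))
puts JSON.pretty_generate(schema)
```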
Using Nukitori with Kimurai Framework
If you’re building production scrapers, Kimurai is a full-featured web scraping framework for Ruby. Nukitori is integrated as a first-class citizen - just use the `extract` helper method with the same schema DSL.
```ruby
require 'kimurai'

# Configure once in your application:
Kimurai.configure do |config|
  config.default_model = 'gemini-3-flash-preview'
  config.gemini_api_key = ENV['GEMINI_API_KEY']
end

# Then use `extract` method in any spider:
class GithubSpider < Kimurai::Base
  @start_urls = ["https://github.com/search?q=kimurai&type=repositories"]
  @engine = :chrome
  @delay = 2

  def parse(response, url:, data: {})
    data = extract(response) do
      string :next_page_url, description: 'Next page path url'
      array :repos do
        object do
          string :name
          string :description, description: 'Repository short description'
          string :url
          string :stars
          string :language
          array :tags, of: :string
        end
      end
    end

    save_to "results.json", data[:repos], format: :json

    if data[:next_page_url]
      request_to :parse, url: absolute_url(data[:next_page_url], base: url)
    end
  end
end

GithubSpider.crawl!
```

Same zero-selector approach, but with Kimurai’s built-in request queuing, rate limiting, data persistence, and browser management. The `extract` method handles schema caching automatically based on the spider name.