4 comments

  • bobajeff 15 hours ago
    I had to look up what KDL is and what `Functional Source License, Version 1.1, ALv2 Future License` is.

    So KDL is like another JSON or YAML. FSL-1.1-ALv2 is an almost, but not really, open source license that becomes available under a real open source license after two years. It's to prevent freeloading by companies or something. Sounds fine to me actually.

    • zachperkitny 14 hours ago
      Effectively, it's not meant to restrict people from using it, even in a commercial setting, just to protect my personal interests in what I want to do with it in a commercial setting.

      KDL is more than just JSON or YAML. It's node-based. Its output in libraries is effectively an AST, and its use cases are open-ended.
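
      To illustrate the "node-based" point, here is a minimal Python sketch of the shape a parsed KDL document takes: each node has a name, positional arguments, key=value properties, and child nodes. The `Node` class and its field names are hypothetical stand-ins, not the types of any real KDL library, and the tree is built by hand rather than parsed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed KDL node (not a real library type)."""
    name: str
    args: list = field(default_factory=list)      # positional values
    props: dict = field(default_factory=dict)     # key=value pairs
    children: list = field(default_factory=list)  # nested nodes


# Hand-built tree for a fragment of the Tadpole example in this thread:
#   main { new_page { redfin.search text="=text"; wait_until } }
tree = Node("main", children=[
    Node("new_page", children=[
        Node("redfin.search", props={"text": "=text"}),
        Node("wait_until"),
    ]),
])


def walk(node, depth=0):
    """Yield (depth, name) for every node, depth-first."""
    yield depth, node.name
    for child in node.children:
        yield from walk(child, depth + 1)


names = [name for _, name in walk(tree)]
```

      Because the result is a tree of named nodes rather than plain maps and lists, a tool can treat each node as an instruction to interpret, which is what makes a DSL on top of it practical.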

  • zachperkitny 16 hours ago
    Hello!

    I wanted to share my recent project: Tadpole. It is a custom DSL built on top of KDL specifically for web scraping and browser automation. I wanted there to be a standardized way of writing scrapers and reusing existing scraper logic. This was my solution.

    Why?

        Abstraction: Simulating realistic human behavior (bezier curves, easing) through high-level composed actions.
        Zero Config: Import and share scraper modules directly via Git, bypassing NPM/Registry overhead.
        Reusability: Actions and evaluators can be composed through slots to create more complex workflows.
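
    As a rough illustration of the "bezier curves, easing" idea, the Python sketch below generates mouse positions along a cubic Bezier curve and spaces them with a smoothstep easing function, so movement starts and ends slowly. This is an assumption about the general technique, not Tadpole's implementation; the function names and control points are made up for the example.

```python
def ease_in_out(t):
    """Smoothstep easing: slow start, faster middle, slow end (t in [0, 1])."""
    return t * t * (3 - 2 * t)


def cubic_bezier(p0, p1, p2, p3, t):
    """Return the (x, y) point on a cubic Bezier curve at parameter t."""
    u = 1 - t
    x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
    return (x, y)


def mouse_path(start, end, ctrl1, ctrl2, steps=20):
    """Sample steps + 1 points from start to end, eased along the curve."""
    return [
        cubic_bezier(start, ctrl1, ctrl2, end, ease_in_out(i / steps))
        for i in range(steps + 1)
    ]


# Hypothetical coordinates: move from (0, 0) to (300, 200) along a curved path.
path = mouse_path((0, 0), (300, 200), ctrl1=(100, 0), ctrl2=(200, 250))
```

    Each point in `path` would be fed to the browser as one small mouse move, so the cursor follows a curve at varying speed instead of jumping in a straight line.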
    
    
    Example

    This is a fully running example, @tadpole/cli is published on npm:

    tadpole run redfin.kdl --input '{"text": "Seattle, WA"}' --auto --output output.json

      import "modules/redfin/mod.kdl" repo="github.com/tadpolehq/community"
    
      main {
        new_page {
          redfin.search text="=text"
          wait_until
          redfin.extract_from_card extract_to="addresses" {
            address {
              redfin.extract_address_from_card
            }
          }
        }
      }
    
    
    Roadmap? Planned for 0.2.0

        Control Flow: Add maybe (effectively try/catch) and loop (while {}, do {})
        DOMPick: Used to select elements by index
        DOMFilter: Used to filter elements using evaluators
        More Evaluators: Type casting, regex, exists
        Root Slots: Support for top level dynamic placeholders
        Error Reporting: More robust error reporting
        Logging: More consistent logging from actions and add log action to global registry
    
    0.3.0

        Piping: Allowing different files to chain input/output.
        Outputs: Complex output sinks to databases, s3, kafka, etc.
        DAGs: Use directed acyclic graphs to create complex crawling scenarios and parallel compute.
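
    One way to read the DAG item: if crawl stages declare their dependencies, a topological sort can group them into batches whose members can run in parallel. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The stage names are hypothetical, and nothing here reflects how Tadpole plans to implement this.

```python
from graphlib import TopologicalSorter

# Hypothetical crawl stages: each key depends on the stages in its set.
deps = {
    "search": set(),
    "listing_pages": {"search"},
    "agent_pages": {"search"},
    "merge": {"listing_pages", "agent_pages"},
}


def parallel_batches(deps):
    """Group stages into batches; stages within a batch are independent."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark finished so dependents become ready
    return batches


batches = parallel_batches(deps)
```

    In this example the two page-scraping stages land in the same batch, so they could be dispatched concurrently once the search stage completes.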
    
    Github Repository: https://github.com/tadpolehq/tadpole

    I've also created a community repository for sharing scraper logic: https://github.com/tadpolehq/community

    Feedback would be greatly appreciated!

    • bobajeff 14 hours ago
      I like the idea of a DSL for scraping but my scrapers do more than extract text. I also download files (+monitor download progress) and intercept images (+ check for partial or failed to load images). So it seems my use case isn't really covered with this.
      • zachperkitny 14 hours ago
        Thanks for the idea actually! It's difficult to cover every use case in the 0.1.0 release. I'll take this into account. Downloading Files/Images could likely be abstracted into just an HTTP source and the data sources could be merged in some way.
  • olivia-banks 9 hours ago
    I really enjoy this as a concept! I run a service for away-ordering food from vendors at my University, and I need to do a lot of web-scraping to facilitate this. The code example on the website looks a lot cleaner than any of my BS4 code...! Even with the minimal primitives implemented, this seems useful.

    Most of my web-scraping code is in Python. Do you have plans to implement bindings to other languages?

    • zachperkitny 9 hours ago
      I'm glad you like the project!

      You know, that's an interesting idea. NodeJS does support WASM, and compiling Python to WASM is possible. I could add custom actions for importing and executing external code. We'll see, it's given me something to think about!

  • himujjal 3 hours ago
    You know a better dsl? jquery.
    • NilMostChill 9 minutes ago
      what is the specific domain of jquery?