GitHub - alash3al/scraply: Scraply a simple dom scraper to fetch information fro...

2 years ago

source link: https://github.com/alash3al/scraply?v=3.0.0

Simple Scraping Tool

Scraply is a very simple HTML scraping tool. If you know CSS and jQuery, you can use it!

Overview

You can use scraply within your stack via the CLI or over HTTP.

# here is the CLI usage

# extracting the title and the description from scraply github repo page
$ scraply extract \
    -u "https://github.com/alash3al/scraply" \
    -x title="select('title').text()" \
    -x description="select('meta[name=description]').attr('content')"

# same thing but with custom user agent
$ scraply extract \
    -u "https://github.com/alash3al/scraply" \
    -ua "OptionalCustomUserAgent" \
    -x title="select('title').text()" \
    -x description="select('meta[name=description]').attr('content')"

# same thing, but asking scraply to also return the response body for debugging purposes
$ scraply extract \
    --return-body \
    -u "https://github.com/alash3al/scraply" \
    -x title="select('title').text()" \
    -x description="select('meta[name=description]').attr('content')"
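
When scraply is called from another program, the repeated `-x name="expression"` pairs are easy to assemble from a mapping. The following is a minimal Python sketch, not part of scraply itself: the helper names `build_extract_command` and `run_extract` are hypothetical, it uses only the flags shown above (`-u`, `-ua`, `-x`, `--return-body`), and it assumes the `scraply` binary is on your `PATH`. The format of scraply's output is not assumed, so the raw text is returned unchanged.

```python
import subprocess


def build_extract_command(url, extractors, user_agent=None, return_body=False):
    """Build the argument list for `scraply extract` (hypothetical helper)."""
    cmd = ["scraply", "extract"]
    if return_body:
        cmd.append("--return-body")
    cmd += ["-u", url]
    if user_agent is not None:
        cmd += ["-ua", user_agent]
    # One -x flag per extractor, in the form name=expression.
    for name, expression in extractors.items():
        cmd += ["-x", f"{name}={expression}"]
    return cmd


def run_extract(url, extractors, **options):
    """Run scraply and return its stdout as text, without parsing it."""
    result = subprocess.run(
        build_extract_command(url, extractors, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with expressions such as `select('meta[name=description]').attr('content')`.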

For HTTP usage, run the HTTP server, then use any HTTP client to interact with it.

# running the http server
# by default it listens on address ":8010", which is equivalent to "0.0.0.0:8010"
# for more information execute `$ scraply help`
$ scraply serve

# then, in another shell, execute the following curl command
$ curl http://localhost:8010/extract \
    -H "Content-Type: application/json" \
    -s \
    -d '{"url": "https://github.com/alash3al/scraply", "extractors": {"title": "$(\"title\").text()"}, "return_body": false, "user_agent": "CustomUserAgent"}'
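
The same request can be sent from code. Below is a minimal Python sketch using only the standard library. It mirrors the JSON fields from the curl command above (`url`, `extractors`, `return_body`, `user_agent`). The function names `build_extract_request` and `extract` are hypothetical, and the sketch assumes the server answers with a JSON body, which the text above does not specify.

```python
import json
import urllib.request


def build_extract_request(url, extractors, return_body=False, user_agent=None,
                          endpoint="http://localhost:8010/extract"):
    """Build a POST request equivalent to the curl example (hypothetical helper)."""
    payload = {"url": url, "extractors": extractors, "return_body": return_body}
    if user_agent is not None:
        payload["user_agent"] = user_agent
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract(url, extractors, **options):
    """Send the request and decode the reply, assuming it is JSON."""
    request = build_extract_request(url, extractors, **options)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With `scraply serve` running, a call such as `extract("https://github.com/alash3al/scraply", {"title": '$("title").text()'})` sends the same body as the curl command, minus the optional user agent.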

Download?

You can go to the releases page and pick the latest version, or run it with Docker:

$ docker run --rm -it ghcr.io/alash3al/scraply scraply help

Contribution?

Of course you can contribute. How?

Clone the repo
Create your fix/feature branch
Create a pull request

Nothing else, enjoy!

About

I'm Mohamed Al Ashaal, a software engineer :)
