GitHub - slashbit/spider-less: Web spider as a service, spider on serverless, th...
source link: https://github.com/slashbit/spider-less
README.md
spider-less
Web spider on Serverless!
About Spiderless
Spiderless is the backend layer of KMPPP, a web-spider-as-a-service application that lets you monitor nearly anything on the web and get notified when it changes. It is built on the following technologies:
| Technology | Used For |
| --- | --- |
| Bulma, Buefy | UI |
| Vue.js | Front-end logic |
| AWS S3 | Website hosting |
| AWS Lambda | Backend API |
| AWS SNS | Message queue |
| AWS DynamoDB | Database |
| AWS API Gateway | API gateway |
| AWS Cloudfront | CDN |
| AWS Route 53 | DNS |

API Endpoints
GET
subscriptions
Description
Get a list of subscriptions. The response holds at most 1 MB of data, the limit DynamoDB imposes on a single Scan or Query operation.
Parameters
None
Request
curl /api/subscriptions
Response
[
  {
    "createdAt": 1544833435070,
    "targets": [
      {
        "selector": "#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span",
        "label": "ratingCount"
      }
    ],
    "id": "b4d98de0-ffff-11e8-a4c9-9b9ee9089058",
    "url": "https://www.imdb.com/title/tt0111161/",
    "interval": 60
  }
]
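Because DynamoDB caps a single read at 1 MB, anything that needs every subscription would have to page through the results. The following is a minimal Python sketch of that idea and is not taken from the Spiderless source. `scan_page` is a hypothetical callable standing in for a DynamoDB scan (for example boto3's `Table.scan`), whose response carries `Items` and, when more data remains, a `LastEvaluatedKey`. `fake_scan` is a stub used only to make the example runnable.

```python
from typing import Callable, Optional


def collect_all(scan_page: Callable[[Optional[dict]], dict]) -> list:
    """Follow LastEvaluatedKey until the scan is exhausted."""
    items = []
    start_key = None
    while True:
        page = scan_page(start_key)
        items.extend(page.get("Items", []))
        # DynamoDB omits LastEvaluatedKey on the final page.
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            return items


def fake_scan(start_key: Optional[dict]) -> dict:
    """Stub (assumption): two pages of fake results instead of a real table."""
    if start_key is None:
        return {"Items": [{"id": "a"}], "LastEvaluatedKey": {"id": "a"}}
    return {"Items": [{"id": "b"}]}


all_items = collect_all(fake_scan)
```

With a real table, `scan_page` could wrap a call that passes the start key as `ExclusiveStartKey` when it is not `None`.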
POST
subscriptions
Description
Create a new subscription to feed the spider.
Parameters
- url (required) - URL of the target website
- targets (required) - List of objects, each pairing a label with the CSS selector whose text content should be extracted
- interval (required) - The interval (in minutes) between scrapes
Request
curl -X POST /api/subscriptions -d '{"url":"https://www.imdb.com/title/tt0111161/","targets":"[{\"label\":\"ratingCount\",\"selector\":\"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span\"}]","interval":"60"}' -H "Content-Type: application/json"
Response
{
  "id": "ef417d30-ffff-11e8-a4c9-9b9ee9089058",
  "url": "https://www.imdb.com/title/tt0111161/",
  "targets": [
    {
      "label": "ratingCount",
      "selector": "#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span"
    }
  ],
  "interval": 60,
  "createdAt": 1544833533059,
  "updatedAt": 1544833533059
}
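As a rough illustration of the three required parameters, the Python sketch below builds a request body shaped like the curl example above, where `targets` is sent as a JSON-encoded string and `interval` as a string. The function name `build_subscription_body` and its validation rules are assumptions made for this example, not part of the Spiderless API.

```python
import json


def build_subscription_body(url: str, targets: list, interval: int) -> str:
    """Validate the required fields and return a JSON request body."""
    if not url:
        raise ValueError("url is required")
    if not targets:
        raise ValueError("targets is required")
    for target in targets:
        if "label" not in target or "selector" not in target:
            raise ValueError("each target needs a label and a selector")
    if int(interval) <= 0:
        raise ValueError("interval must be a positive number of minutes")
    # Mirror the documented request: targets and interval travel as strings.
    return json.dumps({
        "url": url,
        "targets": json.dumps(targets),
        "interval": str(interval),
    })


body = build_subscription_body(
    "https://www.imdb.com/title/tt0111161/",
    [{"label": "ratingCount", "selector": "div.imdbRating > a > span"}],
    60,
)
```

The resulting string could be passed to any HTTP client as the body of a POST with `Content-Type: application/json`.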
DELETE
subscriptions
Description
Delete a subscription.
Parameters
- id (required) - Subscription id
Request
curl -X DELETE /api/subscriptions/:id
Response
{ "id": "d72c05d0-ffff-11e8-a4c9-9b9ee9089058" }
Functions List
scrape
Description
Scrape target websites and extract target contents.
Invoke
yarn invoke:local scrape -d '{"createdAt":1544833435070,"updatedAt":1544833435070,"targets":[{"selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span","label":"ratingCount"}],"id":"b4d98de0-ffff-11e8-a4c9-9b9ee9089058","url":"https://www.imdb.com/title/tt0111161/","interval":60}'
Response
[ { "label": "ratingCount", "content": "2,025,796" } ]
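To make the input and output shapes concrete, here is a deliberately simplified Python sketch of the extraction step. It is not the Spiderless implementation: it uses only the standard library and, as a stated simplification, supports only `#id` selectors rather than full CSS selectors like the one in the example above. The names `IdTextExtractor` and `scrape` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IdTextExtractor(HTMLParser):
    """Collect the text inside the element whose id matches element_id."""

    def __init__(self, element_id: str):
        super().__init__()
        self.element_id = element_id
        self.depth = 0  # greater than 0 while inside the matched element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def scrape(html: str, targets: list) -> list:
    """Return one {label, content} entry per target (only '#id' selectors)."""
    results = []
    for target in targets:
        parser = IdTextExtractor(target["selector"].lstrip("#"))
        parser.feed(html)
        results.append({"label": target["label"],
                        "content": "".join(parser.chunks).strip()})
    return results


page = '<div><span id="rating">2,025,796</span></div>'
result = scrape(page, [{"label": "ratingCount", "selector": "#rating"}])
```

A real scraper would fetch the page over HTTP and use a full CSS selector engine; the sketch only shows how targets map to labeled results.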
cron
Description
Fetch subscriptions from the database and select the ones that are due to be executed.
Invoke
yarn invoke:local cron
Response
None
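The README does not say how a subscription is judged to be due, so the Python sketch below shows just one plausible rule. It assumes that `updatedAt` (falling back to `createdAt`) records the last run in epoch milliseconds and that `interval` is in minutes. The function names `is_due` and `due_subscriptions` are hypothetical.

```python
MS_PER_MINUTE = 60_000


def is_due(subscription: dict, now_ms: int) -> bool:
    """Assumed rule: due once `interval` minutes have passed since the last run."""
    last_run_ms = subscription.get("updatedAt", subscription["createdAt"])
    return now_ms - last_run_ms >= subscription["interval"] * MS_PER_MINUTE


def due_subscriptions(subscriptions: list, now_ms: int) -> list:
    """Keep only the subscriptions that should be scraped now."""
    return [s for s in subscriptions if is_due(s, now_ms)]


subscription = {
    "id": "b4d98de0-ffff-11e8-a4c9-9b9ee9089058",
    "createdAt": 1544833435070,
    "interval": 60,
}
too_early = due_subscriptions([subscription], 1544833435070 + 59 * MS_PER_MINUTE)
on_time = due_subscriptions([subscription], 1544833435070 + 60 * MS_PER_MINUTE)
```

Under this rule the subscription is skipped 59 minutes after creation and selected at the 60-minute mark.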
Development
# install dependencies
yarn install
# start api server on port 8090
yarn start
# invoke function locally
yarn invoke:local function_name
# invoke remote function
yarn invoke cron function_name
Deploy
# first set up your AWS credentials: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
yarn deploy