
GitHub - slashbit/spider-less: Web spider as a service, spider on serverless, th...

source link: https://github.com/slashbit/spider-less


spider-less

Web spider on Serverless!

About Spiderless

Spiderless is the backend layer of KMPPP, a web-spider-as-a-service application that lets you monitor, and get notified about, nearly anything on the web. It is built on these technologies:

Technology       Used For
---------------  ---------------
Bulma, Buefy     UI
Vue.js           Front-end logic
AWS S3           Website hosting
AWS Lambda       Backend API
AWS SNS          Message queue
AWS DynamoDB     Database
AWS API Gateway  API gateway
AWS CloudFront   CDN
AWS Route 53     DNS

API Endpoints

GET subscriptions

Description

Get a list of subscriptions. The response holds at most 1 MB of data, which is the most DynamoDB returns from a single Scan or Query call.

Parameters

None

Request

curl /api/subscriptions

Response

[
  {
    "createdAt": 1544833435070,
    "targets": [
      {
        "selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span",
        "label":"ratingCount"
      }
    ],
    "id": "b4d98de0-ffff-11e8-a4c9-9b9ee9089058",
    "url": "https://www.imdb.com/title/tt0111161/",
    "interval": 60
  }
]
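
A client reading this response might map it onto small typed objects. The sketch below is illustrative only: the class and function names are hypothetical, the selector in the sample is shortened, and it assumes `createdAt` is milliseconds since the Unix epoch (which is consistent with the sample value, but not stated in this README).

```python
import json
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass
class Target:
    label: str
    selector: str


@dataclass
class Subscription:
    id: str
    url: str
    interval: int  # minutes between scrapes
    targets: List[Target]
    created_at: datetime


def parse_subscriptions(body: str) -> List[Subscription]:
    """Turn the JSON array returned by GET /api/subscriptions into objects."""
    subscriptions = []
    for item in json.loads(body):
        subscriptions.append(
            Subscription(
                id=item["id"],
                url=item["url"],
                interval=int(item["interval"]),
                targets=[Target(t["label"], t["selector"]) for t in item["targets"]],
                # Assumption: createdAt is milliseconds since the Unix epoch.
                created_at=EPOCH + timedelta(milliseconds=item["createdAt"]),
            )
        )
    return subscriptions


# Sample body modelled on the response above (selector shortened for brevity).
sample = """[{"createdAt": 1544833435070,
  "targets": [{"selector": "div.imdbRating > a > span", "label": "ratingCount"}],
  "id": "b4d98de0-ffff-11e8-a4c9-9b9ee9089058",
  "url": "https://www.imdb.com/title/tt0111161/",
  "interval": 60}]"""

subs = parse_subscriptions(sample)
print(subs[0].url, subs[0].created_at.isoformat())
```

Building the timestamp from an integer `timedelta` avoids floating-point rounding of the millisecond part.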

POST subscriptions

Description

Create a new subscription to feed the spider.

Parameters

  • url (required) - URL of the target website
  • targets (required) - List of targets, each pairing a label with the CSS selector whose text content should be extracted
  • interval (required) - The interval (in minutes) between scrapes

Request

curl -X POST /api/subscriptions -d '{"url":"https://www.imdb.com/title/tt0111161/","targets":"[{\"label\":\"ratingCount\",\"selector\":\"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span\"}]","interval":"60"}' -H "Content-Type: application/json"

Response

{
  "id": "ef417d30-ffff-11e8-a4c9-9b9ee9089058",
  "url": "https://www.imdb.com/title/tt0111161/",
  "targets": [
    {
      "label":"ratingCount",
      "selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span"
    }
  ],
  "interval": 60,
  "createdAt": 1544833533059,
  "updatedAt": 1544833533059
}
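
The POST request in this section could also be assembled from a script. The following is a minimal sketch using only the Python standard library. `BASE_URL` and `build_create_request` are hypothetical names, since this README does not say where the API is hosted, and the client-side checks are an assumption rather than the server's documented validation. The body mirrors the curl example, in which `targets` travels as a JSON-encoded string and `interval` as a string.

```python
import json
import urllib.request

# Hypothetical base URL: the README does not say where the API is hosted.
BASE_URL = "https://example.invalid"


def build_create_request(url, targets, interval):
    """Build (but do not send) a POST /api/subscriptions request."""
    # Client-side checks only; the server's own rules are not documented here.
    if not url:
        raise ValueError("url is required")
    if not targets:
        raise ValueError("targets is required")
    if int(interval) <= 0:
        raise ValueError("interval must be a positive number of minutes")

    # Mirror the curl example: targets is sent as a JSON-encoded string
    # and interval as a string.
    payload = {
        "url": url,
        "targets": json.dumps(targets),
        "interval": str(interval),
    }
    return urllib.request.Request(
        BASE_URL + "/api/subscriptions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_request(
    "https://www.imdb.com/title/tt0111161/",
    [{"label": "ratingCount", "selector": "div.imdbRating > a > span"}],
    60,
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect before anything goes over the network.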

DELETE subscriptions

Description

Delete a subscription.

Parameters

  • id (required) - Subscription id

Request

curl -X DELETE /api/subscriptions/:id

Response

{
  "id": "d72c05d0-ffff-11e8-a4c9-9b9ee9089058"
}

Functions List

scrape

Description

Scrape target websites and extract target contents.

Invoke

yarn invoke:local scrape -d '{"createdAt":1544833435070,"updatedAt":1544833435070,"targets":[{"selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span","label":"ratingCount"}],"id":"b4d98de0-ffff-11e8-a4c9-9b9ee9089058","url":"https://www.imdb.com/title/tt0111161/","interval":60}'

Response

[
  {
    "label": "ratingCount",
    "content": "2,025,796"
  }
]

cron

Description

Fetch subscriptions from the database and select the ones that are due to be executed.

Invoke

yarn invoke:local cron

Response

None
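
How cron decides which subscriptions are due is not documented in this README. As one possible reading, the sketch below treats a subscription as due when a whole number of intervals has elapsed since `createdAt`, assuming the function is triggered once per minute and that `createdAt` is milliseconds since the Unix epoch. The `is_due` and `select_due` helpers are hypothetical and are not taken from the project's code.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def is_due(subscription, now):
    """Return True when a whole number of intervals has passed since creation."""
    # Assumption: createdAt is milliseconds since the Unix epoch.
    created = EPOCH + timedelta(milliseconds=subscription["createdAt"])
    elapsed_minutes = int((now - created).total_seconds() // 60)
    return elapsed_minutes >= 0 and elapsed_minutes % subscription["interval"] == 0


def select_due(subscriptions, now):
    """Keep only the subscriptions that should be scraped at `now`."""
    return [s for s in subscriptions if is_due(s, now)]


subscriptions = [
    {"id": "a", "createdAt": 1544833435070, "interval": 60},
    {"id": "b", "createdAt": 1544833435070, "interval": 7},
]
created = EPOCH + timedelta(milliseconds=1544833435070)

# Two hours after creation: the hourly subscription is due, the 7-minute one is not.
due = select_due(subscriptions, created + timedelta(minutes=120))
print([s["id"] for s in due])
```

This rule is stateless, so it needs no "last run" field, but it depends on the trigger firing every minute; a real scheduler might track the last execution time instead.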

Development

# install dependencies
yarn install

# start the API server on port 8090
yarn start

# invoke function locally
yarn invoke:local function_name

# invoke remote function
yarn invoke function_name

Deploy

# first set up your AWS credentials: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
yarn deploy
