Skraper

~~Here should be some modern logo~~

Overview

A Kotlin/JVM library and cli tool for scraping and downloading posts, attachments, and other metadata from more than 10 sources, without any authorization or full page rendering. Based on coroutines and JSoup.

Repository contains:

Kotlin library
Cli tool
Telegram bot

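Current list of implemented sources (named as in the cli tool's PROVIDER options): facebook, instagram, twitter, youtube, twitch, reddit, ninegag, pinterest, flickr, tumblr, ifunny, vk, pikabu.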

Unfortunately, any website may change without notice, which can cause the tool to work incorrectly. If that happens, please let me know via an issue or a message.

Cli tool

The cli tool allows you to:

download media from almost all of the listed sources with the --media-only flag
scrape post metadata

Requirements:

Java: 1.8+
Maven (optional)

Build the tool:

./mvnw clean package -DskipTests=true

Usage:

./skraper --help

usage: [-h] PROVIDER PATH [-n LIMIT] [-t TYPE] [-o OUTPUT] [-m]
       [--parallel-downloads PARALLEL_DOWNLOADS]

optional arguments:
  -h, --help                                show this help message and exit

  -n LIMIT, --limit LIMIT                   posts limit (50 by default)

  -t TYPE, --type TYPE                      output type, options: [log, csv, json, xml, yaml]

  -o OUTPUT, --output OUTPUT                output path

  -m, --media-only                          scrape media only

  --parallel-downloads PARALLEL_DOWNLOADS   amount of parallel downloads for media items if
                                            enabled flag --media-only (4 by default)


positional arguments:
  PROVIDER                                  skraper provider, options: [facebook, instagram,
                                            twitter, youtube, twitch, reddit, ninegag, pinterest,
                                            flickr, tumblr, ifunny, vk, pikabu]

  PATH                                      path to user/community/channel/topic/trend

Examples:

./skraper ninegag /hot 
./skraper reddit /r/memes -n 5 -t csv -o ./reddit/posts
./skraper youtube /user/JetBrainsTV/videos --media-only -n 2

Kotlin Library

Distribution

Maven:

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>com.github.sokomishalov.skraper</groupId>
        <artifactId>skrapers</artifactId>
        <version>0.5.1</version>
    </dependency>
</dependencies>

Gradle kotlin dsl:

repositories {
    maven { url = uri("https://jitpack.io") }
}
dependencies {
    implementation("com.github.sokomishalov.skraper:skrapers:0.5.1")
}

Usage

For examples of library usage, take a look at the android sample app or the telegram bot.

Instantiate specific scraper

Pick one of the provider implementations, for example InstagramSkraper. After that, usage is as simple as:

val skraper = InstagramSkraper(client = OkHttpSkraperClient())

Important: using DefaultBlockingSkraperClient is strongly discouraged. There are more efficient, non-blocking, and resource-friendly implementations of SkraperClient. To use one of them, you just have to put the required dependencies on the classpath.

Current http-client implementation list:

Available methods

Each scraper is a class that implements the Skraper interface:

interface Skraper {
    val baseUrl: URLString
    val client: SkraperClient get() = DefaultBlockingSkraperClient
    suspend fun getProviderInfo(): ProviderInfo?
    suspend fun getPageInfo(path: String): PageInfo?
    suspend fun getPosts(path: String, limit: Int = DEFAULT_POSTS_LIMIT): List<Post>
    suspend fun resolve(media: Media): Media
}

There are also provider-specific Kotlin extensions for the implementations. You can find them in each provider's implementation package.
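To illustrate how such an extension can sit on top of getPosts(), here is a minimal, self-contained sketch. It does not use the library: DemoPost, DemoSkraper, FakeSkraper and userPath are hypothetical stand-ins, the "/user/..." path layout is an assumption made for illustration, and the default limit of 50 is borrowed from the cli tool's default.

```kotlin
// Hypothetical stand-ins: a cut-down post type and interface, not the library's own classes.
data class DemoPost(val id: String, val text: String)

interface DemoSkraper {
    suspend fun getPosts(path: String, limit: Int = 50): List<DemoPost>
}

// Fake provider that returns canned data instead of doing any network I/O.
class FakeSkraper : DemoSkraper {
    override suspend fun getPosts(path: String, limit: Int): List<DemoPost> =
        (1..limit).map { DemoPost(id = "$path#$it", text = "post $it") }
}

// Assumed path layout, for illustration only.
fun userPath(username: String): String = "/user/$username"

// The extension only translates a username into a path and delegates to getPosts().
suspend fun FakeSkraper.getUserPosts(username: String, limit: Int = 50): List<DemoPost> =
    getPosts(path = userPath(username), limit = limit)

suspend fun main() {
    val posts = FakeSkraper().getUserPosts(username = "memes", limit = 2)
    posts.forEach { println(it) }
}
```

Keeping the path translation in an extension leaves the interface small while still giving each provider convenient, typed entry points.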

Scrape user/community/channel/topic/trend posts

To scrape the latest posts for a specific user, channel, or trend, use a skraper like this:

suspend fun main() {
    val skraper = FacebookSkraper()
    val posts = skraper.getUserPosts(username = "memes", limit = 2) // extension for getPosts()
    println(JsonMapper().writerWithDefaultPrettyPrinter().writeValueAsString(posts))
}

The returned data has a similar structure for every provider. Output data example:

[
  {
    "id" : "5029851093699104",
    "text" : "gotta love em!",
    "publishedAt" : 1580744400000,
    "rating" : 79,
    "commentsCount" : 3,
    "media" : [ {
      "url" : "https://facebook.com/memes/posts/5029851093699104?__xts__%5B0%5D=68.ARA2yRI2YnlXQRKX7Pdphh8ztgvnP11aYE_bZFPNmqLpJZLhwJaG24gDPUTiKDLv-J_E09u2vLjCXalpmEuGSmVR0BkVtcng_i6QV8x5e-aZUv0Mkn1wwKLlhp5NNH6zQWKlqDqRjZrwvcKeUi0unzzulRCHRvDIrbz2leM6PLescFySwMYbMmKFc7ctqaC_F7nJ09Ya0lz9Pqaq_Rh6UsNKom6fqdgHAuoHV894a3QRuyY0BC6fQuXZLOLbRIfEVK3cF9Z5UQiXUYruCySF-WpQEV0k72x6DIjT6B3iovYFnBGHaji9VAx2PByZ-MDs33D1Hz96Mk-O1Pj7zBwO6FvXGhkUJgepiwUOVd0q-pV83rS5EhjtPFDylNoNO2xkDUSIi483p49vumVPWtmab8LX1V6w2anf55kh6pedCXcH3D8rBjz8DaTBnv995u9kk5im-1-HdAGQHyKrCZpaA0QyC-I4oGsCoIJGck3RO8u_SoHcfe2tKjTgPe6j9p1D&__tn__=-R",
      "aspectRatio" : 0.864,
      "duration" : 10860.000000000
    } ]
  }, {
    "id" : "4990218157662398",
    "text" : "Interesting",
    "publishedAt" : 1580742000000,
    "rating" : 3092,
    "commentsCount" : 514,
    "media" : [ {
      "url" : "https://scontent.fhrk1-1.fna.fbcdn.net/v/t1.0-0/p526x296/52333452_10157743612509879_529328953723191296_n.png?_nc_cat=1&_nc_ohc=oNMb8_mCbD8AX-w9zeY&_nc_ht=scontent.fhrk1-1.fna&oh=ca8a719518ecfb1a24f871282b860124&oe=5E910D0C",
      "aspectRatio" : 0.8960573476702509
    } ]
  }
]

You can see the full model structure for posts and others here
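Once posts are scraped, they can be processed like any other Kotlin collection. The sketch below is self-contained and does not use the library: PostSummary is a hypothetical stand-in that mirrors a few fields of the JSON above, and it assumes publishedAt is a Unix timestamp in milliseconds, which the sample values are consistent with.

```kotlin
import java.time.Instant

// Hypothetical stand-in mirroring a few fields from the JSON above; not the library's model class.
data class PostSummary(
    val id: String,
    val publishedAt: Long, // assumed to be epoch milliseconds
    val rating: Int,
    val commentsCount: Int
)

// Convert the raw timestamp into a java.time.Instant.
fun publishedInstant(post: PostSummary): Instant = Instant.ofEpochMilli(post.publishedAt)

// Pick the post with the highest rating, or null for an empty list.
fun topRated(posts: List<PostSummary>): PostSummary? = posts.maxByOrNull { it.rating }

fun main() {
    val posts = listOf(
        PostSummary("5029851093699104", 1580744400000, 79, 3),
        PostSummary("4990218157662398", 1580742000000, 3092, 514)
    )
    posts.forEach { println("${it.id} published at ${publishedInstant(it)}") }
    println("top rated: ${topRated(posts)?.id}")
}
```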

Scrape user/community/channel/topic/trend info

It is also possible to scrape user/channel/trend info:

suspend fun main() {
    val skraper = TwitterSkraper()
    val pageInfo = skraper.getUserInfo(username = "memes") // extension for `getPageInfo()`
    println(JsonMapper().writerWithDefaultPrettyPrinter().writeValueAsString(pageInfo))
}

Output:

{
  "nick" : "memes",
  "name" : "Memes.com",
  "description" : "http://memes.com is your number one website for the funniest content on the web. You will find funny pictures, funny memes and much more.",
  "postsCount" : 10848,
  "followersCount" : 154718,
  "avatarsMap" : {
    "SMALL" : {
      "url" : "https://pbs.twimg.com/profile_images/824808708332941313/mJ4xM6PH_normal.jpg"
    },
    "MEDIUM" : {
      "url" : "https://pbs.twimg.com/profile_images/824808708332941313/mJ4xM6PH_normal.jpg"
    },
    "LARGE" : {
      "url" : "https://pbs.twimg.com/profile_images/824808708332941313/mJ4xM6PH_normal.jpg"
    }
  },
  "coversMap" : {
    "SMALL" : {
      "url" : "https://abs.twimg.com/images/themes/theme1/bg.png"
    },
    "MEDIUM" : {
      "url" : "https://abs.twimg.com/images/themes/theme1/bg.png"
    },
    "LARGE" : {
      "url" : "https://abs.twimg.com/images/themes/theme1/bg.png"
    }
  }
}
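The avatarsMap and coversMap fields are keyed by size. If a consumer cannot rely on every size being present, a small fallback helper may be useful. The sketch below is self-contained and hypothetical: Size, Img and pickImage are illustrative names, not part of the library.

```kotlin
// Hypothetical stand-ins for the size keys and image entries seen in the JSON above.
enum class Size { SMALL, MEDIUM, LARGE }
data class Img(val url: String)

// Return the image for the preferred size, else the nearest larger one, else the nearest smaller one.
fun pickImage(images: Map<Size, Img>, preferred: Size): Img? {
    val sizes = Size.values().toList()
    val sameOrLarger = sizes.filter { it >= preferred }
    val smaller = sizes.filter { it < preferred }.reversed()
    return (sameOrLarger + smaller).mapNotNull { images[it] }.firstOrNull()
}

fun main() {
    // Only a small avatar is available, so asking for LARGE falls back to it.
    val avatars = mapOf(Size.SMALL to Img("https://example.com/small.jpg"))
    println(pickImage(avatars, Size.LARGE))
}
```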

Resolve provider relative url

Sometimes you need to know the direct media link:

suspend fun main() {
    val skraper = InstagramSkraper()
    val info = skraper.resolve(Video(url = "https://www.instagram.com/p/B-flad2F5o7/"))
    println(JsonMapper().writerWithDefaultPrettyPrinter().writeValueAsString(info))
}

Output:

{
  "url" : "https://scontent-amt2-1.cdninstagram.com/v/t50.2886-16/91508191_213297693225472_2759719910220905597_n.mp4?_nc_ht=scontent-amt2-1.cdninstagram.com&_nc_cat=104&_nc_ohc=27bC52qar_oAX-7J2Zh&oe=5EC0BC52&oh=0aafee2860c540452b76e7b8e336147d",
  "aspectRatio" : 0.8010012515644556,
  "thumbnail" : {
    "url" : "https://scontent-amt2-1.cdninstagram.com/v/t51.2885-15/e35/91435498_533808773845524_5302421141680378393_n.jpg?_nc_ht=scontent-amt2-1.cdninstagram.com&_nc_cat=100&_nc_ohc=8gPAcByc6YAAX_kDBWm&oh=5edf6b9d90d606f9c0e055b7dbcbfa45&oe=5EC0DDE8",
    "aspectRatio" : 0.8010012515644556
  }
}

Download media

There is a "static" method that downloads any media from all known implemented sources:

suspend fun main() {
    val tmpDir = Files.createTempDirectory("skraper").toFile()

    val testVideo = Skraper.download(
            media = Video("https://youtu.be/fjUO7xaUHJQ"),
            destDir = tmpDir,
            filename = "Gandalf"
    )

    val testImage = Skraper.download(
            media = Image("https://www.pinterest.ru/pin/89509111320495523/"),
            destDir = tmpDir,
            filename = "Do_no_harm"
    )

    println(testVideo)
    println(testImage)
}

Output:

/var/folders/sf/hm2h5chx5fl4f70bj77xccsc0000gp/T/skraper8377953374796527777/Gandalf.mp4
/var/folders/sf/hm2h5chx5fl4f70bj77xccsc0000gp/T/skraper8377953374796527777/Do_no_harm.jpg

Scrape provider logo

It is also possible to scrape provider info:

suspend fun main() {
    val skraper = InstagramSkraper()
    val info = skraper.getProviderInfo()
    println(JsonMapper().writerWithDefaultPrettyPrinter().writeValueAsString(info))
}

Output:

{
  "name" : "Instagram",
  "logoMap" : {
    "SMALL" : {
      "url" : "https://instagram.com/favicon.ico"
    },
    "MEDIUM" : {
      "url" : "https://instagram.com/favicon.ico"
    },
    "LARGE" : {
      "url" : "https://instagram.com/favicon.ico"
    }
  }
}

Telegram bot

To use the bot, follow the link. You can also have a look at the bot's main logic code.
