
GitHub - eukaryote31/openwebtext: An open clone of the GPT-2 WebText dataset by...

 5 years ago
source link: https://github.com/eukaryote31/openwebtext


OpenWebText

This project is a clone of the GPT-2 WebText dataset as outlined in the OpenAI paper. It is still very much a work in progress.

Dependencies

Pipenv, Python 3

To install Python dependencies:

pipenv install

Newspaper Dependencies:

On Ubuntu:

sudo apt-get install libxml2-dev libxslt-dev

On OS X:

brew install libxml2 libxslt

Usage

  1. Get the list of URLs from Reddit:
pipenv run python get_urls.py
  2. Download data from the URLs:
pipenv run python download.py

The resulting files are written to data/ and named {domain}-{sha256 hash of url}.txt.
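
To illustrate the naming scheme, here is a minimal sketch of how a URL could map to an output path. It is not taken from download.py: the helper name `output_path` is hypothetical, and it assumes that the domain is the URL's network location and that the hash is the hex SHA-256 digest of the UTF-8 encoded URL. The actual script may normalize URLs or extract the domain differently.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def output_path(url: str, data_dir: str = "data") -> Path:
    """Build a path of the form data/{domain}-{sha256 hash of url}.txt.

    Hypothetical helper: the domain and hash conventions here are assumptions,
    not confirmed behavior of download.py.
    """
    # Assumption: "domain" means the network location, e.g. "example.com".
    domain = urlparse(url).netloc
    # Assumption: the hash is the hex digest of the UTF-8 encoded URL string.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(data_dir) / f"{domain}-{digest}.txt"


if __name__ == "__main__":
    print(output_path("https://example.com/some/article"))
```

Hashing the full URL keeps file names unique per page while the domain prefix keeps files from the same site grouped together when the directory is sorted.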

Enjoy!

