

GitHub - codertimo/BERT-pytorch: Google AI BERT 2018 pytorch implementation
source link: https://github.com/codertimo/BERT-pytorch

BERT-pytorch
PyTorch implementation of Google AI's 2018 BERT, with simple annotations
BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Paper URL : https://arxiv.org/abs/1810.04805
Introduction
Google AI's BERT paper reports strong results on a variety of NLP tasks (new state of the art on 11 NLP tasks), including outperforming the human F1 score on the SQuAD v1.1 QA task. The paper showed that a Transformer (self-attention) based encoder can serve as a powerful alternative to previous language models when combined with a proper language model training method. More importantly, it showed that this pre-trained language model can be transferred to a wide range of NLP tasks without building a task-specific model architecture.
This result will likely be recorded in NLP history, and I expect many further papers about BERT to be published very soon.
This repo is an implementation of BERT. The code is simple and quick to understand. Some of the code is based on The Annotated Transformer.
This project is currently a work in progress, and the code has not been verified yet.
Language Model Pre-training
In the paper, the authors present two new language model training methods: "masked language model" and "predict next sentence".
Masked Language Model
Original Paper : 3.3.1 Task #1: Masked LM
Input Sequence : The man went to [MASK] store with [MASK] dog
Target Sequence : the his
Rules:
15% of the input tokens are randomly selected and changed according to the following sub-rules:
- 80% of the selected tokens are replaced with the [MASK] token.
- 10% of the selected tokens are replaced with a [RANDOM] token (another word).
- 10% of the selected tokens are left unchanged, but still need to be predicted.
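The masking rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in this repo: the function name `mask_tokens`, the `select_prob` parameter, and the toy vocabulary are hypothetical, and the [RANDOM] replacement is drawn uniformly from the vocabulary (so it can occasionally pick the original word).

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply the 15% selection and 80/10/10 replacement rules (hypothetical helper).

    Returns (masked_tokens, labels). labels[i] holds the original token at
    selected positions and None at positions that are not predicted.
    """
    masked, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            # Not selected: keep the token and do not predict it.
            masked.append(token)
            labels.append(None)
            continue
        # Selected tokens are always prediction targets.
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK_TOKEN)         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random word
        else:
            masked.append(token)              # 10%: keep the token unchanged
    return masked, labels


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = "the man went to the store with his dog".split()
vocab = ["the", "man", "went", "to", "store", "with", "his", "dog", "milk"]
masked, labels = mask_tokens(tokens, vocab, rng)
print(masked)
print(labels)
```

Keeping the labels aligned with the input positions makes it easy to compute the loss only where a label is present.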
Predict Next Sentence
Original Paper : 3.3.2 Task #2: Next Sentence Prediction
Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]
Label : IsNext
Input : [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label : NotNext
"Can this sentence be continuously connected to the previous one?"
The goal is to understand the relationship between two text sentences, which is not directly captured by language modeling.
Rules:
- 50% of the time, the next sentence is the actual continuing sentence.
- 50% of the time, the next sentence is an unrelated sentence.
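As a rough sketch of the 50/50 rule above, the following Python builds one next-sentence example from a list of sentence pairs. The names `make_nsp_example` and `format_input` are hypothetical, and drawing the unrelated sentence from a different line of the same corpus is an assumption, not necessarily what this repo does.

```python
import random


def make_nsp_example(pairs, index, rng):
    """Return (first, second, is_next) for the pair at `index` (hypothetical helper).

    pairs: list of (first_sentence, second_sentence) tuples, at least two long.
    is_next is 1 for the real continuation and 0 for an unrelated sentence.
    """
    first, true_next = pairs[index]
    if rng.random() < 0.5:
        return first, true_next, 1  # 50%: keep the real next sentence
    # 50%: take the second sentence of a different, randomly chosen line.
    other = rng.choice([i for i in range(len(pairs)) if i != index])
    return first, pairs[other][1], 0


def format_input(first, second):
    """Join two sentences with the [CLS] and [SEP] markers."""
    return f"[CLS] {first} [SEP] {second} [SEP]"


pairs = [
    ("the man went to the store", "he bought a gallon of milk"),
    ("penguins live in the south", "penguins are flightless birds"),
]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
first, second, is_next = make_nsp_example(pairs, 0, rng)
print(format_input(first, second), "IsNext" if is_next else "NotNext")
```

Excluding the current line when sampling avoids labeling a true continuation as NotNext.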
Usage
NOTICE : Your corpus should be prepared with two sentences per line, separated by a tab (\t)
Welcome to the \t the jungle \n
I can stay \t here all night \n
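To make the expected corpus layout concrete, here is a small sketch that parses lines in this format into sentence pairs. The helper name `parse_corpus_line` is hypothetical, and stripping the whitespace around each sentence is an assumption about how the two fields should be cleaned.

```python
def parse_corpus_line(line):
    """Split one corpus line into its two tab-separated sentences."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        raise ValueError(f"expected exactly one tab separator, got: {line!r}")
    first, second = parts
    return first.strip(), second.strip()


# The two example lines from above, as they would appear in a file.
corpus = "Welcome to the \t the jungle \nI can stay \t here all night \n"
pairs = [parse_corpus_line(line) for line in corpus.splitlines() if line.strip()]
print(pairs)
# [('Welcome to the', 'the jungle'), ('I can stay', 'here all night')]
```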
1. Building vocab based on your corpus
python build_vocab.py -c data/corpus.small -o data/corpus.small.vocab
usage: build_vocab.py [-h] -c CORPUS_PATH -o OUTPUT_PATH [-s VOCAB_SIZE]
                      [-e ENCODING] [-m MIN_FREQ]

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS_PATH, --corpus_path CORPUS_PATH
  -o OUTPUT_PATH, --output_path OUTPUT_PATH
  -s VOCAB_SIZE, --vocab_size VOCAB_SIZE
  -e ENCODING, --encoding ENCODING
  -m MIN_FREQ, --min_freq MIN_FREQ
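For intuition about what the vocabulary size and minimum frequency options might control, here is a minimal sketch of a frequency-based vocabulary. It is not the code in build_vocab.py: the function `build_vocab`, the whitespace tokenization, the list of special tokens, and the choice to count special tokens toward the size limit are all assumptions made for illustration.

```python
from collections import Counter

# Assumed special tokens; the real script may use a different set.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(lines, vocab_size=None, min_freq=1):
    """Map tokens to ids, most frequent first (hypothetical helper)."""
    counter = Counter()
    for line in lines:
        # Treat the tab separator like any other whitespace.
        counter.update(line.split())
    # Drop words that occur fewer than min_freq times.
    words = [word for word, count in counter.most_common() if count >= min_freq]
    if vocab_size is not None:
        # Assumption: the size limit includes the special tokens.
        words = words[: max(vocab_size - len(SPECIAL_TOKENS), 0)]
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + words)}


lines = ["Welcome to the \t the jungle", "I can stay \t here all night"]
vocab = build_vocab(lines, min_freq=2)
print(vocab)
```

With `min_freq=2`, only "the" survives from the two example lines, so the resulting mapping holds the five special tokens plus that one word.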
2. Building BERT train dataset with your corpus
python build_dataset.py -c data/corpus.small -v data/corpus.small.vocab -o data/dataset.small
usage: build_dataset.py [-h] -v VOCAB_PATH -c CORPUS_PATH [-e ENCODING]
                        -o OUTPUT_PATH

optional arguments:
  -h, --help            show this help message and exit
  -v VOCAB_PATH, --vocab_path VOCAB_PATH
  -c CORPUS_PATH, --corpus_path CORPUS_PATH
  -e ENCODING, --encoding ENCODING
  -o OUTPUT_PATH, --output_path OUTPUT_PATH
3. Train your own BERT model
python train.py -d data/dataset.small -v data/corpus.small.vocab -o output/
usage: train.py [-h] -d TRAIN_DATASET [-t TEST_DATASET] -v VOCAB_PATH
                -o OUTPUT_DIR [-hs HIDDEN] [-n LAYERS] [-a ATTN_HEADS]
                [-s SEQ_LEN] [-b BATCH_SIZE] [-e EPOCHS]

optional arguments:
  -h, --help            show this help message and exit
  -d TRAIN_DATASET, --train_dataset TRAIN_DATASET
  -t TEST_DATASET, --test_dataset TEST_DATASET
  -v VOCAB_PATH, --vocab_path VOCAB_PATH
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
  -hs HIDDEN, --hidden HIDDEN
  -n LAYERS, --layers LAYERS
  -a ATTN_HEADS, --attn_heads ATTN_HEADS
  -s SEQ_LEN, --seq_len SEQ_LEN
  -b BATCH_SIZE, --batch_size BATCH_SIZE
  -e EPOCHS, --epochs EPOCHS
Author
Junseong Kim, Scatter Lab ([email protected] / [email protected])
License
This project follows the Apache 2.0 License, as written in the LICENSE file.
Copyright 2018 Junseong Kim, Scatter Lab, respective BERT contributors
Copyright (c) 2018 Alexander Rush : The Annotated Transformer