source link: https://github.com/philipperemy/tensorflow-1.4-billion-password-analysis
1.4 Billion Text Credentials Analysis (NLP)
Using deep learning and NLP to analyze a large corpus of clear text passwords.
Objectives:
- Train a generative model.
- Understand how people change their passwords over time: hello123 -> h@llo123 -> h@llo!23.
Disclaimer: for research purposes only.
In the press
Get the data
- Download any Torrent client.
- Here is a magnet link you can find on Reddit:
- magnet:?xt=urn:btih:7ffbcd8cee06aba2ce6561688cf68ce2addca0a3&dn=BreachCompilation&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fglotorrents.pw%3A6969&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337
- Checksum list is available here: checklist.chk
./count_total.sh
Running it in BreachCompilation should display something like 1,400,553,870 rows.
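The exact contents of count_total.sh aren't shown here, but its effect can be sketched in Python as a recursive line count over every file in the dump (the function name is illustrative):

```python
import os

def count_rows(root: str) -> int:
    """Sum the number of credential lines across every file under the dump."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Read in binary mode: the dump contains non-UTF-8 bytes.
            with open(os.path.join(dirpath, name), "rb") as fh:
                total += sum(1 for _ in fh)
    return total
```

On the full BreachCompilation folder this should land near the 1.4 billion figure above.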
Get started (processing + deep learning)
Process the data and run the first deep learning model:
# Make sure to install the Python dependencies first. A virtual env is recommended here.
# virtualenv -p python3 venv3; source venv3/bin/activate; pip install -r requirements.txt
# Remove "--max_num_files 100" to process the whole dataset (a few hours and 50GB of free disk space are required).
./process_and_train.sh <BreachCompilation path>
Data (explanation)
INPUT: BreachCompilation/
BreachCompilation is organized as:
- a/ - folder of emails starting with a
- a/a - file of emails starting with aa
- a/b
- a/d
- ...
- z/
- ...
- z/y
- z/z
OUTPUT: - BreachCompilationAnalysis/edit-distance/1.csv
- BreachCompilationAnalysis/edit-distance/2.csv
- BreachCompilationAnalysis/edit-distance/3.csv
[...]
> cat 1.csv
1 ||| samsung94 ||| samsung94@
1 ||| 040384alexej ||| 040384alexey
1 ||| HoiHalloDoeii14 ||| hoiHalloDoeii14
1 ||| hoiHalloDoeii14 ||| hoiHalloDoeii13
1 ||| hoiHalloDoeii13 ||| HoiHalloDoeii13
1 ||| 8znachnuu ||| 7znachnuu
EXPLANATION: edit-distance/ contains the password pairs sorted by edit distance.
1.csv contains all pairs with edit distance = 1 (exactly one addition, substitution or deletion),
2.csv all pairs with edit distance = 2, and so on.
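The distance in question is the standard Levenshtein edit distance. A minimal dynamic-programming sketch that reproduces the distances shown in 1.csv:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character additions, deletions or
    substitutions needed to turn a into b (classic DP, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # addition
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

Each pair in 1.csv (e.g. samsung94 / samsung94@) has distance 1 under this function.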
- BreachCompilationAnalysis/reduce-passwords-on-similar-emails/99_per_user.json
- BreachCompilationAnalysis/reduce-passwords-on-similar-emails/9j_per_user.json
- BreachCompilationAnalysis/reduce-passwords-on-similar-emails/9a_per_user.json
[...]
> cat 96_per_user.json
{
"1.0": [
{
"edit_distance": [
0,
1
],
"email": "[email protected]",
"password": [
"090698d",
"090698D"
]
},
{
"edit_distance": [
0,
1
],
"email": "[email protected]",
"password": [
"5555555555q",
"5555555555Q"
]
}
EXPLANATION: reduce-passwords-on-similar-emails/ contains files keyed by the first two
characters of the email address. For example, [email protected] is located in 96_per_user.json.
Each file lists all the passwords grouped by user and by edit distance.
For example, [email protected] had two passwords: 090698d and 090698D. The edit distance between them is 1.
The edit_distance and password arrays have the same length, hence the leading 0 in the
edit_distance array (the distance from the first password to itself).
Those files are useful to model how users change their passwords over time.
We can't recover which password came first, but a shortest Hamiltonian path algorithm is run
to detect the most probable password ordering for each user. For example:
hello => hello1 => hell@1 => hell@11 is the shortest path.
We assume that users are lazy by nature and prefer to change their password by the smallest
number of characters.
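Since a single user rarely has more than a handful of passwords, the shortest Hamiltonian path can simply be brute-forced over all orderings, with edit distance as the edge weight. A sketch (function names are illustrative):

```python
from itertools import permutations

def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (additions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def likely_order(passwords: list):
    """Brute-force shortest Hamiltonian path: try every ordering and keep
    the one whose consecutive edit distances sum to the least."""
    best, best_cost = None, float("inf")
    for perm in permutations(passwords):
        cost = sum(levenshtein(p, q) for p, q in zip(perm, perm[1:]))
        if cost < best_cost:
            best, best_cost = list(perm), cost
    return best, best_cost
```

For the example above, the recovered path is the chain hello, hello1, hell@1, hell@11 (or its reverse, since the direction is not recoverable), with total cost 3.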
Run the data processing alone:
python3 run_data_processing.py --breach_compilation_folder <BreachCompilation path> --output_folder ~/BreachCompilationAnalysis
If the full dataset is too big for your machine, set --max_num_files to a value between 0 and 2000.
- Make sure you have enough free memory (8GB should be enough).
- It took 1h30m to run on an Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz (on a single thread).
- Uncompressed output is around 45GB.