
Extracting the DDL Blog Corpus

source link: https://bbengfort.github.io/2016/04/extract-ddl-corpus/


April 10, 2016 · 2 min · Benjamin Bengfort

We have some simple text analyses coming up, and as an example, I thought it might be nice to use the DDL blog corpus as a data set. There are relatively few DDL blog posts, but they are all long, with a lot of substantive text and discourse. It might be interesting to try some lightweight analysis on them.

So, how to extract the corpus? The DDL blog is currently hosted on Silvrback, which is designed for text-forward, distraction-free blogging. As a result, there isn’t a lot of cruft on the page. I considered writing a scraper to pull the web pages down, or using the RSS feed for data ingestion; either way, I wouldn’t have had to do much HTML cleaning.

Then I realized – hey, we have all the Markdown in a repository!

By having everything in one place, as Markdown, I don’t have to do a search or a crawl to get all the links. Moreover, I get a bit finer-grained control of what text I want. The question came down to rendering – do I try to analyze the Markdown, or do I render it into HTML?

In the end, I figured rendering the Markdown to HTML with Python would probably produce the best corpus. I’ve created a tool that takes a directory of Markdown files, renders them as HTML or plain text, and then creates the corpus directory organization expected by NLTK. Nicely, this also works with Jekyll! Here is the code:

```python
#!/usr/bin/env python
# mdec
# markdown extract corpus (mdec)
#
# Author:   Benjamin Bengfort <[email protected]>
# Created:  Sun Apr 10 06:55:15 2016 -0400
#
# Copyright (C) 2016 Bengfort.com
# For license information, see LICENSE.txt
#
# ID: mdec.py [] [email protected] $

"""
Markdown extract corpus program - used for extracting a corpus of blogs
from a Markdown directory (like Jekyll or Silvrback).
"""

##########################################################################
## Imports
##########################################################################

import os
import bs4
import glob
import codecs
import argparse
import markdown
import datetime

MARKDOWN_EXTENSIONS = ['.md', '.mdown', '.markdown']

##########################################################################
## Corpus Extraction Tool
##########################################################################

class MDCorpusExtract(object):

    def __init__(self, indirs, outdir, format='html', **kwargs):
        self.mkdown = indirs
        self.corpus = outdir
        self.format = format.lower()
        self.exts   = kwargs.get('extensions', MARKDOWN_EXTENSIONS)

    def extract(self):
        """
        Main entry point to the corpus extraction function.
        """
        if not os.path.exists(self.corpus):
            os.makedirs(self.corpus)

        idx = -1  # guard against an empty input directory
        for idx, path in enumerate(self.files()):
            with codecs.open(path, 'r', encoding='utf-8') as f:
                outpath = self.get_output_path(path)
                with codecs.open(outpath, 'w', encoding='utf-8') as o:
                    o.write(self.render(f.read()))

        self.write_readme()

        print("Wrote {} markdown files to {}".format(idx + 1, self.corpus))

    def files(self, path=None):
        """
        Lists out the files with a markdown extension either in the passed
        in directory or one level below the passed in directory (to read
        files from Jekyll or Silvrback blog repositories).
        """
        # If the path is None, then use the input markdown directories
        if path is None:
            for mkdir in self.mkdown:
                for fileid in self.files(mkdir):
                    yield fileid

        # Recursive call, use the passed in path
        else:
            # First do the subdirectories of this directory
            for name in os.listdir(path):
                name = os.path.join(path, name)
                if os.path.isdir(name):
                    for fileid in self.files(name):
                        yield fileid

            # Next do the glob of the directory
            for ext in self.exts:
                if not ext.startswith("*"):
                    ext = "*" + ext
                pattern = os.path.join(path, ext)
                for fileid in glob.glob(pattern):
                    yield fileid

    def render(self, text):
        """
        Renders the text using the Markdown renderer. If txt is specified
        as the format, returns the plain text extracted from the rendered
        HTML instead.
        """
        if self.format not in {'html', 'txt'}:
            raise TypeError("Unknown format '{}'".format(self.format))

        html = markdown.markdown(text, extensions=['markdown.extensions.extra'])

        if self.format == 'html':
            return html

        soup = bs4.BeautifulSoup(html, 'lxml')

        if self.format == 'txt':
            return soup.get_text()

    def get_output_path(self, path):
        """
        Constructs the output path from the input path.
        """
        basename  = os.path.basename(path)
        name, ext = os.path.splitext(basename)
        name = "{}.{}".format(name, self.format)
        return os.path.join(self.corpus, name)

    def write_readme(self):
        """
        Helper function to quickly write a README.
        """
        readme = os.path.join(self.corpus, 'README.md')
        output = (
            "# Markdown Extracted Corpus\n\n"
            "This corpus was extracted using the mdec.py extractor.\n\n"
            "Info:\n"
            "- extracted on: {}\n"
            "- number of files: {}\n"
            "- extraction format: {}\n"
            "- extracted to: {}\n\n"
            "Input Sources:\n- {}\n\n"
        ).format(
            datetime.datetime.now().strftime('%c'),
            sum(1 for f in self.files()),
            self.format,
            self.corpus,
            "\n- ".join(self.mkdown)
        )

        with codecs.open(readme, 'w', encoding='utf-8') as o:
            o.write(output)

##########################################################################
## Main Method
##########################################################################

if __name__ == '__main__':
    # Create argument parser
    parser = argparse.ArgumentParser(
        description="extract a corpus of blogs from Markdown directory",
        epilog="for bugs or issues, comment on the Gist at http://bit.ly/1Ni7Tas"
    )

    # Add arguments to the parser
    parser.add_argument(
        '-v', '--version', action='version', version='mdec.py 1.0'
    )

    parser.add_argument(
        '-F', '--format', choices=['html', 'txt'], default='html',
        help='specify the corpus output format as html or text'
    )

    parser.add_argument(
        '-C', '--corpus', metavar='DIR', required=True, type=str,
        help='specify the directory to create the corpus in'
    )

    parser.add_argument(
        'mddir', metavar='DIR', nargs='+',
        help='directories with markdown to extract'
    )

    # Create extractor and extract
    args = parser.parse_args()
    extractor = MDCorpusExtract(
        args.mddir, args.corpus, format=args.format
    )
    extractor.extract()
```

Sorry that was so long; I tried to cut it down a bit, but the argparse stuff really does make it quite verbose. Still, the basic methodology is to loop through all the files (recursively descending into subdirectories) looking for *.md or *.markdown files. I then use the Python Markdown library with the markdown.extensions.extra package to render HTML, and to extract the text from that HTML, I’m currently using BeautifulSoup’s get_text.
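If you’d rather not take on the BeautifulSoup and lxml dependencies just for the text format, the standard library’s html.parser can do a rough equivalent of get_text. This is just a sketch of the idea, not what mdec.py actually uses:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the character data from an HTML document, roughly
    approximating what BeautifulSoup's get_text() returns."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    html = "<h1>Title</h1><p>Some <em>markdown</em> rendered to HTML.</p>"
    print(html_to_text(html))
```

The trade-off is that BeautifulSoup is far more forgiving of malformed HTML, which matters less here since the input is machine-rendered by the Markdown library.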

Note also that this tool writes a README with information about the extraction. You can now use nltk.corpus.reader.PlaintextCorpusReader to get access to this text!
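If you just want to sanity-check the extracted directory without pulling in NLTK, a stdlib stand-in for the two corpus reader accessors you’d use first (fileids and words) might look like this; the class and its tokenization are illustrative only, not part of mdec.py or NLTK:

```python
import os
import re
import tempfile


class SimpleCorpusReader(object):
    """A stdlib stand-in for NLTK's PlaintextCorpusReader, exposing
    fileids() and words() over a directory of .txt files."""

    def __init__(self, root, ext=".txt"):
        self.root = root
        self.ext = ext

    def fileids(self):
        # Sorted list of corpus document names, like the NLTK reader
        return sorted(
            name for name in os.listdir(self.root)
            if name.endswith(self.ext)
        )

    def words(self, fileid):
        # Naive whitespace/punctuation tokenization of one document
        path = os.path.join(self.root, fileid)
        with open(path, encoding="utf-8") as f:
            return re.findall(r"\w+", f.read())


if __name__ == "__main__":
    # Build a throwaway one-document corpus to demonstrate the reader
    root = tempfile.mkdtemp()
    with open(os.path.join(root, "post.txt"), "w", encoding="utf-8") as f:
        f.write("Extracting the DDL blog corpus")

    reader = SimpleCorpusReader(root)
    print(reader.fileids())          # ['post.txt']
    print(reader.words("post.txt"))  # ['Extracting', 'the', 'DDL', 'blog', 'corpus']
```

The real PlaintextCorpusReader also gives you sentence and paragraph segmentation, which is what makes it worth installing NLTK for the actual analysis.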

