
Tutorial: How to build a Tokenizer in Spark and Scala

source link: https://blog.knoldus.com/tutorial-how-to-build-a-tokenizer-in-spark-and-scala/

Reading Time: 2 minutes

In our earlier blog A Simple Application in Spark and Scala, we explained how to build Spark and make a simple application using it.

In this blog, we will see how to build a fast Tokenizer in Spark & Scala using sbt.

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is, however, a slow process, but with the help of Spark we can make it fast by running it on chunks of the text in parallel.
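To make the idea concrete before bringing Spark in, here is a minimal plain-Scala sketch of the same two steps, tokenizing an in-memory string and counting term frequency with ordinary collections (the object name and sample text are purely illustrative):

object SimpleTokenizer {
  def main(args: Array[String]) {
    // A tiny in-memory "document" (illustrative text)
    val text = "spark makes tokenization fast and spark makes it scale"
    // Tokenize: split the stream of text into words
    val tokens = text.split(" ").toList
    // Term frequency: count how many times each token occurs
    val termFrequency = tokens.groupBy(identity).mapValues(_.size)
    termFrequency.foreach { case (term, freq) => println("Term, Frequency: (" + term + ", " + freq + ")") }
  }
}

Spark lets us do exactly this, but distributed over partitions of a file instead of a single in-memory string.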

In the following example, we will see how to tokenize (segregate) the words in a text file and count the number of times they occur in it (i.e., their term frequency).

Before you start building this application, follow the instructions for building an application in Spark given here.

After building the application, we can start building the Tokenizer.
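For reference, the build definition would look roughly like the following build.sbt. This is a minimal sketch; the exact Scala and Spark versions are assumptions chosen to match the jar name and Spark home used in the code below, so adjust them to your own installation.

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"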

To build the Tokenizer, create a file TokenizerApp.scala in your application like this

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object TokenizerApp {
  def main(args: Array[String]) {
    val logFile = "src/data/sample.txt" // Should be some file on your system
    // Old-style SparkContext constructor (Spark 0.9.x): master, app name, Spark home, jars
    val sc = new SparkContext("local", "Tokenizer App", "/path/to/spark-0.9.1-incubating",
      List("target/scala-2.10/simple-project_2.10-1.0.jar"))
    // Load and cache the file in 2 partitions (not used further in this example)
    val logData = sc.textFile(logFile, 2).cache()
    // Tokenize: split each line on spaces to obtain the tokens
    val tokens = sc.textFile(logFile, 2).flatMap(line => line.split(" "))
    // Term frequency: count how many times each token occurs
    val termFrequency = tokens.map(word => (word, 1)).reduceByKey((a, b) => a + b)
    termFrequency.collect.foreach(tf => println("Term, Frequency: " + tf))
    // Save the results; these output directories must not already exist
    tokens.saveAsTextFile("src/data/tokens")
    termFrequency.saveAsTextFile("src/data/term_frequency")
  }
}

As we can see, while processing the “textFile” to obtain tokens, we have split the “logFile” into 2 parts.

val tokens = sc.textFile(logFile, 2).flatMap(line => line.split(" "))

This makes processing of the text file faster, as Spark can work on these 2 parts of the file in parallel to obtain the tokens. So, the greater the number of splits, the faster the execution (given enough cores to process them in parallel).
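For example, to split the same file into 4 parts instead (the 4 here is just an illustrative value), we could write:

val tokens = sc.textFile(logFile, 4).flatMap(line => line.split(" "))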

However, if you want to run Spark on 2 or more cores, specify the number of cores in the “local[n]” field of the “SparkContext”, like this

val sc = new SparkContext("local[2]", "Tokenizer App", "/path/to/spark-0.9.1-incubating", List("target/scala-2.10/simple-project_2.10-1.0.jar"))

This will run the Tokenizer on 2 cores. Note that the number of splits should be a multiple of “n”.
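Putting the two together, a sketch that runs the Tokenizer on 2 local cores with 4 splits (a multiple of 2, chosen here just for illustration) would look like this:

val sc = new SparkContext("local[2]", "Tokenizer App", "/path/to/spark-0.9.1-incubating",
  List("target/scala-2.10/simple-project_2.10-1.0.jar"))
val tokens = sc.textFile(logFile, 4).flatMap(line => line.split(" "))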

To download a Demo Application, click here.

