

source link: https://medium.com/sanrusha-consultancy/databricks-connect-4bc269dfc94d

Databricks Connect
Ease of personal computer with greatness of Databricks Spark Cluster
Photo by Bruce Dixon on Unsplash
Shilpa, our data scientist, is in love with Spark after going through our earlier articles on Spark and PySpark.
While Spark, with its in-memory computation and real-time data streaming, is making her life better, using IDE tools against a Databricks cluster is not the same as using IDE tools on her own computer. She wants to connect her own machine to the Databricks cluster and develop Spark applications in her local Jupyter notebook. And this is where Databricks Connect can help her.
What is Databricks Connect?
Below is the explanation from the Databricks website:
Databricks Connect allows you to connect your favorite IDE (Eclipse, IntelliJ, PyCharm, RStudio, Visual Studio Code), notebook server (Jupyter Notebook, Zeppelin), and other custom applications to Databricks clusters.
Prerequisites
Below are the prerequisites for installing Databricks Connect.
1. Python: Make sure you have the appropriate version of Python installed on your computer. Your Python version must be compatible with the Databricks cluster runtime version; Databricks publishes a matrix of cluster runtime versions and the Python version each one requires.
2. Java: Spark runs in the Java Virtual Machine (JVM). Make sure you have an appropriate version of Java (1.8 or above is recommended) on your computer.
3. Winutils: If your computer runs Windows, download winutils and define the HADOOP_HOME environment variable. I downloaded hadoop-common-2.2.0-bin-master.zip from here and unzipped it. Set HADOOP_HOME to the folder that contains the bin folder holding the winutils file (see the sketch after this list).
4. Databricks Cluster: You will need access to a Databricks Standard (or better) workspace. Community Edition will not work, because it does not offer the option to create an access token, which is required for accessing the cluster from your computer. At the time of writing this article, Databricks was offering free access to its Standard tier for 14 days. Create a Spark cluster in the workspace, and make sure to pick a runtime version in line with the Python version installed on your computer.

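Before moving on, here is a minimal sketch of checking these prerequisites from Python; the HADOOP_HOME path is a hypothetical example and should point at your own unzipped winutils folder.

import os
import subprocess
import sys

# The local Python version must line up with the cluster runtime matrix above.
print("Python:", sys.version.split()[0])

# Java must be installed and on the PATH ("java -version" prints to stderr).
subprocess.run(["java", "-version"], check=True)

# Windows only: HADOOP_HOME must point at the folder that contains the bin
# folder with winutils.exe. The path below is a hypothetical example.
if os.name == "nt":
    os.environ.setdefault("HADOOP_HOME", r"C:\hadoop-common-2.2.0-bin-master")
    print("HADOOP_HOME:", os.environ["HADOOP_HOME"])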
Installation & Configuration
Once you have taken care of the prerequisites, here are the steps for installing, configuring, and testing Databricks Connect on your computer.
a. Uninstall PySpark: If you have already installed PySpark, uninstall it (pip uninstall pyspark); a compatible version will be installed again as part of installing Databricks Connect.

b. Install Databricks Connect:
pip install -U "databricks-connect==9.1"

The version 9.1 is the Databricks cluster runtime version; put the appropriate version based on your cluster's runtime. Specifying it as "databricks-connect==9.1.*" picks up the latest patch release for that runtime.
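As a quick sanity check, you can confirm which client version was actually installed. This sketch uses only the Python standard library (importlib.metadata, Python 3.8+), nothing Databricks-specific.

# Confirm the installed client version matches the cluster runtime.
import importlib.metadata

print(importlib.metadata.version("databricks-connect"))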
c. Now that Databricks Connect is installed, it is time to configure it. The following information is required:
i) Cluster ID: The value after …clusters/ in the cluster URL is the cluster ID. In this example, the cluster ID is 1015-041759-ui3itm88.
ii) Host: The value up to, but not including, ?o= is the host. In the workspace URL https://dbc-08951d2d-3041.cloud.databricks.com/?o=3676650223046134, the host is https://dbc-08951d2d-3041.cloud.databricks.com.
iii) Org ID: The value after ?o= in the URL is the org ID. In this example, 3676650223046134 is the org ID.

iv) Token: Go to User Settings, open Access Tokens, and click Generate New Token. Copy the token value.

Once you have all the information listed above (i to iv), run the command below on your computer:
databricks-connect configure
Provide the requested information at each prompt.

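The configure command saves these answers to a .databricks-connect file in your home directory. As an alternative sketch, the classic Databricks Connect client also reads the same settings from environment variables, so you can set them programmatically; the values below are the examples from this article, and the token is a placeholder to replace with your own.

import os

# Same settings as the interactive prompts, supplied as environment variables.
os.environ["DATABRICKS_ADDRESS"] = "https://dbc-08951d2d-3041.cloud.databricks.com"
os.environ["DATABRICKS_API_TOKEN"] = "dapi..."  # placeholder personal access token
os.environ["DATABRICKS_CLUSTER_ID"] = "1015-041759-ui3itm88"
os.environ["DATABRICKS_ORG_ID"] = "3676650223046134"
os.environ["DATABRICKS_PORT"] = "15001"  # default Databricks Connect port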

Congratulations, Databricks Connect is installed and configured on your computer. You are now ready to test it. Run the command below:
databricks-connect test

If you see the message “All tests passed”, you are all set: Databricks Connect is working. Now you can open a Jupyter notebook and start developing Spark applications.
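A minimal first cell to try in Jupyter: with Databricks Connect configured, the usual SparkSession builder connects to the remote cluster, and a trivial job confirms the round trip.

from pyspark.sql import SparkSession

# With Databricks Connect configured, this session runs on the remote
# Databricks cluster rather than a local Spark installation.
spark = SparkSession.builder.getOrCreate()

# A trivial job: build a 100-row DataFrame on the cluster and count it.
print(spark.range(100).count())  # prints 100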
Implementation
It’s time to put Databricks Connect to work.
Download the Pima Indians Diabetes Database file (diabetes.csv) from Kaggle.
Upload the diabetes.csv file to the cluster.

Now open a Jupyter notebook and start developing a Spark application against the Databricks cluster.

You can review the PySpark SQL, machine learning, and other scripts from my previous articles and run them in your own Jupyter notebook, as in the sketch below.
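Here is a short sketch of such a notebook cell. It assumes the upload landed at /FileStore/tables/diabetes.csv (the default location for UI uploads; adjust the path if yours differs) and uses the dataset's Outcome label column.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the uploaded CSV from DBFS; adjust the path to your upload location.
diabetes = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/FileStore/tables/diabetes.csv")
)

diabetes.printSchema()

# A simple aggregation: patients per outcome class (1 = diabetic, 0 = not).
diabetes.groupBy("Outcome").count().show()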
Conclusion
Spark is revolutionary, and Databricks clusters are in great demand for running Spark workloads. Databricks Connect adds the much-awaited flexibility of developing Spark applications from your own computer.
Happy Spark development through Databricks Connect!