Monday, January 26, 2015

Notes: Build, Deploy Spark 1.2.0 And Run Spark-Example In Intellij Idea

Build Reference:
building-spark-1-2-0 - Official Document

Run Spark On Yarn Reference:
running-spark-1-2-0-on-yarn - Official Document

Tutorial:
quick-start - Spark
Spark Programming Guide - Spark
Spark SQL Programming Guide - Spark SQL
Spark RDD API Examples - Zhen He

Environment:
Spark-1.2.0
Hadoop-2.2.0
Hive-0.13.0
Scala-2.10.4


Configure Intellij Idea for Spark Application (SBT version)
Before setting up the Spark environment in Intellij Idea, we should first make it possible to write Scala in it. This is easier than it sounds, since a Scala plugin is available in the Plugins Manager: go to Preferences (⌘+,) --> Plugins --> Install JetBrains plugins --> Scala.

After restarting Intellij Idea, we can create an SBT-based Scala project via "File --> New Project --> Scala --> SBT".


Then create the directory src/main/java under the project root, and in it create a new file named 'SimpleApp.scala' with the following content:
/* SimpleApp.scala */

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/user/hadoop/poem" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

As its name suggests, it is a simple Spark application.

Then we should configure the project library and module so that the project compiles correctly. Open Project Structure (⌘+;). In the Modules column, select the currently-used Java JDK; in the Libraries column, add the Spark jar, which can be taken from the deployed Spark installation at '$SPARK_HOME/assembly/target/scala-2.10/spark-assembly-1.2.0-hadoop2.2.0.jar'. At this very moment, an Intellij Idea bug may appear when importing this .jar file, with the prompt 'IDEA cannot determine what kind of files the chosen items contain. Choose the appropriate categories from the list.'. Have a look at Possible Error #4 for detailed information.


At this point, no compile errors should remain in SimpleApp.scala.
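As a side note, for quick debugging inside the IDE without touching the cluster, the master can be hard-coded. This is only a sketch, not part of the original workflow; `local[2]` and the local file path are assumptions, and the setMaster call should be removed again before submitting to YARN:

// Local-mode variant of the configuration, for IDE debugging only
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[2]")                    // two local worker threads instead of YARN
val sc = new SparkContext(conf)
val logData = sc.textFile("/tmp/poem", 2).cache()   // an assumed local file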

Since our Spark environment lives on a remote machine, I wrote a script that uploads my Spark application to the remote node and builds it there:
#!/bin/bash

if [ "$#" -ne 1 ]; then
  echo "One argument (Spark_Project_Name) is needed." >&2
  exit 1
fi

rsync -r --delete --exclude=target /Users/jasonzhu/IdeaProjects/$1 hadoop@d80.hide.cn:/home/hadoop/zhudi
ssh hadoop@d80.hide.cn "source /etc/profile; cd /home/hadoop/zhudi/$1; ls -l; sbt package;"
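For `sbt package` on the remote machine to succeed, the project also needs a build definition that declares the Spark dependency. Below is a minimal build.sbt sketch, not the exact file used for this post; the project name and versions are assumptions (judging from the jar name below, the real project was apparently built against Scala 2.11, which is what Possible Error 2 is about):

// build.sbt -- minimal sketch
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

// resolves to spark-core_2.10:1.2.0
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"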

After running this script and ssh-ing to the remote machine, execute the following command to launch SimpleApp:
spark-submit --class "SimpleApp" --master yarn-cluster --num-executors 3 simple-project_2.11-1.0.jar

The result can then be inspected with:
yarn logs -applicationId application_1421455790417_19073 | grep "Lines with a"


Configure Intellij Idea for Spark Application (non-SBT version)
First, install the Scala plugin in Intellij Idea following the procedure above.

Then create a Scala project via "File --> New Project --> Scala --> Non-SBT".

Import the Scala and Spark jars as above. Create a package, then a SimpleApp.scala file inside that package, and put the code above into it.

Next, we should specify how this project is assembled when building it: go to Project Structure --> Artifacts --> Add, and configure it as follows:


After configuring, run Make Project (⌘+F9). The jar file then appears in the output directory configured above.

Copy the jar file to the remote server and launch it with:
spark-submit --class "com.miaozhen.etl.region.distribution.SimpleApp" --master yarn-cluster --num-executors 3 scala.test.jar



One More Step: Integrate Maven With Scala In IDEA
First, create a simple Maven project in IDEA. Then add a Scala Facet in Project Structure (⌘+;) as follows:

At this point, no Scala compiler is specified, so we have to import scala-compiler under Libraries. Select Add --> Java, then select all the .jar files in the $SCALA_HOME/lib directory and click OK.


After that, the scala-compiler library is imported as a whole:

We can then specify the Scala compiler in the Scala Facet interface:

Add the following dependencies and build configuration to pom.xml:
<dependencies>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.3.2</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <target>1.7</target>
                <source>1.7</source>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>

            <configuration>
                <appendAssemblyId>true</appendAssemblyId>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>

        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.1.6</version>
            <executions>
                <execution>
                    <id>compile-scala</id>
                    <phase>compile</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>compile</goal>
                    </goals>
                </execution>
                <execution>
                    <id>test-compile-scala</id>
                    <phase>test-compile</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Here, 'scala-maven-plugin' takes care of compiling the *.scala sources, while 'maven-assembly-plugin' packages everything (including dependencies) into a single jar; running `mvn package` produces both the plain jar and a *-jar-with-dependencies.jar.

Create a Scala object under src/main/java, and test it as below:
object Test {
  def main(args: Array[String]) {
    println("hello from scala.")
    System.out.println("hello from java.")
    println(org.apache.commons.lang3.StringUtils.isEmpty("test dependency in pom.xml"))
  }
}




Read HDFS Files With Custom InputFormat Applied

import com.miaozhen.dm.sdk.inputformat.mr.DMCombineFileInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{Text, LongWritable}
import org.apache.spark.{SparkContext, SparkConf}

object EtlRegionDist {
  def main(args: Array[String]): Unit = {
    val hconf = new Configuration()
    val conf = new SparkConf().setAppName("ETL_Region_Distribution_[ZHUDI]")
    val sc = new SparkContext(conf)


    // Read the HDFS file with the custom InputFormat (new mapreduce API)
    val hadoopFile = sc.newAPIHadoopFile(
                              "/hdfs_path/file-r-00013",
                              classOf[DMCombineFileInputFormat],
                              classOf[LongWritable],
                              classOf[Text],
                              hconf)
    // Count the lines by mapping every record to 1 and summing
    val lineCounts = hadoopFile.map(line => 1L).reduce((a, b) => a + b)
    println("LineCounts=[%s]".format(lineCounts))
  }
}

SparkContext supplies two families of methods for this, `hadoopFile` and `newAPIHadoopFile`. The former is for InputFormats from the 'org.apache.hadoop.mapred' package (the old API), whereas the latter is for those from 'org.apache.hadoop.mapreduce' (the new API).
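As a minimal illustration of the difference, here is a sketch using the plain text InputFormats that ship with Hadoop (not the custom one above), assuming the same `sc` and `hconf` as in the previous code:

// Old API: InputFormat from org.apache.hadoop.mapred
import org.apache.hadoop.mapred.{TextInputFormat => OldTextInputFormat}
val oldRdd = sc.hadoopFile("/hdfs_path/file-r-00013",
  classOf[OldTextInputFormat], classOf[LongWritable], classOf[Text])

// New API: InputFormat from org.apache.hadoop.mapreduce
import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}
val newRdd = sc.newAPIHadoopFile("/hdfs_path/file-r-00013",
  classOf[NewTextInputFormat], classOf[LongWritable], classOf[Text], hconf)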



Integrate HDFS File with Spark SQL

Here's the main code:

import com.miaozhen.dm.sdk.inputformat.mr.DMCombineFileInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path, FileSystem}
import org.apache.hadoop.io.{Text, LongWritable}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql._

object EtlRegionDist {

  def getFilePaths(hconf: Configuration, dir: String): List[Path] =  {
    val fs = FileSystem.get(hconf)
    val path = new Path(dir)

    var res = List[Path]()
    fs.listStatus(path).foreach(fss => {
      res :+= fss.getPath
    })

    return res
  }

  def main(args: Array[String]): Unit = {
    // -- configuration --
    val hconf = new Configuration()
    val conf = new SparkConf().setAppName("ETL_Region_Distribution_[ZHUDI]")
    val sc = new SparkContext(conf)

    /**
     * Spark SQL Context
     */
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // -- Read multiple HDFS files into a merged HadoopFileRDD --
    var hadoopFileSeqs = Seq[org.apache.spark.rdd.RDD[(org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text)]]()
    getFilePaths(hconf, "/tong/data/output/dailyMerger/20140318").foreach(path => {
      val strPath = path.toString
      if(strPath contains "campaign")  {
        val hadoopFile = sc.newAPIHadoopFile(
          strPath,
          classOf[DMCombineFileInputFormat],
          classOf[LongWritable],
          classOf[Text],
          hconf)
        hadoopFileSeqs :+= hadoopFile
      }
    })
    val hadoopFiles = sc.union(hadoopFileSeqs)


    // -- Create SchemaRDD --
    val schema = StructType("uuid ip plt".split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    val rowRDD = hadoopFiles.map(line =>
          line._2.toString.split("\\^") flatMap {
            field => {
              var pair = field.split("=", 2)
              if(pair.length == 2)
                Some(pair(0) -> pair(1))
              else
                None
            }
          } toMap
        ).map(kvs => Row(kvs("uuid"), kvs("ip"), kvs("plt").trim))
    val schemaRDD = sqlContext.applySchema(rowRDD, schema)
    schemaRDD.registerTempTable("etllog")


    val ipToRegion = new IpToRegionFunction

    sqlContext.registerFunction("ipToRegion", (x:String) => ipToRegion.evaluate(x, "/tong/data/resource/dicmanager/IPlib-Region-000000000000000000000008-top100-top100-top100-merge-20141024182445"))
    var result = sqlContext.sql("SELECT ipToRegion(ip), count(distinct(uuid)) from etllog where plt='0' group by ipToRegion(ip)")
    result.collect().foreach(println)


  }
}

Here, a custom function is registered, much like a 'UDF' in Hive. In my case, this custom function is transplanted from a pure Java class that originally served as a Hive UDF:

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Serializable;
import java.util.TreeMap;


public class IpToRegionFunction implements Serializable{

    private TreeMap<Pair, Integer> iplib = null;

    public Integer evaluate(String ip, String iplibFile) throws IOException {
        long sum = 0L;

        if (iplib == null) {
            iplib = new TreeMap<Pair, Integer>();
            readFile(iplibFile);
        }
        String[] parts = ip.split("\\.");
        if (parts.length == 4) {
            long a = Long.valueOf(parts[0]);
            long b = Long.valueOf(parts[1]);
            long c = Long.valueOf(parts[2]);
            long d = Long.valueOf(parts[3]);

            sum = a * 256l * 256l * 256l + b * 256l * 256l + c * 256l + d;
        }
        Pair pair = new Pair( sum , sum );

        return iplib.get(iplib.floorKey(pair)) ;
    }

    private void readFile(String filename) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(filename);

        String line;
        BufferedReader in = null;
        try {
            FileSystem fs = path.getFileSystem(conf);
            in = new BufferedReader(new InputStreamReader(fs.open(path)));
            int columns = 0;
            while ((line = in.readLine()) != null) {
                String[] arr = line.split(",");
                if (columns == 0) {
                    columns = arr.length;
                }
                if (columns == 4 && arr[0].equals("0")) {
                    iplib.put(new Pair(Long.valueOf(arr[1]), Long.valueOf(arr[2])), Integer.parseInt(arr[3]));
                } else if (columns == 3 && arr.length == 3) {
                    iplib.put(new Pair(Long.valueOf(arr[0]), Long.valueOf(arr[1])), Integer.parseInt(arr[2]));
                }

            }
        } finally {
            IOUtils.closeQuietly(in);
        }
    }

    private static class Pair implements Comparable{
        final long start;
        final long end;

        Pair(long start, long end) {
            this.start = start;
            this.end = end;
        }

        private boolean isBetween(long ip) {
            if (ip >= start && ip <= end)
                return true;
            return false;
        }
        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (o == null || getClass() != o.getClass()) return false;

            Pair pair = (Pair) o;

            if (end != pair.end) return false;
            if (start != pair.start) return false;

            return true;
        }
        public int compareTo(Object other) {
            long other_start = ((Pair)other).start;
            if( start != other_start )
                return (start < other_start) ? -1 : 1;
            else
                return 0;
        }
        @Override
        public int hashCode() {
            int result = (int) (start ^ (start >>> 32));
            result = 31 * result + (int) (end ^ (end >>> 32));
            return result;
        }
    }
}

We don't really have to care about the detailed logic in this Java class. Only one thing needs attention: the class must implement Serializable, or Spark will throw an exception like "Exception in thread "Driver" org.apache.spark.SparkException: Task not serializable, ..., Caused by: java.io.NotSerializableException: com.miaozhen.etl.region.distribution.IpToRegionFunction".
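The same requirement holds if the helper is written in Scala. A minimal sketch with a hypothetical helper (not the class above), registered the same way as in the Spark SQL code earlier:

// Hypothetical Scala helper used from a registered SQL function.
// It must be serializable because it is captured by the closure
// that Spark ships to the executors.
class UpperCaseFunction extends Serializable {
  def evaluate(s: String): String = if (s == null) null else s.toUpperCase
}

val upper = new UpperCaseFunction
sqlContext.registerFunction("toUpper", (x: String) => upper.evaluate(x))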

After building it in IDEA, upload it to the server and execute it with:
spark-submit --class "com.hide.region.distribution.EtlRegionDist" --master yarn-cluster --num-executors 32 --driver-memory 8g --executor-memory 4g --executor-cores 4 scala.test.jar


Apply Custom Jar (Lib) To spark-shell

Adding your custom jar's path to the SPARK_CLASSPATH environment variable will do the trick:
export SPARK_CLASSPATH=/home/workspace/spark-1.2.0-bin-hadoop2/lib_managed/jars/com.hide.dm.sdk-1.0.1-SNAPSHOT.jar

Another way to do so is to use `spark-shell/spark-submit --jars <custom_jars>`.

For both approaches, remember to put the .jar file in the same directory both on the Spark gateway and on all the nodes in the Hadoop cluster.
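A quick sanity check inside spark-shell that the jar is actually visible (the class name is the one used in the examples above):

// In spark-shell: if this import fails, the custom jar is not on the driver classpath
import com.miaozhen.dm.sdk.inputformat.mr.DMCombineFileInputFormat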


Execute HQL in Spark SQL

Since a great number of our production tasks are written in HQL (HiveQL), porting them all from HQL to plain Spark SQL would be very costly. Fortunately, Spark SQL provides an interface called HiveContext, which supports reading and writing data stored in Hive. However, since Hive has a large number of dependencies, it is not included in the default Spark assembly. To equip Spark with Hive, we have to add the -Phive and -Phive-thriftserver flags when building Spark, according to the official document. In my case, however, the following error was thrown when building Spark with the '-Phive-thriftserver' flag:
[ERROR] Failed to execute goal on project spark-assembly_2.10: Could not resolve dependencies
for project org.apache.spark:spark-assembly_2.10:pom:1.2.0: The following artifacts could not
be resolved: org.apache.spark:spark-repl_2.11:jar:1.2.0, org.apache.spark:spark-yarn_2.11:jar:1.2.0,
org.apache.spark:spark-hive-thriftserver_2.11:jar:1.2.0: Failure to find org.apache.spark:spark-repl_2.11:jar:1.2.0
in https://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted
until the update interval of central has elapsed or updates are forced -> [Help 1]

Thus my build command, with -Phive-thriftserver dropped for the moment, was:
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Dscala-2.11 -Phive -DskipTests clean package

The above ERROR can also be resolved by the following command: as stated in the official document, Scala 2.11 support in Spark 1.2.0 is still experimental, so I switched back to Scala 2.10.4 and the build (with both -Phive and -Phive-thriftserver) works fine:
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Phive -Phive-thriftserver -DskipTests clean package

After importing the newly built 'spark-assembly-1.2.0-hadoop2.2.0.jar' into Intellij Idea, we can use HiveContext to implement our Hello-HQL-from-Spark-SQL project:
package com.miaozhen.etl.region.distribution

import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}

object HiveTest {

  def main(args: Array[String]): Unit = {
    // -- configuration --
    val hconf = new Configuration()
    val conf = new SparkConf().setAppName("HiveTest_[ZHUDI]")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val hqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    hqlContext.sql("show databases").collect().foreach(println)
  }
}

Upon submitting this Spark task, several obstacles were encountered.

ERROR #1:
Exception in thread "Driver" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
...
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
...
Caused by: java.lang.reflect.InvocationTargetException
...
Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
...
Caused by: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...

Obviously, some jar files are not referenced correctly. After some googling, submitting the task with the following command solves the issue:
spark-submit --class "com.miaozhen.etl.region.distribution.HiveTest" \
--jars /home/workspace/spark-1.2.0-bin-hadoop2/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar,/home/workspace/spark-1.2.0-bin-hadoop2/lib_managed/jars/datanucleus-rdbms-3.2.9.jar,/home/workspace/spark-1.2.0-bin-hadoop2/lib_managed/jars/datanucleus-core-3.2.10.jar,/home/workspace/hadoop/sqoop/lib/mysql-connector-java-5.1.31.jar \
--master yarn-cluster \
--num-executors 16 \
--driver-memory 8g \
--executor-memory 4g \
--executor-cores 4 \
scala.test.jar

ERROR #2:
ERROR metastore.RetryingHMSHandler: NoSuchObjectException(message:There is no database named logbase)
...
15/02/09 10:08:50 ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: logbase
...
Exception in thread "Driver" org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Database does not exist: logbase
...

This error has two possible causes. One is that the HQL itself is incorrect; start a Hive CLI and copy-paste the HQL into it to double-check its correctness. The other is that 'hive-site.xml' is not loaded by the Spark engine. If you are using a remote metastore, which requires configuring 'thrift' in 'hive-site.xml', check whether there is any thrift-related info in the yarn log:
> grep -i thrift ttt17.txt --color
ttt17.txt:15/02/09 15:42:33 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.7.11:9083

If nothing thrift-related shows up, the second cause is the culprit. According to a solution from another blog, we should put hive-site.xml under '$HADOOP_HOME/etc/hadoop' both on the gateway where Spark is deployed and on all the nodes in the Hadoop cluster. That solves the problem.
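If copying hive-site.xml around is inconvenient, setting the metastore URI programmatically may also work. This is only a sketch that I have not verified in this setup; the host and port are placeholders taken from the log line above:

val hqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// hive.metastore.uris is the standard Hive property for a remote metastore
hqlContext.setConf("hive.metastore.uris", "thrift://192.168.7.11:9083")
hqlContext.sql("show databases").collect().foreach(println)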

P.S. Sometimes we can launch `spark-shell` with the same arguments as `spark-submit` to debug the code interactively and get more detailed, targeted information about the error.

Some other references are listed here:
1. http://www.it165.net/pro/html/201501/31478.html

ERROR #3:
When using the Hive syntax `add jar`, a FileNotFoundException is thrown. According to this post, we have to use `spark-shell/spark-submit --jars <custom_jars>` to specify our own jar files, where custom_jars is a comma-delimited list of .jar file paths. Moreover, remember to put the .jar files on all machines: the gateway that submits the Spark task as well as all the nodes in the Hadoop cluster. An example is shown above in ERROR #1.



Possible Error 1)
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

Solution:
1. http://solaimurugan.blogspot.tw/2014/10/error-and-solaution-detailed-step-by.html
2. https://groups.google.com/forum/#!topic/predictionio-user/Bq0HBCM1ytI
3. https://spark.apache.org/docs/1.1.1/spark-standalone.html


Possible Error 2)
When launching a scala project by spark-submit, the following error is thrown:
$ spark-submit --class "com.miaozhen.etl.region.distribution.ListFilesInDir"  scala.test.jar

...
Exception in thread "main" java.lang.NoSuchMethodError: scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
 at com.miaozhen.etl.region.distribution.ListFilesInDir$.getf(ListFilesInDir.scala:34)
 at com.miaozhen.etl.region.distribution.ListFilesInDir$.main(ListFilesInDir.scala:50)
 at com.miaozhen.etl.region.distribution.ListFilesInDir.main(ListFilesInDir.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Solution:
This is because Spark 1.2.0 uses Scala 2.10 by default, whereas our Scala project was compiled with Scala 2.11.
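The simpler fix, assuming the project is built with sbt, is to pin the project to Scala 2.10 so that it matches the deployed Spark; a one-line sketch for build.sbt:

scalaVersion := "2.10.4"   // match the Scala version of the deployed Spark 1.2.0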

If instead we intend to compile Spark itself with Scala 2.11, some explicit steps are needed first, as stated in the official document:
dev/change-version-to-2.11.sh
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package


Possible Error 3)
When running a Spark task in yarn-cluster mode, like:
spark-submit --class "com.hide.region.distribution.EtlRegionDist" --master yarn-cluster --num-executors 3 scala.test.jar

The most commonly-seen error would be:
Exception in thread "main" org.apache.spark.SparkException: Application finished with failed status
 at org.apache.spark.deploy.yarn.ClientBase$class.run(ClientBase.scala:504)
 at org.apache.spark.deploy.yarn.Client.run(Client.scala:35)
 at org.apache.spark.deploy.yarn.Client$.main(Client.scala:139)
 at org.apache.spark.deploy.yarn.Client.main(Client.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

When checking the detailed error info via `yarn logs -applicationId <app_id> > log.out`, we may not find 'ERROR' or 'FATAL' in the output file. This is because the exception is sometimes printed at INFO level. In that case, search for the keyword 'Exception' (or, case-insensitively in vim, 'exception\c') to pinpoint the real culprit.


Possible Error 4)
When importing the 'spark-assembly-1.2.0-hadoop2.2.0.jar' file in Project Structure (⌘+;), Intellij Idea prompts 'IDEA cannot determine what kind of files the chosen items contain. Choose the appropriate categories from the list.'. Although it is only a prompt, after you choose 'classes' and apply it to the current project, many syntax errors are reported in your code; building the project (⌘+F9) will nevertheless succeed.

A distinguishing characteristic between a normal .jar file and this 'spark-assembly-1.2.0-hadoop2.2.0.jar' is that the latter's content cannot be expanded in the selection dialog:

I posted this phenomenon on stackoverflow and, as of now, no answer has been provided. Anyway, I found a cumbersome workaround myself: open the jar file with Archive Expert on Mac OSX, select 'Extract All' to a folder, go into this folder in Finder, select all the content, right-click and choose 'Compress .. Items'. This generates a .zip file; rename it to .jar, say 'spark-assembly-1.2.0-hadoop2.2.0_regen.jar'. This new jar file is recognized by Intellij Idea, so no syntax errors are reported. However, when building the project, 'Package XXX does not exist' errors appear. Consequently, we have to import this jar file together with the original 'spark-assembly-1.2.0-hadoop2.2.0.jar'. With both imported, neither the editor nor the build reports any error anymore.



Possible Error 5)
When writing a Spark program in Intellij Idea, methods like "reduceByKey" and "join", which live in PairRDDFunctions, are flagged as errors and no code completion is available:

This is due to a missing import: the implicit conversions that add these methods to pair RDDs live in the SparkContext companion object. Appending `import org.apache.spark.SparkContext._` to the imports fixes it.
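A minimal illustration, assuming an existing SparkContext named `sc` (the word-count pair RDD is just an example):

import org.apache.spark.SparkContext._   // brings rddToPairRDDFunctions into scope

val counts = sc.textFile("/user/hadoop/poem")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                    // now resolves via PairRDDFunctions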








8 comments:

  1. Thank you Jason. It's rare to see someone take the time to document this kind of project setup with such detail. The error resolutions are particularly helpful.

  2. Since Spark was not that fully-fledged at that time when the latest version was still 1.2.0, I found it necessary to take down some notes on all the 'gotchas' upon Spark. I'd be happy if it helps :]

  3. Thanks to share great Apache Spark tips , but i request share it in eclipse instead of intellij. thanks to share

  4. Thank you. Possible Error 4 is really helpful.
