Perhaps this sample classification using a decision tree and a random forest in Spark 2.2 can be useful for you (see code below and attached). The code reads a CSV file with the training/testing data. Notice that this code uses Spark ML instead of MLlib (the "old" RDD-based API) and that it is written in Scala.
Best Regards,
Heitor
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

object RandomForestDecisionTreeExample {

  val session = SparkSession.builder
    .master("local")
    .appName("spark-rf-dt-example")
    .config("spark.some.config.option", "config-value")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    // This is just to avoid thousands of INFO messages in the output...
    Logger.getLogger("org").setLevel(Level.ERROR)
    classificationExample()
    session.stop()
  }

  def classificationExample(): Unit = {
    // This reads the data from a CSV file.
    // Notice that my file has no header, thus I set header to false.
    val irisDataFrame: DataFrame = session
      .read
      .option("header", false)
      .option("inferSchema", true)
      .csv("../data/iris.csv")

    // This shows the schema, i.e. which features there are and their assigned names.
    irisDataFrame.printSchema()

    // This splits the data into training and testing sets.
    val Array(trainData, testData) = irisDataFrame.randomSplit(Array(0.7, 0.3))
    trainData.cache()
    testData.cache()

    // All columns except _c4, which in my case is the class label.
    val inputCols = trainData.columns.filter(_ != "_c4")

    // The VectorAssembler creates another column in my training dataframe named X
    // containing all the input columns, i.e. those that are not _c4.
    val assembler = new VectorAssembler()
      .setInputCols(inputCols)
      .setOutputCol("X")

    // After creating the VectorAssembler object I can transform the trainData.
    val assembledTrainData = assembler.transform(trainData)

    // This is just to take a look at the new X column,
    // which is a vector with all the input data.
    assembledTrainData.select("X").show(truncate = false)

    // Creates a decision tree, specifying the input column X and the output
    // (class label) column _c4; the predictions will be output to a new
    // column named prediction.
    val dtClassifier = new DecisionTreeClassifier()
      .setSeed(1)
      .setLabelCol("_c4")
      .setFeaturesCol("X")
      .setPredictionCol("prediction")
    val dtModel = dtClassifier.fit(assembledTrainData)

    // Creates a random forest. Very similar to the decision tree specification,
    // but here I also set numTrees just to set another hyperparameter.
    val rfClassifier = new RandomForestClassifier()
      .setSeed(1)
      .setLabelCol("_c4")
      .setNumTrees(10)
      .setFeaturesCol("X")
      .setPredictionCol("prediction")
    val rfModel = rfClassifier.fit(assembledTrainData)

    // Evaluation on the test set.
    // First create the X column in the test data as well.
    val assembledTestData = assembler.transform(testData)

    // Obtain the predictions for the decision tree...
    val dtPredictions = dtModel.transform(assembledTestData)
    // ... and for the random forest.
    val rfPredictions = rfModel.transform(assembledTestData)

    // Create the evaluator object.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("_c4")
      .setPredictionCol("prediction")

    // Obtain the accuracy for the decision tree model...
    val dtTestAccuracy = evaluator.setMetricName("accuracy").evaluate(dtPredictions)
    // ... and for the random forest model.
    val rfTestAccuracy = evaluator.setMetricName("accuracy").evaluate(rfPredictions)
    println("test accuracy decision tree = %f and random forest = %f"
      .format(dtTestAccuracy, rfTestAccuracy))

    // Bonus! The feature importances according to MDI from the trained
    // random forest model, sorted in descending order.
    rfModel.featureImportances.toArray
      .zip(Seq("_c0", "_c1", "_c2", "_c3"))
      .sorted.reverse
      .foreach(println)
  }
}
To test the performance of SVM, decision tree, naive Bayes and random forest algorithms for detecting network intrusions on the UNSW-NB15 dataset using Apache Spark, how can I import and read the UNSW-NB15 database?
Where can I find the scripts for these algorithms?
I am not very familiar with the UNSW-NB15 dataset, though I can see (URL below) that it has a file describing the features, UNSW-NB15_features.csv, and CSV files with the actual data, e.g. UNSW-NB15_1.csv. You can read the features file to define the schema (i.e. name the columns for convenience), or you can just infer the schema as I did in the code in my previous reply.
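As a sketch, reading one of the data files could look like the code below. The file paths are placeholders, and I am assuming the data files have no header row and that the features file has a header with the feature names in a column called "Name" (please check the actual file layout, as I have not inspected it myself):

```scala
import org.apache.spark.sql.SparkSession

object UnswNb15ReadExample {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder
      .master("local")
      .appName("unsw-nb15-read-example")
      .getOrCreate()

    // Option 1: infer the schema; columns get generic names _c0, _c1, ...
    val data = session.read
      .option("header", false)
      .option("inferSchema", true)
      .csv("../data/UNSW-NB15_1.csv")
    data.printSchema()

    // Option 2: read the feature-description file and use its "Name"
    // column (an assumption on my part) to rename the inferred columns.
    val featureNames = session.read
      .option("header", true)
      .csv("../data/UNSW-NB15_features.csv")
      .select("Name")
      .collect()
      .map(_.getString(0))
    val named = data.toDF(featureNames: _*)
    named.printSchema()

    session.stop()
  }
}
```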
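As for the algorithms: decision tree and random forest are in the code from my previous reply, and Spark ML also ships naive Bayes and, since Spark 2.2, a linear SVM (LinearSVC). A minimal sketch, assuming you have already built an assembled features column X and a numeric label column named label (both names carried over from my earlier example); note that NaiveBayes requires non-negative feature values and LinearSVC only supports binary labels, e.g. attack vs. normal:

```scala
import org.apache.spark.ml.classification.{LinearSVC, NaiveBayes}

// Naive Bayes classifier from Spark ML.
val nbClassifier = new NaiveBayes()
  .setLabelCol("label")
  .setFeaturesCol("X")
  .setPredictionCol("prediction")

// Linear SVM (binary classification only).
val svmClassifier = new LinearSVC()
  .setMaxIter(100)
  .setRegParam(0.1)
  .setLabelCol("label")
  .setFeaturesCol("X")
  .setPredictionCol("prediction")

// Fit and evaluate the same way as the decision tree / random forest:
// val nbModel = nbClassifier.fit(assembledTrainData)
// val svmModel = svmClassifier.fit(assembledTrainData)
```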