Perhaps this sample classification using a decision tree and a random forest in Spark 2.2 can be useful for you (see code below and attached). The code reads a CSV file with the training/testing data. Notice that this code uses Spark ML instead of MLlib (the "old" RDD-based API) and that it is written in Scala.
Best Regards,
Heitor
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

object RandomForestDecisionTreeExample {

  val session = SparkSession.builder
    .master("local")
    .appName("spark-rf-dt-example")
    .config("spark.some.config.option", "config-value")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    // This is just to avoid thousands of INFO messages in the output...
    Logger.getLogger("org").setLevel(Level.ERROR)
    classificationExample()
    session.stop()
  }

  def classificationExample(): Unit = {
    // This reads the data from a CSV file.
    // Notice that my file has no header, thus I set header to false.
    val irisDataFrame: DataFrame = session
      .read
      .option("header", false)
      .option("inferSchema", true)
      .csv("../data/iris.csv")

    // This shows the schema, i.e. which features there are and their assigned names.
    irisDataFrame.printSchema()

    // This splits the data into training and testing sets.
    val Array(trainData, testData) = irisDataFrame.randomSplit(Array(0.7, 0.3))
    trainData.cache()
    testData.cache()

    // All columns except _c4, which in my case is the class label.
    val inputCols = trainData.columns.filter(_ != "_c4")

    // The VectorAssembler creates another column in my training dataframe named X
    // containing all the input columns, i.e. those that are not _c4.
    val assembler = new VectorAssembler()
      .setInputCols(inputCols)
      .setOutputCol("X")

    // After creating the VectorAssembler object I can transform the trainData.
    val assembledTrainData = assembler.transform(trainData)

    // This is just to take a look at the new X column,
    // which is a vector with all the input data.
    assembledTrainData.select("X").show(truncate = false)

    // Creates a decision tree, specifying the input column X and the output
    // (class label) column _c4; the predictions will be output to a new
    // column named prediction.
    val dtClassifier = new DecisionTreeClassifier()
      .setSeed(1)
      .setLabelCol("_c4")
      .setFeaturesCol("X")
      .setPredictionCol("prediction")
    val dtModel = dtClassifier.fit(assembledTrainData)

    // Creates a random forest. Very similar to the decision tree specification,
    // but here I also set numTrees just to set another hyperparameter.
    val rfClassifier = new RandomForestClassifier()
      .setSeed(1)
      .setLabelCol("_c4")
      .setNumTrees(10)
      .setFeaturesCol("X")
      .setPredictionCol("prediction")
    val rfModel = rfClassifier.fit(assembledTrainData)

    // Evaluation on the test set.
    // First create the X column in the test data as well.
    val assembledTestData = assembler.transform(testData)

    // Obtain the predictions for the decision tree...
    val dtPredictions = dtModel.transform(assembledTestData)
    // ... and for the random forest.
    val rfPredictions = rfModel.transform(assembledTestData)

    // Create the evaluator object.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("_c4")
      .setPredictionCol("prediction")

    // Obtain the accuracy for the decision tree model...
    val dtTestAccuracy = evaluator.setMetricName("accuracy").evaluate(dtPredictions)
    // ... and for the random forest model.
    val rfTestAccuracy = evaluator.setMetricName("accuracy").evaluate(rfPredictions)
    println("test accuracy decision tree = %f and random forest = %f"
      .format(dtTestAccuracy, rfTestAccuracy))

    // Bonus! The feature importances according to MDI from the trained
    // random forest model, sorted in descending order.
    rfModel.featureImportances.toArray
      .zip(Seq("_c0", "_c1", "_c2", "_c3"))
      .sorted.reverse
      .foreach(println)
  }
}
To test the performance of SVM, decision tree, naive Bayes and random forest algorithms for detecting network intrusions on the UNSW-NB15 dataset using Apache Spark, how can I import and read the UNSW-NB15 database?
Where can I find the scripts for these algorithms?
I am not very familiar with the UNSW-NB15 dataset, though I can see (URL below) that it has a file describing the features, UNSW-NB15_features.csv, and CSV files with the actual data, e.g. UNSW-NB15_1.csv. You can read the features file to define the schema (i.e. name the columns for convenience), or you can just infer the schema as I did in the code in my previous reply.
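As a sketch, reading one of the data files could look like the code below. The file paths are placeholders, and I am assuming the data files have no header row and that the features file has a header with the feature names in a column called "Name" (please check the actual file layout, as I have not inspected it myself):

```scala
import org.apache.spark.sql.SparkSession

object UnswNb15ReadExample {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder
      .master("local")
      .appName("unsw-nb15-read-example")
      .getOrCreate()

    // Option 1: infer the schema; columns get generic names _c0, _c1, ...
    val data = session.read
      .option("header", false)
      .option("inferSchema", true)
      .csv("../data/UNSW-NB15_1.csv")
    data.printSchema()

    // Option 2: read the feature-description file and use its "Name"
    // column (an assumption on my part) to rename the inferred columns.
    val featureNames = session.read
      .option("header", true)
      .csv("../data/UNSW-NB15_features.csv")
      .select("Name")
      .collect()
      .map(_.getString(0))
    val named = data.toDF(featureNames: _*)
    named.printSchema()

    session.stop()
  }
}
```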
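As for the algorithms: decision tree and random forest are in the code from my previous reply, and Spark ML also ships naive Bayes and, since Spark 2.2, a linear SVM (LinearSVC). A minimal sketch, assuming you have already built an assembled features column X and a numeric label column named label (both names carried over from my earlier example); note that NaiveBayes requires non-negative feature values and LinearSVC only supports binary labels, e.g. attack vs. normal:

```scala
import org.apache.spark.ml.classification.{LinearSVC, NaiveBayes}

// Naive Bayes classifier from Spark ML.
val nbClassifier = new NaiveBayes()
  .setLabelCol("label")
  .setFeaturesCol("X")
  .setPredictionCol("prediction")

// Linear SVM (binary classification only).
val svmClassifier = new LinearSVC()
  .setMaxIter(100)
  .setRegParam(0.1)
  .setLabelCol("label")
  .setFeaturesCol("X")
  .setPredictionCol("prediction")

// Fit and evaluate the same way as the decision tree / random forest:
// val nbModel = nbClassifier.fit(assembledTrainData)
// val svmModel = svmClassifier.fit(assembledTrainData)
```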