Performance benchmark: processing CSV data


This is the first article in the Big Data series.

Introduction

Sometimes in data science work we need to perform analysis on CSV files. In this article I want to compare the performance of different tools on a real-world use case. The tools covered are:

  • R (using the data.table package)
  • Python (using Pandas)
  • Spark 2.0 (using Spark SQL)

We used Spark 2.0 because it has a built-in CSV reader. (As an alternative, Spark 1.6 can be used with the spark-csv library.)

Task

The task consists of three parts:

  1. Read data from a CSV file with the columns Datacenter, Server, LogLevel, LogMessage
  2. Filter rows with LogLevel = "ERROR"
  3. Aggregate the data and calculate counts grouped by Datacenter and Server

Sample data

The sample CSV file has the following format.
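For illustration only (the values below are made up; the column layout comes from the task description, and a header row is assumed), such a file might look like:

Datacenter,Server,LogLevel,LogMessage
dc1,server1,INFO,Request processed successfully
dc1,server2,ERROR,Connection refused
dc2,server1,ERROR,Disk quota exceeded
dc2,server3,WARN,Slow response time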

The sample was generated using a short Python script.
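A minimal sketch of such a generator (the datacenter/server names, log levels, message text, and output file name are assumptions) could look like this:

import csv
import random

RECORD_COUNT = 10**6  # adjust to the desired sample size

DATACENTERS = ["dc%d" % i for i in range(1, 6)]
SERVERS = ["server%d" % i for i in range(1, 101)]
LOG_LEVELS = ["DEBUG", "INFO", "WARN", "ERROR"]

with open("sample.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Datacenter", "Server", "LogLevel", "LogMessage"])
    for _ in range(RECORD_COUNT):
        writer.writerow([
            random.choice(DATACENTERS),
            random.choice(SERVERS),
            random.choice(LOG_LEVELS),
            "Example log message",
        ])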

We tested different sample sizes. The files were kept uncompressed, because there is no simple way to read a compressed CSV file with data.table.

Record count          CSV file size
10³ = 1000            52 kB
10⁴ = 10000           521 kB
10⁵ = 100000          5,2 MB
10⁶ = 1000000         53 MB
10⁷ = 10000000        537 MB
10⁸ = 100000000       5,4 GB
10⁹ = 1000000000      54 GB

We stopped at 10⁹ records because R and Python tried to use too much memory to process them (more than 128 GB).

Sample Code

R
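A minimal data.table sketch of the three steps (the input file name and header handling are assumptions):

library(data.table)

# 1. Read the CSV file
dt <- fread("sample.csv")

# 2. Keep only rows with LogLevel == "ERROR"
errors <- dt[LogLevel == "ERROR"]

# 3. Count rows grouped by Datacenter and Server
result <- errors[, .N, by = .(Datacenter, Server)]
print(result)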

Python
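A minimal pandas sketch of the same steps (again, the file name is an assumption):

import pandas as pd

# 1. Read the CSV file
df = pd.read_csv("sample.csv")

# 2. Keep only rows with LogLevel == "ERROR"
errors = df[df["LogLevel"] == "ERROR"]

# 3. Count rows grouped by Datacenter and Server
result = errors.groupby(["Datacenter", "Server"]).size()
print(result)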

Spark
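A minimal PySpark sketch using the built-in CSV reader and a Spark SQL query (file name and header handling are assumptions; the original code may have used Scala or the DataFrame API instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-benchmark").getOrCreate()

# 1. Read the CSV file with the Spark 2.0 built-in reader (header row assumed)
df = spark.read.csv("sample.csv", header=True)

# 2. and 3. Filter and aggregate with a Spark SQL query
df.createOrReplaceTempView("logs")
result = spark.sql("""
    SELECT Datacenter, Server, COUNT(*) AS cnt
    FROM logs
    WHERE LogLevel = 'ERROR'
    GROUP BY Datacenter, Server
""")
result.show()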

Test Environment

We used 3 servers with MooseFS as the distributed file system. Each server has 20 cores (40 with HT) and 128 GB RAM. Spark was configured with 60 cores and 60 GB RAM in total (20 cores and 20 GB RAM on each server). Python and R ran as single-threaded applications (with a maximum of 128 GB RAM).
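Assuming a Spark standalone cluster (the master URL and script name below are placeholders), that allocation roughly corresponds to a submission like:

spark-submit \
  --master spark://master:7077 \
  --total-executor-cores 60 \
  --executor-memory 20G \
  benchmark.py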

Results

Record count          Python           R                Spark
10³ = 1000            5,4 ms           3 ms             240 ms
10⁴ = 10000           15,5 ms          7 ms             250 ms
10⁵ = 100000          122 ms           45 ms            320 ms
10⁶ = 1000000         1,15 s           420 ms           400 ms
10⁷ = 10000000        11,9 s           5,8 s            780 ms
10⁸ = 100000000       119 s            81 s             4,05 s
10⁹ = 1000000000      Out of memory    Out of memory    33,8 s

Conclusion

For smaller data sets (fewer than 1 million records) the single-threaded solutions are faster, with R being about 2 times faster than Python. Spark needs some time (about 200-400 ms) for task initialization. For bigger data sets, Spark takes the lead.

Author

Maciej Bryński is the Big Data and Data Science Team Leader at IIIT.
