This is the first article in the Big Data series.
Introduction
Sometimes in a Data Scientist's work we need to perform analysis on CSV files. In this article I want to compare the performance of different tools on a real-world use case. The tools described are:
- R (using the data.table library)
- Python (using Pandas)
- Spark 2.0 (using Spark SQL)
We used Spark 2.0 because it has a built-in CSV reader. (As an alternative, Spark 1.6 can be used with the Spark CSV library.)
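For readers still on Spark 1.6, reading such a file through the spark-csv package looks roughly like this (a sketch only; it assumes a SQLContext named `sqlContext` and the `com.databricks:spark-csv` package on the classpath — the benchmark below uses Spark 2.0's built-in reader):

```python
# Spark 1.6 + spark-csv alternative (not part of the benchmark).
# Spark 2.0's built-in spark.read.csv, used later, replaces this.
df = (sqlContext.read
      .format('com.databricks.spark.csv')
      .option('header', 'false')
      .load('log1000.csv'))
```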
Task
The task consists of three parts (a plain-Python sketch of the whole pipeline follows the list):
- Read data from a CSV file with columns `Datacenter,Server,LogLevel,LogMessage`
- Filter the data with `LogLevel = "ERROR"`
- Aggregate the data and calculate counts grouped by `Datacenter,Server`
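To make the expected output concrete, here is a minimal plain-Python version of the three steps (illustrative only, not one of the benchmarked tools; it assumes the file has no header row, which matches the generator script in the next section):

```python
import csv
from collections import Counter

# Read, filter on LogLevel == 'ERROR', and count per (Datacenter, Server).
counts = Counter()
with open('log1000.csv', newline='') as f:
    for datacenter, server, loglevel, logmessage in csv.reader(f):
        if loglevel == 'ERROR':
            counts[(datacenter, server)] += 1

for (datacenter, server), n in sorted(counts.items()):
    print(datacenter, server, n)
```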
Sample data
The sample CSV file has the following format:
```
Datacenter,Server,LogLevel,LogMessage
0,server-0.example.com,DEBUG,Sample log message 0
```
The sample was generated using the following Python code:
```python
SERVERS = 100
DC = 5
TYPES = ['DEBUG', 'INFO', 'WARN', 'ERROR']

num = 1000

with open('log{}.csv'.format(num), 'w+') as f:
    for i in range(num):
        f.write('{dc},{server},{type},{message}\n'
                .format(dc=i % DC,
                        server='server-{}.example.com'.format(i % SERVERS),
                        type=TYPES[int(i / SERVERS) % 4],
                        message='Sample log message {}'.format(i)))
```
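The larger samples listed below were presumably produced by re-running the script with bigger values of `num`; wrapping the generator in a function makes that convenient (an illustrative refactoring, not code from the original benchmark):

```python
def generate_log(num, num_servers=100, num_dc=5,
                 types=('DEBUG', 'INFO', 'WARN', 'ERROR')):
    """Write `num` synthetic log lines to log<num>.csv (no header row)."""
    with open('log{}.csv'.format(num), 'w+') as f:
        for i in range(num):
            f.write('{dc},{server},{type},{message}\n'.format(
                dc=i % num_dc,
                server='server-{}.example.com'.format(i % num_servers),
                type=types[int(i / num_servers) % 4],
                message='Sample log message {}'.format(i)))

# The file sizes in the table below correspond to
# generate_log(10**3), generate_log(10**4), ..., generate_log(10**9).
```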
We tested different sample sizes. The files were left uncompressed, because there is no simple way to read a compressed CSV with data.table.
| Record count | CSV file size |
|---|---|
| 10³ = 1,000 | 52 kB |
| 10⁴ = 10,000 | 521 kB |
| 10⁵ = 100,000 | 5.2 MB |
| 10⁶ = 1,000,000 | 53 MB |
| 10⁷ = 10,000,000 | 537 MB |
| 10⁸ = 100,000,000 | 5.4 GB |
| 10⁹ = 1,000,000,000 | 54 GB |
We stopped at 10⁹ records because R and Python tried to use too much memory to process it (more than 128 GB).
Sample Code
R
```r
library(data.table)
dat <- fread("log1000.csv", sep = ",")
dat[V3 == "ERROR", .N, list(V1, V2)]
```
Python
```python
import pandas as pd
df = pd.read_csv('log1000.csv', names=['Datacenter', 'Server', 'LogLevel', 'LogMessage'])
df[df["LogLevel"] == 'ERROR'].groupby(['Datacenter', 'Server']).count()
```
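As a side note, the article does not state how the timings in the Results section were collected; a rough single-run measurement of the Pandas query can be taken like this (an illustrative sketch using `time.perf_counter`, not necessarily the method actually used):

```python
import time

import pandas as pd

# Time the whole pipeline: read, filter, group, count (single run).
start = time.perf_counter()
df = pd.read_csv('log1000.csv',
                 names=['Datacenter', 'Server', 'LogLevel', 'LogMessage'])
result = df[df['LogLevel'] == 'ERROR'].groupby(['Datacenter', 'Server']).count()
print('elapsed: {:.3f} s'.format(time.perf_counter() - start))
```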
Spark
```python
from pyspark.sql import functions
# Name the columns explicitly, since the generated file has no header row.
df = spark.read.csv('log1000.csv').toDF('Datacenter', 'Server', 'LogLevel', 'LogMessage')
df.where("LogLevel == 'ERROR'").groupBy("Datacenter", "Server").agg(functions.count("LogMessage")).collect()
```
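Since the comparison lists Spark SQL as the interface, the same query can also be expressed as literal SQL against a temporary view (an equivalent sketch; the view name `logs` is arbitrary):

```python
# Register the DataFrame as a temporary view and run the query as SQL.
df.createOrReplaceTempView('logs')
spark.sql("""
    SELECT Datacenter, Server, COUNT(LogMessage) AS error_count
    FROM logs
    WHERE LogLevel = 'ERROR'
    GROUP BY Datacenter, Server
""").collect()
```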
Test Environment
We used 3 servers with MooseFS as the distributed file system. Each server has 20 cores (40 with HT) and 128 GB RAM. Spark was configured with 60 cores and 60 GB RAM (20 cores and 20 GB RAM on each server). Python and R ran as single-threaded applications (with a maximum of 128 GB RAM).
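For reference, a resource configuration like the one described (60 cores and 60 GB in total, 20 cores and 20 GB per server) could be expressed when building the Spark session roughly as follows (a sketch; the master URL, application name, and deployment details are assumptions, not taken from the article):

```python
from pyspark.sql import SparkSession

# Hypothetical session setup matching the described resources;
# the master URL and app name are placeholders.
spark = (SparkSession.builder
         .appName('csv-benchmark')
         .master('spark://spark-master:7077')
         .config('spark.cores.max', '60')         # 60 cores in total
         .config('spark.executor.cores', '20')    # 20 cores per server
         .config('spark.executor.memory', '20g')  # 20 GB per server
         .getOrCreate())
```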
Results
| Record count | Python | R | Spark |
|---|---|---|---|
| 10³ = 1,000 | 5.4 ms | 3 ms | 240 ms |
| 10⁴ = 10,000 | 15.5 ms | 7 ms | 250 ms |
| 10⁵ = 100,000 | 122 ms | 45 ms | 320 ms |
| 10⁶ = 1,000,000 | 1.15 s | 420 ms | 400 ms |
| 10⁷ = 10,000,000 | 11.9 s | 5.8 s | 780 ms |
| 10⁸ = 100,000,000 | 119 s | 81 s | 4.05 s |
| 10⁹ = 1,000,000,000 | Out of memory | Out of memory | 33.8 s |
Conclusion
For smaller data (fewer than 1 million records) the single-threaded solutions are faster, with R being roughly 2 times faster than Python. Spark needs some time (about 200-400 ms) for task initialization. For bigger data, Spark takes the lead.
Author
Maciej Bryński is the Big Data and Data Science Team Leader at IIIT.