Big Data & Complexity

Pranjal Gupta, PhD student in Cognitive Neuroscience at Duke University, working with Dr. John Pearson on online dimensionality reduction.

As human beings continue to march through the information age, we produce ever-larger quantities of data every second. This is especially relevant to us as scientists, as data gathering and data analysis have changed dramatically even in the past decade due to advances in various technologies. We may find ourselves ploughing through gigabytes or even terabytes of data to find some kind of meaning or structure. In order to handle a large volume of data in a timely manner without gobbling up all of your computer's resources, it is important to understand computational complexity and the tools available for managing big datasets.

First we will briefly go over some theory of computational complexity, then we will get to some practical examples.

[Figure: bigO.jpeg, a visual comparison of common complexity classes]

Big O

In analyzing how long an algorithm will take to run on some data, it is often useful to describe how its runtime grows as the data gets larger. Big O notation is a way of describing how the time taken to run an algorithm relates to the size of its input. For example, if an algorithm takes a single input argument called data which is a list of \(n\) numbers, we would like to know how long the algorithm might take as \(n \rightarrow \infty\). This is known as the time complexity of an algorithm. (Memory, or space, complexity also exists, which indicates how much memory storage is required to run an algorithm. Often, there is a tradeoff between memory complexity and time complexity.)

Imagine our job is to simply print the 5th element of data. This would be an operation that basically takes a single computation: accessing the 5th element. No matter how long the list is, printing the 5th element would always take the same amount of time. These kinds of operations are known as having constant complexity, noted in Big O notation as \(\textrm{O}(1)\).
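
As a quick sketch of this (illustrative only; the exact numbers will vary by machine), repeatedly reading one element of a Python list takes about the same amount of time whether the list has a thousand entries or ten million:

import time

# accessing a single element takes about the same time no matter how long the list is
for n in [10**3, 10**5, 10**7]:
    data = list(range(n))
    start = time.time()
    for _ in range(100000):
        x = data[4]          # constant-time, O(1), access to the 5th element
    print('n = {:>8}: {:.4f} s for 100,000 accesses'.format(n, time.time() - start))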

Now imagine our job is to get the sum of data. We must access every element of the list to get the total sum:

define get_sum(data):
    sum = 0
    for i = 1 to n:
        sum = sum + data[i]  # update the running sum with the i-th element
    return sum

If we know the line sum = sum + data[i] takes some amount of time \(k\), then we know that this algorithm will take something like \(kn\) time, which can be put into Big O notation as \(\textrm{O}(kn)\) time. This is considered linear time, as it is linear in the size of the data. Since Big O is concerned with describing how algorithms scale to arbitrarily large datasets (i.e., getting the sum of an extremely large list), it (usually) doesn't matter what \(k\) is, as it does not depend on \(n\). Big O notation ignores terms which don't change with data size, so this sum algorithm would be denoted as having a time complexity of \(\textrm{O}(n)\).

One way to think about Big O notation in terms of your own code is to look for nested loops. Each additional nested loop of size \(n\) multiplies the execution time by another factor of \(n\). For example:

for i = 1 to n: 
    for j = 1 to n:
        do_something()

This will cause do_something() to be executed \(n^2\) times, indicating that this algorithm could have a complexity of \(\textrm{O}(n^2)\). The real complexity depends on what do_something() actually does.

Similar to how Big O notation ignores constants that don't grow with the data, it also keeps only the dominant, fastest-growing term (see the figure at the beginning for a visual depiction of some complexity classes). For example, consider writing some code that computes some pairwise metric over your data:

for i = 1 to n:
    for j = i+1 to n:                    # now looping from i+1
        do_comparison(data[i], data[j])  # data[i] could be a number, a vector, etc.

Then the inner command is executed \(\frac{n(n-1)}{2}\) times (from doing the combinatorial \(\binom{n}{2}\) comparisons). As above, we are interested in how the algorithm scales to large data; dividing by the constant 2 is ignored, so this pairwise metric algorithm is \(\textrm{O}(n^2 - n)\). Notice that as \(n\) gets arbitrarily large, the \(n^2\) term dominates the execution time. That is, as \(n \rightarrow \infty\), we have \(n^2 - n \approx n^2\), indicating that the above code block has a time complexity of \(\textrm{O}(n^2)\).

Complexity of common data processing algorithms

  • Sorting is a pretty key thing to do with data. Sorting algorithms, on average, have a time complexity of \(\textrm{O}(n \log n)\). This is pretty scalable.
  • As above, considering pairwise interactions of a set of data is \(\textrm{O}(n^2)\). For me, there are times when this is fine, and times when it is not (consider getting pairwise interactions of 100,000 data points).
  • Solving systems of linear equations often involves something like a matrix inversion, which is \(\textrm{O}(n^3)\) with the standard algorithm, where \(n\) is the number of rows (and columns) of the square matrix. Matrix multiplication is also around this complexity. The true complexity of these operations is something like \(\textrm{O}(n^{2.807})\) due to fast algorithms (e.g., Strassen's algorithm). For me, this isn't unscalable, but sometimes can be a pain.
  • Doing PCA/SVD on a matrix \(A \in \mathbb{R}^{n \times t}\) (a matrix with \(n\) rows and \(t\) columns) has complexity \(\textrm{O}( \min (n^2t, nt^2) )\). That is, it is quadratic in the smaller of the two dimensions and linear in the larger. I work on streaming/online algorithms, so doing SVD over and over again on the incoming data isn't an option, as it gets really slow really fast. (The sketch after this list times a few of these operations as the data grows.)
  • Sampling from distributions: it depends on what you’re sampling from and what you end up using it for.
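
These are rough rules of thumb; to see how a few of them play out in practice, here is a small timing sketch (the sizes are arbitrary, and the exact numbers depend on your machine and on numpy's underlying linear algebra libraries):

import time
import numpy as np

for n in [500, 1000, 2000]:
    A = np.random.normal(size=(n, n))
    v = np.random.normal(size=n * n)

    start = time.time()
    np.sort(v)                             # sorting m = n*n numbers: O(m log m)
    t_sort = time.time() - start

    start = time.time()
    A @ A                                  # matrix multiplication: roughly O(n^3)
    t_mm = time.time() - start

    start = time.time()
    np.linalg.svd(A, full_matrices=False)  # SVD: O(min(n^2 t, n t^2)), here n = t
    t_svd = time.time() - start

    print('n = {}: sort {:.3f}s, matmul {:.3f}s, svd {:.3f}s'.format(n, t_sort, t_mm, t_svd))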

Practical considerations

Strategies for speeding up computations on large data

1) Vectorizing and Broadcasting (numpy)

Numpy offers a few ways to speed up code execution. Vectorizing and broadcasting are two common ones, and both work by avoiding explicit Python loops. Vectorized numpy code doesn't get its speed from parallelizing computations; rather, it relies on extremely fast loops written outside of Python, in compiled C (for details, see https://chelseatroy.com/2018/11/07/code-mechanic-numpy-vectorization/).

Vectorizing is the concept of avoiding writing a for loop when dealing with numpy arrays by “stacking” or “folding” common operations into a single instruction. One simple example involves the sum operation:

# first load libraries
import time                      # python standard library for all things timing
import numpy as np               # library to manipulate arrays
import matplotlib.pyplot as plt  # library to plot things

# sum using a python for loop
def python_sum(a):
    sum = 0
    for i in range(a.shape[0]):
        sum += a[i]
    return sum

# sum using numpy's internal sum (uses C for loops)
def vectorized_sum(a):
    return a.sum()

# timing comparison
n = 1000                        # length of vector
a = np.random.normal(size=(n))  # random gaussian vector

print('python loop:')
%timeit -n 1000 -r 5 python_sum(a)  # ipython/jupyter magic to time a function
print('vectorized:')
%timeit -n 1000 -r 5 vectorized_sum(a)

python loop:
1000 loops, best of 5: 254 µs per loop
vectorized:
1000 loops, best of 5: 2.88 µs per loop

So, under most circumstances, use built-in functions! They're pretty fast. Another concrete example of avoiding loops with built-in functions: finding the (squared) Euclidean distance between large vectors (images):

# loop version
def l2_loop(image1, image2):
    height, width, channels = image1.shape
    distance = 0

    for h in range(height):
        for w in range(width):
            for c in range(channels):
                distance += (image1[h][w][c] - image2[h][w][c])**2
    return distance

# vectorised version
def l2_vectorise(image1, image2):
    return ((image1 - image2)**2).sum()


# simulate Gaussian white noise images
h, w, c = 520, 640, 3 # height (pixels), width, and colors
image1 = np.random.normal(size=(h,w,c))
image2 = np.random.normal(size=image1.shape)

# time functions
%timeit -n 10 -r 3 l2_loop(image1, image2)
%timeit -n 10 -r 3 l2_vectorise(image1, image2)

10 loops, best of 3: 1.89 s per loop
10 loops, best of 3: 2.98 ms per loop

There are many ways to convert loops into vectorized or broadcasted functions, and the right one depends on the structure of the loops. I usually write loops over numpy arrays on the first pass, just to get something working, then I go back and think about how those loops could be vectorized.
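
Broadcasting is the related trick of letting numpy stretch arrays of different shapes against each other instead of writing the loop yourself. As a small sketch of the pairwise idea from earlier (the sizes are arbitrary), here are all pairwise squared distances between rows of a matrix, computed without any Python loops:

import numpy as np

X = np.random.normal(size=(500, 3))             # 500 points in 3 dimensions

# X[:, None, :] has shape (500, 1, 3) and X[None, :, :] has shape (1, 500, 3);
# broadcasting expands both to (500, 500, 3) without an explicit loop
diffs = X[:, None, :] - X[None, :, :]
pairwise_sq_dists = (diffs ** 2).sum(axis=-1)   # shape (500, 500)

print(pairwise_sq_dists.shape)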

2) Parallelizing code

There are a few ways of parallelizing code: on your CPU, or on your GPU (or on a TPU in Google Colab!)

CPUs most often contain on the order of 4 to 16 cores. A core is a computing resource: roughly, each core can execute one instruction at a time, and CPU cores are extremely fast. Parallelizing on the CPU involves writing code that takes advantage of all available cores (through threads or processes). Most functions in numpy, scipy, and sklearn (or the R equivalents) already take advantage of these tools, so I haven't found myself needing to worry about writing code to be parallelized on the CPU.
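
That said, if you do want to parallelize your own code across CPU cores, Python's built-in multiprocessing module is one way to do it. Here is a minimal sketch (the chunk count and the number of worker processes are arbitrary choices):

import numpy as np
from multiprocessing import Pool

def summarize_chunk(chunk):
    # stand-in for some per-chunk work, here just a mean
    return chunk.mean()

if __name__ == '__main__':
    data = np.random.normal(size=1000000)
    chunks = np.array_split(data, 8)    # split the data into 8 pieces

    with Pool(processes=4) as pool:     # 4 worker processes handle chunks in parallel
        results = pool.map(summarize_chunk, chunks)

    print(np.mean(results))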

GPUs are Graphics Processing Units, which are processors originally built to draw graphics quickly (such as for computer displays, animations, or video games). GPUs contain thousands of cores, each of which is individually much slower than a CPU core. GPUs speed up computations by performing many computations simultaneously across many cores, rather than performing them sequentially across a few cores. TPUs are Tensor Processing Units: processors manufactured by Google specifically for the parallel computations involved in linear algebra and in taking gradients for deep learning.

Running parallelized data processing code on a GPU has become easier over the years, but actually seeing performance gains from the GPU can be tricky. There is a transfer involved in getting the data from CPU RAM to GPU RAM (yes, the GPU has its own RAM). Google Colab offers GPU and TPU resources, but doing apples-to-apples comparisons takes some care.

First, let’s see how matrix multiplication does on the CPU:

# doing matrix-matrix multiplication
for n in [100, 500, 1000, 2500, 5000]:  # testing out range of matrix sizes
    A = np.random.normal(size=(n, n))
    B = np.random.normal(size=A.shape)
    start = time.time()
    C = A @ B
    end = time.time() - start
    print('time (s): {:.3f} \t'.format(end), 'n = {}'.format(n))

time (s): 0.006 	 n = 100
time (s): 0.009 	 n = 500
time (s): 0.071 	 n = 1000
time (s): 0.957 	 n = 2500
time (s): 7.344 	 n = 5000

Now let's switch Google Colab to use the GPU: in the top left, go to Edit -> Notebook settings, open the hardware acceleration drop-down menu, and pick GPU or TPU. Then we have to reload the libraries, as changing the hardware starts a new ipython kernel. We can see that the GPU runtime is faster here, but not consistently so.

import numpy as np
import time

# doing matrix-matrix multiplication
for n in [100, 500, 1000, 2500, 5000]:  # testing out range of matrices
    A = np.random.normal(size=(n, n))
    B = np.random.normal(size=A.shape)
    start = time.time()
    C = A @ B
    end = time.time() - start
    print('time (s): {:.3f} \t'.format(end), 'n = {}'.format(n))

time (s): 0.008 	 n = 100
time (s): 0.006 	 n = 500
time (s): 0.036 	 n = 1000
time (s): 0.522 	 n = 2500
time (s): 4.196 	 n = 5000

GPUs don’t necessarily make the execution time faster. While the computation of the matrix-matrix product may be faster on GPU, doing this requires the machine to copy the data from the computer RAM to the GPU RAM. This can take time.

My personal experience with using GPUs to accelerate matrix multiplications or decompositions is that it’s not worth it unless your matrices have more than 1000 rows and/or columns.
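
To explicitly place a computation on the GPU (numpy arrays themselves live in CPU RAM), you can use a library such as PyTorch or CuPy. Here is a minimal PyTorch sketch (my choice of library; it assumes a CUDA-capable GPU is available) that includes the CPU-to-GPU copy in the timed section:

import time
import torch

for n in [1000, 2500, 5000]:
    A = torch.randn(n, n)           # created in CPU RAM
    B = torch.randn(n, n)

    start = time.time()
    A_gpu = A.to('cuda')            # copy from CPU RAM to GPU RAM
    B_gpu = B.to('cuda')
    C = A_gpu @ B_gpu               # matrix multiplication runs on the GPU
    torch.cuda.synchronize()        # wait for the GPU to finish before stopping the clock
    end = time.time() - start
    print('time (s): {:.3f} \t'.format(end), 'n = {}'.format(n))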

3) Reduce the size of your data (dimension reduction, randomizing, sketching, etc.)

Here is a grab-bag of strategies to use. Some of these may be useful heuristics, while others are active areas of research. The idea behind most of these is to produce a dataset that is smaller than the original (perhaps small enough to fit into RAM) but which maintains some set of properties of the original.

Dimension reduction is a well-known class of methods to produce a smaller dataset. There are many, many methods out there, each of which preserves a different property of the dataset. Unfortunately, doing dimension reduction on large datasets is pretty difficult - some groups have developed computer clusters specifically to perform linear dimension reduction (PCA) on huge datasets.

You may try fitting a dimension reduction method on as much data as you can load into RAM, and then use that solution as the initialization for reducing the next set of data that is loaded. Both this paper and this paper used this kind of approach; they dealt with doing SVD on video data, which are much too large to fit into memory.
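
scikit-learn's IncrementalPCA implements one version of this chunk-at-a-time idea (not necessarily the exact method used in those papers); a minimal sketch, with random chunks standing in for data loaded from disk:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=10)

# pretend each iteration loads one chunk of a dataset too big to fit in RAM
for chunk_idx in range(20):
    chunk = np.random.normal(size=(1000, 500))   # 1000 samples, 500 features
    ipca.partial_fit(chunk)                      # update the fit with this chunk

reduced = ipca.transform(np.random.normal(size=(100, 500)))  # project new data
print(reduced.shape)   # (100, 10)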

You could try sampling your huge dataset to produce a smaller dataset with similar properties. This is the idea behind randomized algorithms, such as randomized SVD. Of course, these algorithms come with weaker guarantees and may not preserve the property you are interested in.
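
For example, scikit-learn provides randomized_svd, which approximates the top few singular vectors by working with random projections of the data; a quick sketch with arbitrary sizes:

import numpy as np
from sklearn.utils.extmath import randomized_svd

A = np.random.normal(size=(5000, 300))

# approximate only the top 10 singular vectors/values instead of a full decomposition
U, S, Vt = randomized_svd(A, n_components=10, random_state=0)
print(U.shape, S.shape, Vt.shape)   # (5000, 10) (10,) (10, 300)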

Concepts like sketching are also in this line of thought, though sketching is based on some fancier math.

Computer memory

The hard disk is the permanent memory of your computer. When you save any file or data to your computer, it’s saved to disk. It’s a bit like writing something down in a notebook - you can access it when necessary, but you’ll probably forget about it at some point.

RAM stands for Random Access Memory. Roughly, RAM corresponds to the amount of data that your computer can quickly access at any particular moment. RAM is like having something in your short-term memory. When you do something like load_data(file_location) in a script or interactive environment, the data is read from disk and then written to (takes up space in) RAM. Loading data is like reading something from a notebook so that you can actively work with that information. Readily having data available in RAM is necessary to visualize, model, or otherwise process and analyze it in some way.

It is often the case that we can trade memory resources for time resources. Pre-allocating arrays for speed is the most common example of this. The following code grows an array from the ground up, concatenating a new row onto it at every iteration:

%%timeit -n 10
arr = np.random.randn(1,10)
for i in range(10000 - 1):
        arr = np.concatenate((arr, np.random.rand(1,10))) # concatenating

10 loops, best of 5: 597 ms per loop

Whereas this code preallocates the entire array first, then fills it in:

%%timeit -n 10
arr = np.zeros((10000, 10)) # preallocating
for i in range(10000):
    arr[i] = np.random.rand(1,10)

10 loops, best of 5: 17.4 ms per loop

As we can see, preallocating memory is more than 30x faster here. Thus, if you know the size of your data before processing, always make sure to preallocate (unless RAM is an issue).

In situations where the size of the data is unknown in advance, one way to avoid costly numpy copying is to create a python list of numpy arrays, appending a numpy array to the python list for each new entry. This works because appending to a python list is a constant-time, \(\textrm{O}(1)\), operation (amortized). After creating this python list of numpy arrays, we can convert the list into one overall numpy array, as sketched below.
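
A small sketch of that pattern (sizes are arbitrary), assuming new rows arrive one at a time:

import numpy as np

rows = []                            # plain python list; appends are cheap
for i in range(10000):               # e.g. rows arriving from a stream or a file
    rows.append(np.random.rand(10))  # each entry is a small numpy array

arr = np.stack(rows)                 # one copy at the end into a (10000, 10) array
print(arr.shape)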