
[Documentation] Explain performance improvements #670

@jan-janssen


Generate data:

import numpy as np
import pandas as pd

N = 1_000_000
data = pd.DataFrame({
    "c": np.random.choice(["a", "b", "c"], size=N),
    "x": np.random.uniform(size=N),
    "y": np.random.normal(size=N)
})

data.to_csv("blob.csv")  # File is about 45 MB

Slow execution, because pd.read_csv() runs again for each of the 100 submitted tasks: 24.1 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import numpy as np
import pandas as pd
from executorlib import SingleNodeExecutor

def get_sum(i, df):
    return i, df["x"].sum(), df["y"].sum()

with SingleNodeExecutor(max_workers=10) as exe:
    future_lst = [exe.submit(get_sum, df=pd.read_csv("blob.csv"), i=i) for i in range(100)]
    result_lst = [f.result() for f in future_lst]
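
To illustrate why the version above reads the file 100 times: Python evaluates the argument expression before submit() is called, so the load happens once per submitted task, in the submitting code. The sketch below is not executorlib; it assumes the standard library's ThreadPoolExecutor as a stand-in executor and a hypothetical load_table() function in place of pd.read_csv("blob.csv"), and only counts how often the loader runs.

```python
from concurrent.futures import ThreadPoolExecutor

load_calls = 0  # counts how often the stand-in loader runs


def load_table():
    """Hypothetical stand-in for pd.read_csv("blob.csv")."""
    global load_calls
    load_calls += 1
    return {"x": [0.1, 0.2, 0.3], "y": [1.0, -1.0, 0.5]}


def get_sum(i, df):
    return i, sum(df["x"]), sum(df["y"])


with ThreadPoolExecutor(max_workers=10) as exe:
    # load_table() is evaluated here, before each submit() call is made
    future_lst = [exe.submit(get_sum, df=load_table(), i=i) for i in range(100)]
    result_lst = [f.result() for f in future_lst]

print(load_calls)  # 100: one load per submitted task
```

The number of workers does not change this count; only moving the load out of the submit() call does.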

Reduce the startup time for the processes with block_allocation=True: 19.5 s ± 31.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import numpy as np
import pandas as pd
from executorlib import SingleNodeExecutor

def get_sum(i, df):
    return i, df["x"].sum(), df["y"].sum()

with SingleNodeExecutor(max_workers=10, block_allocation=True) as exe:
    future_lst = [exe.submit(get_sum, df=pd.read_csv("blob.csv"), i=i) for i in range(100)]
    result_lst = [f.result() for f in future_lst]
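
A rough reading of the two timings reported so far, as plain arithmetic. It assumes the whole difference between the two runs comes from process startup, which the measurements alone do not prove; the variable names are only for illustration.

```python
n_tasks = 100     # range(100) in the examples above
t_default = 24.1  # seconds, reported for the first run
t_block = 19.5    # seconds, reported with block_allocation=True

saved_total = t_default - t_block
saved_per_task = saved_total / n_tasks
remaining_per_task = t_block / n_tasks

print(f"saved in total:     {saved_total:.1f} s")
print(f"saved per task:     {saved_per_task * 1000:.0f} ms")
print(f"remaining per task: {remaining_per_task * 1000:.0f} ms")
```

Under that assumption the saving is about 46 ms per task, while about 195 ms per task remain, which suggests that most of the time is spent elsewhere, for example in reading and transferring the data.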

Load the data only once for each process with init_function: 946 ms ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import numpy as np
import pandas as pd
from executorlib import SingleNodeExecutor

def get_sum(i, df):
    return i, df["x"].sum(), df["y"].sum()

def init_funct():
    return {"df": pd.read_csv("blob.csv")}

with SingleNodeExecutor(max_workers=10, block_allocation=True, init_function=init_funct) as exe:
    future_lst = [exe.submit(get_sum, i=i) for i in range(100)]
    result_lst = [f.result() for f in future_lst]
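
In the example above the dictionary returned by init_funct appears to supply the df argument that get_sum is no longer given at submit time. The same "load once per worker" idea can be sketched with the standard library alone. This is not how executorlib is implemented; it assumes ThreadPoolExecutor with its initializer argument, thread-local storage instead of keyword injection, and a hypothetical load_table() in place of pd.read_csv("blob.csv").

```python
import threading
from concurrent.futures import ThreadPoolExecutor

worker_state = threading.local()  # separate storage for each worker thread
load_calls = 0                    # counts how often the stand-in loader runs
load_lock = threading.Lock()


def load_table():
    """Hypothetical stand-in for pd.read_csv("blob.csv")."""
    global load_calls
    with load_lock:
        load_calls += 1
    return {"x": [0.1, 0.2, 0.3], "y": [1.0, -1.0, 0.5]}


def init_worker():
    # Runs once when a worker starts; the table stays attached to that worker.
    worker_state.df = load_table()


def get_sum(i):
    df = worker_state.df  # reuse the table loaded by init_worker()
    return i, sum(df["x"]), sum(df["y"])


with ThreadPoolExecutor(max_workers=10, initializer=init_worker) as exe:
    future_lst = [exe.submit(get_sum, i) for i in range(100)]
    result_lst = [f.result() for f in future_lst]

print(load_calls)  # at most 10, one load per started worker
```

The loader now runs once per started worker instead of once per task, which is the same change that separates the 19.5 s run from the 946 ms run above.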
