Hi TypeDB community,
It would be great if you could help me shed some light on my failed attempts at reducing the execution time of a bulk load when using the Python driver in TypeDB 2.29.0.
The initial approach consisted of a single process with batches, which takes around 8 minutes to load ~42000 insert queries. The concerned functions roughly look as follows:
def load(self, data: list[str], batch_size=100):
    amount = len(data)
    with TypeDB.core_driver(self.server_address) as driver:
        with driver.session(self.db_name, SessionType.DATA) as session:
            for i in range(0, amount, batch_size):
                self.load_batch(session, data[i:i+batch_size])

def load_batch(self, session, queries: list[str]):
    with session.transaction(TransactionType.WRITE) as transaction:
        for query in queries:
            transaction.query.insert(query)
        transaction.commit()
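For reference, the batching arithmetic behind these numbers can be sketched on its own, without any TypeDB calls. This is only an illustration: `chunk` is a hypothetical helper that mirrors the slicing in load, the query strings are placeholders, and the count of exactly 42000 and the 8 minute total are the approximate figures quoted above.

```python
def chunk(items: list[str], batch_size: int = 100) -> list[list[str]]:
    """Split items into consecutive slices of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Placeholder strings stand in for the real insert queries.
queries = [f"query {n}" for n in range(42_000)]
batches = chunk(queries)

# Approximate wall-clock time observed for the whole load.
total_seconds = 8 * 60

# One write transaction (and one commit) is opened per batch.
print(len(batches), "transactions")
print(round(total_seconds / len(queries) * 1000, 1), "ms per query on average")
```

With these figures, each batch of 100 queries maps to one write transaction, so the per-query average includes a share of the commit cost.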
To speed up the loading, I tried to implement some form of multi-threading. As per the Python API documentation, I used TypeDBOptions(parallel=True) to enable parallel execution in the server, in combination with a ThreadPoolExecutor from the concurrent.futures Python module. The modified load function looks as follows:
import os
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def load(self, data: list[str], batch_size=100):
    amount = len(data)
    # Pre-slice the queries so each worker thread receives one batch.
    batches = []
    for i in range(0, amount, batch_size):
        batches.append(data[i:i+batch_size])
    with TypeDB.core_driver(self.server_address) as driver:
        with driver.session(self.db_name, SessionType.DATA, TypeDBOptions(parallel=True)) as session:
            with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor:
                executor.map(partial(self.load_batch, session), batches)
However, the runtime remains unchanged. Some additional notes here:
- It seemed more natural to run several threads concurrently committing transactions in the same session (as in the code above). However, I also tried first moving the session context manager into the load_batch function, and then moving the driver as well, so that multiple drivers ran in parallel; there was still no difference in the execution time.
- I played with the batch size, without success either.
- I also followed the example in the documentation: TypeDB | Docs > Manual > Optimizing speed, although the parallel option is not considered. No noticeable difference was observed either.
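One detail that may be worth ruling out in the threaded version: executor.map returns a lazy iterator, and an exception raised inside a worker is only re-raised when the corresponding result is retrieved. Since the code above never consumes the results, a failing batch would go unnoticed. The sketch below shows that behaviour in isolation; `fake_load_batch` is a hypothetical stand-in for load_batch and makes no TypeDB calls, so it says nothing about whether this is actually happening in the real load.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_load_batch(batch: list[str]) -> int:
    """Hypothetical stand-in for load_batch: fails on a marked batch."""
    if "bad" in batch:
        raise ValueError("commit failed")
    return len(batch)


batches = [["a", "b"], ["bad"], ["c"]]

# Results never consumed: the worker's exception is silently dropped.
with ThreadPoolExecutor(max_workers=2) as executor:
    executor.map(fake_load_batch, batches)

# Results consumed: the exception surfaces in the calling thread.
errors = []
with ThreadPoolExecutor(max_workers=2) as executor:
    try:
        loaded = list(executor.map(fake_load_batch, batches))
    except ValueError as exc:
        errors.append(str(exc))

print(errors)
```

Wrapping the call in list(...) in the real load function would be enough to make any hidden failure visible.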
At this point, I’d appreciate your help clarifying the following questions:
- As I understand it, TypeDBOptions(parallel=True) acts on the server, so if it is enabled in the session, it should remain enabled for the transactions too. Is this correct? (In any case, I tried adding the parameter to both the session and the transaction, with no noticeable difference.)
- Do any settings need to be modified in the server for the parallelization to take effect?
- Could it be that due to the complexity of the inserts (perhaps too many matching statements) this is the best performance that can be achieved?
- Is there perhaps a flaw in the implementation above that I’m missing?
Any insight or suggestions you can provide will be greatly appreciated.
Thanks in advance