Re: Fetching lots of objects

Musall, Maik
Hi Andrus,

I'm continuing this on the dev@ list, if you don't mind.

> On 08.03.2017, at 20:13, Andrus Adamchik <[hidden email]> wrote:
>> It would be nice if Cayenne would internally parallelize things like ObjectResolver.objectsFromDataRows() and use lock-free strategies to deal with the caching.
> This is probably the last (and consequently the worst) place in Cayenne where locking still occurs. After I encountered this problem in a high-concurrency system, I've done some analysis of it (see [1] and also [2]), and this has been my "Cayenne 5.0" plan for a long time. With 4.0 making such progress as it does now, we may actually start contemplating it again.
> Andrus
> [1]
> [2]

Interesting read!

Regarding the array-based DataObject concept, wouldn't this mean for name-based attribute lookups that you still need a map somewhere that translates names to indexes? That map would only be needed once per entity, however.
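As a rough illustration of that once-per-entity map, here is a minimal sketch. The class names `EntityLayout` and `ArrayBackedObject` and their methods are hypothetical and are not Cayenne API; the sketch only assumes that attribute names are known when the layout is built.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-entity layout: built once and shared by every object of that entity.
final class EntityLayout {
    private final Map<String, Integer> indexByName = new HashMap<>();
    private final int size;

    EntityLayout(String... attributeNames) {
        for (int i = 0; i < attributeNames.length; i++) {
            indexByName.put(attributeNames[i], i);
        }
        this.size = attributeNames.length;
    }

    // Translates an attribute name into its slot in the value array.
    int indexOf(String name) {
        Integer index = indexByName.get(name);
        if (index == null) {
            throw new IllegalArgumentException("Unknown attribute: " + name);
        }
        return index;
    }

    int size() {
        return size;
    }
}

// Hypothetical array-backed object: values live in an array, names resolve via the shared layout.
final class ArrayBackedObject {
    private final EntityLayout layout;
    private final Object[] values;

    ArrayBackedObject(EntityLayout layout) {
        this.layout = layout;
        this.values = new Object[layout.size()];
    }

    Object readProperty(String name) {
        return values[layout.indexOf(name)];
    }

    void writeProperty(String name, Object value) {
        values[layout.indexOf(name)] = value;
    }
}
```

The map is filled in the constructor and only read afterwards, so each object carries just a reference to the shared layout plus its own value array.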

Instead of the array-based approach, did you also consider ConcurrentHashMap and similar classes in java.util.concurrent? It would not bring the other advantages of the array-based design beyond concurrency, but could perhaps serve as an easy intermediate step to get rid of the locking, and could even be implemented in 4.0 already.
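To make the ConcurrentHashMap idea concrete, here is a minimal sketch of an object registry that keeps exactly one instance per id without a single global lock. `ObjectRegistry` and `resolve` are hypothetical names, not Cayenne classes; only `ConcurrentHashMap.computeIfAbsent` is real JDK API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical registry mapping object ids to their unique in-memory instances.
final class ObjectRegistry<K, V> {
    private final ConcurrentMap<K, V> objectsById = new ConcurrentHashMap<>();

    // Returns the registered instance for the id, creating it at most once
    // even when several threads resolve the same id at the same time.
    V resolve(K id, Function<? super K, ? extends V> factory) {
        return objectsById.computeIfAbsent(id, factory);
    }

    int size() {
        return objectsById.size();
    }
}
```

Note that ConcurrentHashMap still synchronizes internally on individual bins during updates, so this reduces contention rather than being strictly lock-free.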

And on the [1] discussion, I'd like to mention my use case again: big queries with lots of prefetches to suck in gigabytes of data for aggregate computations using DataObject business logic. During those fetches, other users expect to be able to continue their regular workload concurrently (which they mostly cannot do with EOF, and that is my main reason to switch). So however the concept in [1] turns out, I'd like to also be able to parallelize the fetches themselves. A useful first step would be to execute disjoint prefetches in separate threads.
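The "disjoint prefetches in separate threads" step could look roughly like the sketch below. It uses only JDK concurrency classes; `ParallelPrefetch` and `fetchAll` are hypothetical names, and each `Supplier` stands in for one independent prefetch query. How the results would then be merged into a single context is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical helper: runs independent fetches concurrently on a fixed-size pool.
final class ParallelPrefetch {

    // Each supplier represents one disjoint prefetch; results come back in submission order.
    static <T> List<List<T>> fetchAll(List<Supplier<List<T>>> fetches, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<List<T>>> futures = new ArrayList<>();
            for (Supplier<List<T>> fetch : fetches) {
                futures.add(CompletableFuture.supplyAsync(fetch, pool));
            }
            List<List<T>> results = new ArrayList<>();
            for (CompletableFuture<List<T>> future : futures) {
                // join() waits for completion and rethrows a failure as CompletionException.
                results.add(future.join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```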

A second step could be to parallelize even a single big table scan query by partitioning. Databases have long been able to organize large tables into partitions that can be scanned independently of each other. Back in the day, with Oracle and slower spinning disks, you would spread partitions across independent disks. Today, with SSDs and zero seek time, that could still help to increase throughput when CPU is the limiting factor (databases also tend to generate high CPU loads when doing full table scans, but only on one core per scan). An idea could be to include a partitioning criterion in the model that matches the database's criterion for the table in question.

In the meantime I could try partitioning the queries on the application level, which can also work, but I'm back at the Graph Manager locking problem when merging them into one context for processing.
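One way to partition at the application level is to split the key range into contiguous slices and issue one range query per slice. The sketch below only computes the slices; it assumes a numeric primary key with known minimum and maximum values, and `KeyRange` and `split` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical half-open key range [fromInclusive, toExclusive) for one partial query.
final class KeyRange {
    final long fromInclusive;
    final long toExclusive;

    KeyRange(long fromInclusive, long toExclusive) {
        this.fromInclusive = fromInclusive;
        this.toExclusive = toExclusive;
    }

    // Splits [minKey, maxKey] into at most `partitions` contiguous ranges of near-equal size.
    // Assumes the key span is small enough that maxKey - minKey + 1 does not overflow.
    static List<KeyRange> split(long minKey, long maxKey, int partitions) {
        if (partitions < 1 || maxKey < minKey) {
            throw new IllegalArgumentException("Invalid range or partition count");
        }
        long total = maxKey - minKey + 1;
        long base = total / partitions;
        long remainder = total % partitions;
        List<KeyRange> ranges = new ArrayList<>();
        long start = minKey;
        for (int i = 0; i < partitions; i++) {
            // The first `remainder` ranges take one extra key each.
            long size = base + (i < remainder ? 1 : 0);
            if (size == 0) {
                continue; // more partitions than keys
            }
            ranges.add(new KeyRange(start, start + size));
            start += size;
        }
        return ranges;
    }
}
```

Each range would translate into a predicate such as "key >= from and key < to". Equal key ranges only give balanced work if keys are spread fairly evenly, which is an assumption that a partitioning criterion matching the database's own would avoid.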

Today's hardware, with databases on SSDs that can deliver 3 GByte/s or more and 16+ cores for processing, calls for parallelization on every level.