Fetching lots of objects

Musall, Maik
Hi all,

I have a number of statistics functions which need to fetch large amounts of objects. I need the actual DataObjects because that's where the business logic is that I need for the computations.

Let's say I need to fetch 300,000 objects. Let's also assume the database sits on a fast SSD array and can serve multiple connections easily. I'm assuming in this case the CPU time needed for DataObject instantiation is the main performance constraint. Is that correct?

If so, how can I speed this up? Could I partition my fetch, and fetch in several threads in parallel into the same ObjectContext? Or is there an easier way to make use of multiple CPU cores for this?

Thanks
Maik

Re: Fetching lots of objects

Markus Reich
Hi Maik,

Maybe you could use the new iterator and split it for parallel
computation?

public static <T> Stream<T> asStream(Iterator<T> sourceIterator, boolean parallel) {
    Iterable<T> iterable = () -> sourceIterator;
    return StreamSupport.stream(iterable.spliterator(), parallel);
}

found at
http://stackoverflow.com/questions/24511052/how-to-convert-an-iterator-to-a-stream
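
For illustration, here is a minimal, self-contained sketch of how that helper could be called. The class name IteratorStreams, the square() computation and the in-memory list are assumptions made up for this example; a real caller would pass the iterator returned by the query instead. Note that a stream built from a plain iterator has no known size, so how well it splits across cores may vary.

```java
import java.util.Iterator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class IteratorStreams {

    // Same helper as in the post: wraps an Iterator in a (optionally parallel) Stream.
    static <T> Stream<T> asStream(Iterator<T> sourceIterator, boolean parallel) {
        Iterable<T> iterable = () -> sourceIterator;
        return StreamSupport.stream(iterable.spliterator(), parallel);
    }

    // Hypothetical stand-in for a per-object computation.
    static long square(int n) {
        return (long) n * n;
    }

    // Consumes the iterator once and sums the per-element results.
    static long sumOfSquares(Iterator<Integer> it, boolean parallel) {
        return asStream(it, parallel).mapToLong(IteratorStreams::square).sum();
    }

    public static void main(String[] args) {
        // In-memory data standing in for fetched rows.
        List<Integer> rows = IntStream.rangeClosed(1, 1000).boxed().collect(Collectors.toList());
        System.out.println(sumOfSquares(rows.iterator(), true));
    }
}
```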

br
Meex

Re: Fetching lots of objects

Musall, Maik
Hi Markus,

I know how to do the actual computation in parallel. My question is how to fetch and instantiate the DataObjects in parallel before I can start the computations. An iterator would only slow down the fetch because of the added roundtrips. Iterators are about reducing memory footprint, while I am not memory-constrained here.

Maik

Re: Fetching lots of objects

Aristedes Maniatis-2
In reply to this post by Musall, Maik
On 7/3/17 8:25am, Musall, Maik wrote:
> Hi all,
>
> I have a number of statistics functions which need to fetch large amounts of objects. I need the actual DataObjects because that's where the business logic is that I need for the computations.
>
> Let's say I need to fetch 300,000 objects. Let's also assume the database sits on a fast SSD array and can serve multiple connections easily. I'm assuming in this case the CPU time needed for DataObject instantiation is the main performance constraint. Is that correct?
>
> If so, how can I speed this up? Could I partition my fetch, and fetch in several threads in parallel into the same ObjectContext? Or is there an easier way to make use of multiple CPU cores for this?


I don't think there is anything in Cayenne that will specifically help you here. However, if you can partition your search query, then of course you can fetch the data in multiple threads in parallel.
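
One way to picture such a partitioning, as a rough sketch that uses only the JDK: split a numeric key range into contiguous pieces and run one fetch per piece on its own thread. Everything here is an assumption for illustration (the PartitionedFetch class, the Range type, a numeric key to split on, and the fake fetch function); in a real application each range would become the qualifier of one query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

class PartitionedFetch {

    /** Inclusive key range [from, to]; one range per partial query. */
    static final class Range {
        final long from, to;
        Range(long from, long to) { this.from = from; this.to = to; }
    }

    /** Splits [min, max] into at most `parts` contiguous, non-overlapping ranges. */
    static List<Range> split(long min, long max, int parts) {
        List<Range> ranges = new ArrayList<>();
        long total = max - min + 1;
        long base = total / parts, extra = total % parts;
        long start = min;
        for (int i = 0; i < parts && start <= max; i++) {
            long size = base + (i < extra ? 1 : 0); // spread the remainder over the first ranges
            ranges.add(new Range(start, start + size - 1));
            start += size;
        }
        return ranges;
    }

    /** Runs one fetch per range on its own thread and concatenates the results in range order. */
    static <T> List<T> fetchAll(List<Range> ranges, Function<Range, List<T>> fetch) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, ranges.size()));
        try {
            // Submit every partial fetch first so they all run concurrently...
            List<CompletableFuture<List<T>>> futures = ranges.stream()
                    .map(r -> CompletableFuture.supplyAsync(() -> fetch.apply(r), pool))
                    .collect(Collectors.toList());
            // ...then wait for each one and merge.
            return futures.stream()
                    .flatMap(f -> f.join().stream())
                    .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in fetch: a real one would run a query restricted to the range.
        Function<Range, List<Long>> fakeFetch = r -> {
            List<Long> ids = new ArrayList<>();
            for (long id = r.from; id <= r.to; id++) ids.add(id);
            return ids;
        };
        List<Long> all = fetchAll(split(1, 300_000, 4), fakeFetch);
        System.out.println(all.size());
    }
}
```

Whether this pays off depends on where the time goes; if the partial fetches end up contending on shared state, the threads will mostly wait on each other.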

You might also want to fetch into DataRows rather than creating object entities. I'm not sure if that will make your use case faster, but you could try, especially if you don't need all the columns from the db entity.

Ari



--
-------------------------->
Aristedes Maniatis
GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A

Re: Fetching lots of objects

Musall, Maik
Hi Ari,

> On 07.03.2017 at 23:14, Aristedes Maniatis <[hidden email]> wrote:
>
> On 7/3/17 8:25am, Musall, Maik wrote:
>> Hi all,
>>
>> I have a number of statistics functions which need to fetch large amounts of objects. I need the actual DataObjects because that's where the business logic is that I need for the computations.
>>
>> Let's say I need to fetch 300,000 objects. Let's also assume the database sits on a fast SSD array and can serve multiple connections easily. I'm assuming in this case the CPU time needed for DataObject instantiation is the main performance constraint. Is that correct?
>>
>> If so, how can I speed this up? Could I partition my fetch, and fetch in several threads in parallel into the same ObjectContext? Or is there an easier way to make use of multiple CPU cores for this?
>
>
> I don't think there is anything in Cayenne that will specifically help you here. However, if you can partition your search query, then of course you can fetch the data in multiple threads in parallel.
>
> You might also want to fetch into DataRows rather than creating object entities. I'm not sure if that will make your use case faster, but you could try, especially if you don't need all the columns from the db entity.

I tried that already. Results:

regular SelectQuery: 25888 ms for 1291644 objects
DataRowQuery alone: 14289 ms for 1291644 rows
DataRowQuery sequential instantiation: 6878 ms for 1291644 objects, sum = 21167
DataRowQuery parallel instantiation: 7351 ms for 1291644 objects, sum = 21640
DataRowQuery with iterator: 22484 ms for 1291644 objects
DataRowQuery with batch iterator of 100 each: 21219 ms for 1291644 objects

Sequential vs. parallel here means stream() vs. parallelStream(). The difference between parallel and sequential instantiation looked random from run to run.

So, all in all not that much of a difference. The DataRowQuery alone is faster of course, but once you add the instantiation, it ends up in the same ballpark as the regular SelectQuery. A bit faster, but probably not worth the additional coding, or deviating from the regular APIs.

Consistently fastest was doing the parallel fetch: DataRowQuery parallel fetch+instantiation: 19357 ms for 1291644 objects. I partitioned the fetch into 4 pieces (exprs is a list of 4 expressions), and then did:

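        // exprs: 4 expressions that partition the result set; oc: the one shared ObjectContext
        List<PDCMarketingInfo> objects = exprs.parallelStream()
                .flatMap( exp -> {
                        // fetch the raw DataRows for this partition
                        SelectQuery<DataRow> dataRowQuery = SelectQuery.dataRowQuery( PDCMarketingInfo.class, exp );
                        List<DataRow> dataRows = dataRowQuery.select( oc );
                        // instantiate an object in oc for every DataRow
                        return dataRows.parallelStream().map( row -> oc.objectFromDataRow( PDCMarketingInfo.class, row ) );
                } )
                .collect( Collectors.toList() );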

I also did this with iterator instead of dataRowQuery.select(), but that was slower.

There may be more benefit from parallelization depending on the hardware used. This was my 2013 MBP with 4 i7 cores.
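
To put the timings above side by side, a small sketch that turns them into objects per second and relative speedups. The class and method names are made up for this example; the millisecond figures are the ones quoted in this post. By this arithmetic the partitioned parallel fetch comes out at roughly 1.3 times the throughput of the regular SelectQuery.

```java
class FetchTimings {

    // Number of objects fetched in every run, as reported in the post.
    static final long OBJECTS = 1_291_644L;

    /** Objects per second for a run that took `millis` milliseconds. */
    static double perSecond(long millis) {
        return OBJECTS * 1000.0 / millis;
    }

    /** How many times faster the candidate run is than the baseline run. */
    static double speedup(long baselineMillis, long candidateMillis) {
        return (double) baselineMillis / candidateMillis;
    }

    public static void main(String[] args) {
        long regular = 25_888;                // regular SelectQuery
        long rowsThenObjects = 14_289 + 6_878; // DataRowQuery + sequential instantiation
        long parallelFetch = 19_357;          // 4-way partitioned fetch + instantiation

        System.out.printf("regular:        %.0f objects/s%n", perSecond(regular));
        System.out.printf("rows + objects: %.0f objects/s (x%.2f)%n",
                perSecond(rowsThenObjects), speedup(regular, rowsThenObjects));
        System.out.printf("parallel fetch: %.0f objects/s (x%.2f)%n",
                perSecond(parallelFetch), speedup(regular, parallelFetch));
    }
}
```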

Maik

Re: Fetching lots of objects

Aristedes Maniatis-2
On 8/3/17 6:54pm, Musall, Maik wrote:

> regular SelectQuery: 25888 ms for 1291644 objects
> DataRowQuery alone: 14289 ms for 1291644 rows
> DataRowQuery sequential instantiation: 6878 ms for 1291644 objects, sum = 21167
> DataRowQuery parallel instantiation: 7351 ms for 1291644 objects, sum = 21640
> DataRowQuery with iterator: 22484 ms for 1291644 objects
> DataRowQuery with batch iterator of 100 each: 21219 ms for 1291644 objects

What about trying the new M5 release from yesterday and its ability to select just the columns you need? You'd get a list of plain column values rather than objects, but it might be faster.

Ari



Re: Fetching lots of objects

Musall, Maik

> On 08.03.2017 at 10:56, Aristedes Maniatis <[hidden email]> wrote:
>
> On 8/3/17 6:54pm, Musall, Maik wrote:
>
>> regular SelectQuery: 25888 ms for 1291644 objects
>> DataRowQuery alone: 14289 ms for 1291644 rows
>> DataRowQuery sequential instantiation: 6878 ms for 1291644 objects, sum = 21167
>> DataRowQuery parallel instantiation: 7351 ms for 1291644 objects, sum = 21640
>> DataRowQuery with iterator: 22484 ms for 1291644 objects
>> DataRowQuery with batch iterator of 100 each: 21219 ms for 1291644 objects
>
> What about trying the new M5 release from yesterday and its ability to select just the columns you need? You'd get a list of plain column values rather than objects, but it might be faster.
>

This is M5 already (M6-SNAPSHOT really). But I need the full objects because I need to do computations on them using the business logic implemented in the DataObject class.

Maik

Re: Fetching lots of objects

John Huss
If parallel is going to have any benefit you have to be using separate
object contexts to avoid locking the same DataRow cache.

Re: Fetching lots of objects

Musall, Maik
Whoa. Parallel instantiation down to <2700 ms using multiple threads with a local ObjectContext each.
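
The shape of that approach, as a JDK-only sketch rather than the actual Cayenne code: each worker gets its own private registry and turns its slice of rows into objects without touching shared state. Row, Entity and LocalContext are hypothetical stand-ins for DataRow, the DataObject class and ObjectContext; none of this is Cayenne API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class LocalContextInstantiation {

    /** Hypothetical stand-in for a fetched row. */
    static final class Row {
        final long id; final double value;
        Row(long id, double value) { this.id = id; this.value = value; }
    }

    /** Hypothetical stand-in for a business object built from a row. */
    static final class Entity {
        final long id; final double value;
        Entity(long id, double value) { this.id = id; this.value = value; }
    }

    /** Hypothetical stand-in for a context: a private, unsynchronized registry. */
    static final class LocalContext {
        private final Map<Long, Entity> registered = new HashMap<>();
        Entity objectFromRow(Row row) {
            return registered.computeIfAbsent(row.id, id -> new Entity(id, row.value));
        }
    }

    /** Splits the rows into one slice per thread; each task uses its own LocalContext. */
    static List<Entity> instantiate(List<Row> rows, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = Math.max(1, (rows.size() + threads - 1) / threads);
            List<CompletableFuture<List<Entity>>> futures = new ArrayList<>();
            for (int start = 0; start < rows.size(); start += chunk) {
                List<Row> slice = rows.subList(start, Math.min(start + chunk, rows.size()));
                futures.add(CompletableFuture.supplyAsync(() -> {
                    LocalContext context = new LocalContext(); // one per task, never shared
                    List<Entity> out = new ArrayList<>(slice.size());
                    for (Row row : slice) out.add(context.objectFromRow(row));
                    return out;
                }, pool));
            }
            // Merge in slice order once every task has finished.
            List<Entity> all = new ArrayList<>(rows.size());
            for (CompletableFuture<List<Entity>> f : futures) all.addAll(f.join());
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        for (long i = 0; i < 100_000; i++) rows.add(new Row(i, i * 0.5));
        System.out.println(instantiate(rows, 4).size());
    }
}
```

The catch is the one described next: the resulting objects belong to different contexts, so bringing them into a single one reintroduces a shared structure.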

Well, if I need them all in the same context to work with after this, I would then need to localObject() them and be back at locking, this time against the graph manager. Dang. It would be nice if Cayenne would internally parallelize things like ObjectResolver.objectsFromDataRows() and use lock-free strategies to deal with the caching.
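
To make the caching idea concrete, here is a generic JDK sketch of a shared registry that many threads can fill without one global lock. It is not how Cayenne works internally; SharedRegistry and Entity are names invented for this example. ConcurrentHashMap is not strictly lock-free (writers lock individual bins, readers do not block), so this shows fine-grained locking rather than a true lock-free design.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.LongStream;

class SharedRegistry {

    /** Hypothetical stand-in for a registered business object. */
    static final class Entity {
        final long id;
        Entity(long id) { this.id = id; }
    }

    // One shared map; contention is limited to threads hitting the same bin.
    private final ConcurrentMap<Long, Entity> byId = new ConcurrentHashMap<>();

    /** Returns the single registered instance for this id, creating it on first use. */
    Entity resolve(long id) {
        return byId.computeIfAbsent(id, key -> new Entity(key));
    }

    int size() {
        return byId.size();
    }

    public static void main(String[] args) {
        SharedRegistry registry = new SharedRegistry();
        // Many threads resolve overlapping ids; each id still maps to exactly one object.
        LongStream.range(0, 100_000).parallel().forEach(i -> registry.resolve(i % 1000));
        System.out.println(registry.size());
    }
}
```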


Re: Fetching lots of objects

Andrus Adamchik
Hi Maik,

> On Mar 8, 2017, at 7:47 PM, Musall, Maik <[hidden email]> wrote:
>
> Well, if I need them all in the same context to work with after this, I would then need to localObject() them and be back at locking, this time against the graph manager. Dang.

Yes. Unfortunately.

> It would be nice if Cayenne would internally parallelize things like ObjectResolver.objectsFromDataRows() and use lock-free strategies to deal with the caching.

This is probably the last (and consequently the worst) place in Cayenne where locking still occurs. After I encountered this problem in a high-concurrency system, I did some analysis of it (see [1] and also [2]), and this has been my "Cayenne 5.0" plan for a long time. With 4.0 progressing as well as it is now, we may actually start contemplating it again.

Andrus


[1] https://lists.apache.org/thread.html/b3a990f94a8db3818c7f12eb433a8fef89d5e0afee653def11da1aa9@1382717376@%3Cdev.cayenne.apache.org%3E
[2] https://lists.apache.org/thread.html/bfcf79ffa521e402d080e3aafc5f0444fa0ab7d09045ec3092aee6c2@1382706785@%3Cdev.cayenne.apache.org%3E


