Dramatic slowdown to draw invisible galaxies

General discussion about Celestia that doesn't fit into other forums.
Topic author
jogad
Posts: 458
Joined: 17.09.2008
With us: 16 years 2 months
Location: Paris France

Dramatic slowdown to draw invisible galaxies

Post #1by jogad » 03.06.2009, 16:27

Hello,

A mere field of stars. The Milky Way is not visible from this point of view, and the few galaxies in this region are too faint to be seen.
fps497.jpg

fps205.jpg

You can see that there is no visible difference between the two pictures, but in the second one galaxy display is enabled and the drop in FPS is very noticeable (from 497 down to 205).

Of course, this does not matter much when we have a fast computer and there are only stars on the screen. :lol:

But this also occurs when looking at a planetary surface, even if no sky is visible at all. :cry:
And with a big add-on, even on a fast computer, every frame per second is welcome. :wink:

Does this happen only on my system, or are there graphics card settings I can adjust to minimize this behavior :?:

t00fri
Developer
Posts: 8772
Joined: 29.03.2002
Age: 22
With us: 22 years 7 months
Location: Hamburg, Germany

Re: Dramatic slowdown to draw invisible galaxies

Post #2by t00fri » 03.06.2009, 18:52

Jogad,

Unlike the case where galaxy rendering is globally disabled, when galaxy rendering is enabled each of the 10000+ galaxies must be checked during the rendering loop to decide whether or not it is visible in the given field. Of course, there is efficient octree culling based on the apparent magnitude of the galaxies, but this unavoidable check is still bound to use some CPU time... Obviously, you can only tell whether a galaxy is visible once you have calculated its position relative to the observer, ...
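Here is a minimal, self-contained sketch (not actual Celestia code; all names are illustrative) of why that check costs CPU time even when nothing ends up on screen: every galaxy that survives the octree cull still needs its observer-relative distance and apparent magnitude computed before it can be rejected as invisible.

Code: Select all

// Minimal sketch (not actual Celestia code; names are illustrative) of the
// per-frame cost of having galaxy rendering enabled: every galaxy that
// survives the octree cull still needs its observer-relative distance and
// apparent magnitude computed before it can be rejected as invisible.
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

struct Galaxy
{
    Vec3  position;   // heliocentric position in light-years
    float absMag;     // absolute magnitude
};

static double distanceLy(const Vec3& a, const Vec3& b)
{
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Apparent magnitude from absolute magnitude and distance in light-years.
static float apparentMag(float absMag, double distLy)
{
    const double LY_PER_PARSEC = 3.26156;
    return absMag + 5.0f * static_cast<float>(std::log10(distLy / LY_PER_PARSEC) - 1.0);
}

// 'candidates' stands in for the galaxies left over after octree culling;
// only the ones below the limiting magnitude would be handed to the renderer.
int countVisibleGalaxies(const std::vector<Galaxy>& candidates,
                         const Vec3& observer, float faintestMag)
{
    int visible = 0;
    for (const Galaxy& g : candidates)
    {
        double d = distanceLy(g.position, observer);   // unavoidable CPU work per galaxy
        if (apparentMag(g.absMag, d) <= faintestMag)
            ++visible;                                  // would actually be drawn
    }
    return visible;
}

With galaxy rendering disabled, none of this runs at all, which is why the FPS difference shows up even when nothing galaxy-related is actually drawn.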

Fridger
Last edited by t00fri on 04.06.2009, 07:59, edited 1 time in total.

Topic author
jogad
Posts: 458
Joined: 17.09.2008
With us: 16 years 2 months
Location: Paris France

Re: Dramatic slowdown to draw invisible galaxies

Post #3by jogad » 03.06.2009, 20:21

Fridger,

Very clear and concise answer.
I used to think it was the graphics card that limited the speed of graphics programs, not the calculations done by the processor.
I'll have to change my point of view on that a little.
Thank you very much. :)

Chuft-Captain
Posts: 1779
Joined: 18.12.2005
With us: 18 years 11 months

Re: Dramatic slowdown to draw invisible galaxies

Post #4by Chuft-Captain » 03.06.2009, 22:26

jogad wrote:Fridger,

Very clear and concise answer.
I used to think it was the graphics card that limited the speed of graphics programs, not the calculations done by the processor.
I'll have to change my point of view on that a little.
Thank you very much. :)
Actually, it's not as simple as that. It very much depends on the software and the circumstances. Celestia tends to be very CPU intensive because it does a lot of calculations (determining the positions of objects, whether they need to be rendered, and so on) before passing the scene over to the GPU to be displayed.
The balance of how much work is done by the CPU versus the GPU varies from one program to the next depending on how they are coded and optimized, and even within Celestia itself depending on what type of task it is undertaking at a given time.
However, the speed of your graphics card is still the limiting factor in how many polygons (i.e. how complex a scene) can be rendered in a given time slice.

The balance is beginning to shift again with CUDA enabled chips becoming more commonplace, which gives developers the opportunity to shift some of the calculations the CPU currently does onto the GPU. Theoretically, this should improve the performance of a CPU-intensive program like Celestia (assuming the developers choose to take advantage of CUDA features). I don't know what Chris's plans are in this regard, but I would guess he won't want to make any major changes until a C++ API is available for CUDA.

GPUs typically have much faster memory than CPUs, together with massively parallel architectures optimized for repetitive graphics tasks (whereas CPUs tend to have more general-purpose architectures).

CC
"Is a planetary surface the right place for an expanding technological civilization?"
-- Gerard K. O'Neill (1969)


t00fri
Developer
Posts: 8772
Joined: 29.03.2002
Age: 22
With us: 22 years 7 months
Location: Hamburg, Germany

Re: Dramatic slowdown to draw invisible galaxies

Post #5by t00fri » 04.06.2009, 06:29

Chuft-Captain wrote:
jogad wrote:Fridger,

Very clear and concise answer.
I used to think it was the graphics card that limited the speed of graphics programs, not the calculations done by the processor.
I'll have to change my point of view on that a little.
Thank you very much. :)
Actually, it's not as simple as that. It very much depends on the software and the circumstances. Celestia tends to be very CPU intensive because it does a lot of calculations (determining the positions of objects, whether they need to be rendered, and so on) before passing the scene over to the GPU to be displayed.
The balance of how much work is done by the CPU versus the GPU varies from one program to the next depending on how they are coded and optimized, and even within Celestia itself depending on what type of task it is undertaking at a given time.
However, the speed of your graphics card is still the limiting factor in how many polygons (i.e. how complex a scene) can be rendered in a given time slice.

The balance is beginning to shift again with CUDA enabled chips becoming more commonplace, which gives developers the opportunity to shift some of the calculations the CPU currently does onto the GPU. Theoretically, this should improve the performance of a CPU-intensive program like Celestia (assuming the developers choose to take advantage of CUDA features). I don't know what Chris's plans are in this regard, but I would guess he won't want to make any major changes until a C++ API is available for CUDA.

GPUs typically have much faster memory than CPUs, together with massively parallel architectures optimized for repetitive graphics tasks (whereas CPUs tend to have more general-purpose architectures).

CC

CC,

CC wrote:Actually, it's not as simple as that.
I think I know pretty well what is going on here, since the entire galaxy code package of Celestia was written by myself in collaboration with Toti. Of course, in my argument above I was trying to concentrate on the main effects to keep things transparent. An important, work-intensive part of implementing the 10000+ galaxies was precisely to explore all sorts of culling possibilities as far as possible! This involved many runs studying the various sources of delay individually...

CC wrote:The balance is beginning to shift again with CUDA enabled chips becoming more commonplace, which gives developers the opportunity to shift some of the calculations the CPU currently does onto the GPU.
Since my F-TexTools / nmtools 2.x support GPU-based DXT compression via a CUDA API for C++, I have already gathered practical experience with GPU-based calculations. I think I am pretty well aware of the present CUDA-based possibilities.
See also my long thread about this matter in CelestialMatters (with lots of interaction with cartrite!) http://forum.celestialmatters.org/viewt ... 73&start=0

In general, I cannot see how you imagine using a single GPU both for doing intensive calculations in real time and, at the same time, for ambitious rendering tasks like Celestia's. In more professional SLI-based applications, one normally reserves one GPU for computing tasks (CUDA) and one for rendering. But this will certainly not be the dominant scenario for Celestia ;-) .

So far, most CUDA applications with a single GPU use it when the GPU is NOT busy with rendering... typically as with my F-TexTools, where thousands of DXT compression tasks for the VT tiles are performed at times when the GPU has plenty of resources available.

Fridger

Chuft-Captain
Posts: 1779
Joined: 18.12.2005
With us: 18 years 11 months

Re: Dramatic slowdown to draw invisible galaxies

Post #6by Chuft-Captain » 04.06.2009, 08:58

t00fri wrote:CC,

CC wrote:Actually, it's not as simple as that.
I think I know pretty well what is going on here, since the entire galaxy code package of Celestia was written by myself in collaboration with Toti. Of course, in my argument above I was trying to concentrate on the main effects to keep things transparent. An important, work-intensive part of implementing the 10000+ galaxies was precisely to explore all sorts of culling possibilities as far as possible! This involved many runs studying the various sources of delay individually...
Yes, of course I realize this.
You may have taken my comment as a criticism of your explanation. It was not; rather, it was in response to Jogad's rather straight-forward conclusion, just to say to him that it's never quite that straight-forward (i.e. the fundamental rules still apply :) ).

t00fri wrote:
CC wrote:The balance is beginning to shift again with CUDA enabled chips becoming more commonplace, which gives developers the opportunity to shift some of the calculations the CPU currently does onto the GPU.
Since my F-TexTools / nmtools 2.x support GPU-based DXT compression via a CUDA API for C++, I have already gathered practical experience with GPU-based calculations. I think I am pretty well aware of the present CUDA-based possibilities.
See also my long thread about this matter in CelestialMatters (with lots of interaction with cartrite!) http://forum.celestialmatters.org/viewt ... 73&start=0

In general, I cannot see how you imagine using a single GPU both for doing intensive calculations in real time and, at the same time, for ambitious rendering tasks like Celestia's. In more professional SLI-based applications, one normally reserves one GPU for computing tasks (CUDA) and one for rendering. But this will certainly not be the dominant scenario for Celestia ;-) .

So far, most CUDA applications with a single GPU use it when the GPU is NOT busy with rendering... typically as with my F-TexTools, where thousands of DXT compression tasks for the VT tiles are performed at times when the GPU has plenty of resources available.

Fridger
I agree that offline, repetitive image-processing tasks like your tools are good candidates for CUDA processing. However, as I understand it, the gaming market (real-time rendering) is one of Nvidia's main consumer markets for CUDA-enabled GPUs (according to Nvidia's own marketing), and because their API hides the details of the hardware, it's a safe bet that they will add features to future architectures intended to benefit real-time applications as well as brute-force offline image processing.

Notwithstanding this, I seriously doubt that Celestia's real-time engine will become CUDA-enabled any time soon, for many reasons, not least the fact that the benefits would only be available to those with Nvidia hardware. :wink:

CC
"Is a planetary surface the right place for an expanding technological civilization?"
-- Gerard K. O'Neill (1969)


t00fri
Developer
Posts: 8772
Joined: 29.03.2002
Age: 22
With us: 22 years 7 months
Location: Hamburg, Germany

Re: Dramatic slowdown to draw invisible galaxies

Post #7by t00fri » 04.06.2009, 09:35

CC wrote:I agree that offline, repetitive image-processing tasks like your tools are good candidates for CUDA processing. However, as I understand it, the gaming market (real-time rendering) is one of Nvidia's main consumer markets for CUDA-enabled GPUs (according to Nvidia's own marketing), and because their API hides the details of the hardware, it's a safe bet that they will add features to future architectures intended to benefit real-time applications as well as brute-force offline image processing.

Notwithstanding this, I seriously doubt that Celestia's real-time engine will become CUDA-enabled any time soon, for many reasons, not least the fact that the benefits would only be available to those with Nvidia hardware. :wink:

CC

I think the crucial point here is the amount of CUDA-based GPU calculation that has to be done alongside the rendering load. CUDA and GPUs are presently used a lot in demanding scientific calculations (e.g. cosmological simulations of the Universe...) where only SLI-type multi-GPU arrangements make sense.

In contrast, for gaming I imagine the GPU is used mostly for quite small but highly parallelized calculations, e.g. for real-time fog and the like, alongside the actual rendering task...

Fridger

Chuft-Captain
Posts: 1779
Joined: 18.12.2005
With us: 18 years 11 months

Re: Dramatic slowdown to draw invisible galaxies

Post #8by Chuft-Captain » 04.06.2009, 10:03

t00fri wrote:
CC wrote:I agree that offline repetitive image-processing tasks like your tools are good candidates for CUDA processing, however as I understand it the gaming market (real-time rendering) is one of Nvidias main consumer markets for CUDA enabled GPUs (according to Nvidia's own marketing), and because their API hides the details of the hardware, it's a no-brainer that they will add features into future architectures which no doubt will be intended to benefit real-time applications as well as brute off-line image processing.

Notwithstanding this, I seriously doubt that Celestia's real-time engine will become CUDA enabled any time soon for many reasons, not least the fact that the benefits would only be available to those with Nvidia hardware. :wink:

CC

I think the crucial point here is the amount of CUDA-based GPU calculation that has to be done alongside the rendering load. CUDA and GPUs are presently used a lot in demanding scientific calculations (e.g. cosmological simulations of the Universe...) where only SLI-type multi-GPU arrangements make sense.

In contrast, for gaming I imagine the GPU is used mostly for quite small but highly parallelized calculations, e.g. for real-time fog and the like, alongside the actual rendering task...

Fridger
Yes. You've hit the nail on the head.
It's the massively parallel, many-core computational capability, and the fact that the management of threads is largely hidden by the API rather than being the responsibility of the application developer, that can give benefits even in a non-SLI situation.

The caveat (as always) is that leveraging this capability is still down to the skill of the developer, which is why I used the phrase "Theoretically, this should improve the performance...". :wink:
The key thing, however, is that the API appears (if you believe the marketing) to make leveraging the benefits of parallel processing a lot easier than in the past.
I must admit, however, that I haven't seen the API, and even if I had, I don't really have the expertise in this area to judge how good it is.

CC
"Is a planetary surface the right place for an expanding technological civilization?"
-- Gerard K. O'Neill (1969)


Topic author
jogad
Posts: 458
Joined: 17.09.2008
With us: 16 years 2 months
Location: Paris France

Re: Dramatic slowdown to draw invisible galaxies

Post #9by jogad » 05.06.2009, 17:38

Hi,

CC wrote:Actually, it's not as simple as that.
CC wrote:it was in response to Jogad's rather straight-forward conclusion.
Indeed, that seemed clear to me: no graphics rendering for the galaxies, so only the CPU was responsible for the delay.
Sorry. I am not as smart as I would like to be. :oops:

I hope not to exhaust your patience with my naive questions but there is a point that I don't understand very well. :evil:

I suppose that the problem is more or less the same for stars.
As with the galaxies, we can decide whether the stars are visible or not from their positions and their absolute magnitudes.

We are still in a place where the galaxies are invisible.
Under these conditions, why is it so much faster to compute and display the stars than merely to compute whether the galaxies are visible or not, without having to render them?

After all, there are more than 100000 stars and only a little more than 10000 galaxies. :!:

Chuft-Captain
Posts: 1779
Joined: 18.12.2005
With us: 18 years 11 months

Re: Dramatic slowdown to draw invisible galaxies

Post #10by Chuft-Captain » 05.06.2009, 19:05

jogad wrote:Hi,

CC wrote:Actually, it's not as simple as that.
CC wrote:it was in response to Jogad's rather straight-forward conclusion.
Indeed, that seemed clear to me: no graphics rendering for the galaxies, so only the CPU was responsible for the delay.
Sorry. I am not as smart as I would like to be. :oops:

I hope not to exhaust your patience with my naive questions but there is a point that I don't understand very well. :evil:
I wasn't suggesting your conclusion was wrong... just that the answer is not quite so straightforward, because there are many factors that can affect performance.

jogad wrote:I suppose that the problem is more or less the same for stars.
As with the galaxies, we can decide whether the stars are visible or not from their positions and their absolute magnitudes.

We are still in a place where the galaxies are invisible.
Under these conditions, why is it so much faster to compute and display the stars than merely to compute whether the galaxies are visible or not, without having to render them?

After all, there are more than 100000 stars and only a little more than 10000 galaxies. :!:
Fridger can probably give a definitive answer, but my guess is that it's because most of the time stars are rendered just as colored dots of a few pixels at most (a very cheap and fast operation), whereas galaxies have a more complex, extended structure which is more "expensive" and time-consuming to render (hence the need for the culling Fridger referred to earlier).
So it's simply that stars are much faster to draw than galaxies, and the difference in performance between rendering and not rendering stars is negligible (or at least less noticeable).
So the problem is not the same for galaxies and stars.
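To make that asymmetry concrete, here is a rough sketch (not actual Celestia code; the structures and numbers are invented) of the difference in per-object drawing cost: a visible star contributes a single point vertex, while a visible galaxy expands into hundreds or thousands of sprites, each needing its own transform and brightness weighting on the CPU.

Code: Select all

// Rough sketch (not actual Celestia code; structures and numbers are made up)
// of the asymmetry in drawing cost: a visible star is one cheap point vertex,
// while a visible galaxy expands into hundreds or thousands of sprites, each
// needing its own transform and brightness weighting.
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

struct PointVertex { Vec3 position; float size; };

struct GalaxyBlob { Vec3 offset; float brightness; };   // one "blob" of a galaxy template

// Stars: one vertex appended to a batched point array.
void submitStar(std::vector<PointVertex>& batch, const Vec3& pos, float appMag)
{
    float size = 1.0f + 0.5f * (6.0f - appMag);   // brighter star, slightly bigger dot
    batch.push_back({pos, size});
}

// Galaxies: every blob must be scaled, offset and brightness-weighted
// before it can be handed to the GPU as a sprite.
std::size_t submitGalaxy(std::vector<PointVertex>& batch,
                         const std::vector<GalaxyBlob>& blobs,
                         const Vec3& center, float scale, float brightness)
{
    for (const GalaxyBlob& b : blobs)
    {
        Vec3 p { center.x + scale * b.offset.x,
                 center.y + scale * b.offset.y,
                 center.z + scale * b.offset.z };
        batch.push_back({p, 4.0f * brightness * b.brightness});
    }
    return blobs.size();   // typically hundreds to thousands of sprites per galaxy
}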

jogad wrote:Under these conditions, why is it so much faster to compute and display the stars than merely to compute whether the galaxies are visible or not, without having to render them?
I'm not sure that is exactly what Fridger was saying. I would imagine that checking to see what's in the FOV is probably quite quick. The expensive part, I imagine, is almost certainly the calculations needed to determine the shape and other features of each galaxy, and then drawing them.
If galaxy rendering is actually switched off, then of course we can skip the visibility checks as well. :wink:

If you're really interested and are comfortable with the Visual Studio IDE, there are tools available that make it possible for you to analyze how much processor time is being used by different parts of the code. (You wouldn't need a good understanding of C++ to do this, just a basic understanding of the structure of the program... in fact this would actually give you more insight into how it all works.)
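If a full profiler feels like too much, a cruder alternative is to time individual passes by hand with the standard library. A minimal sketch follows; the two render functions are placeholders, not real Celestia entry points.

Code: Select all

// Minimal sketch of hand-rolled timing with the standard library, as a crude
// alternative to a profiler.  The two render functions are placeholders, not
// real Celestia entry points.
#include <chrono>
#include <cstdio>

template <typename F>
double elapsedMs(F&& work)
{
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

void renderStarsPass()    { /* placeholder: star rendering would go here   */ }
void renderGalaxiesPass() { /* placeholder: galaxy rendering would go here */ }

int main()
{
    double starsMs    = elapsedMs(renderStarsPass);
    double galaxiesMs = elapsedMs(renderGalaxiesPass);
    std::printf("stars: %.3f ms, galaxies: %.3f ms per frame\n", starsMs, galaxiesMs);
    return 0;
}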

CC
"Is a planetary surface the right place for an expanding technological civilization?"
-- Gerard K. O'Neill (1969)


Topic author
jogad
Posts: 458
Joined: 17.09.2008
With us: 16 years 2 months
Location: Paris France

Re: Dramatic slowdown to draw invisible galaxies

Post #11by jogad » 05.06.2009, 20:01

Chuft-Captain,
Chuft-Captain wrote:I wasn't suggesting your conclusion was wrong
Even if I am wrong, it is not a problem, and you are welcome to say so. :wink:

Chuft-Captain wrote:my guess is that it's because most of the time stars are rendered just as colored dots of only a few pixels at most (a very cheap and fast operation), whereas galaxies have a more complex extended structure which is more "expensive" and time-consuming to render
Yes. But I chose as an example a field of stars where no galaxies need to be rendered. :!:
So:
- On one side, we only have to calculate whether the galaxies are visible or not, and this must be done for ~10000 galaxies. No rendering to do.
- On the other side, we have ~100000 stars to test, and some of them must be rendered.

And surprisingly, the second process is much faster than the first one.

I am not at all comfortable with Visual Studio and not able to investigate the code for that. :oops:
I am just curious and interested to know why, and whether an improvement might be possible.

Regards

chris
Site Admin
Posts: 4211
Joined: 28.01.2002
With us: 22 years 9 months
Location: Seattle, Washington, USA

Re: Dramatic slowdown to draw invisible galaxies

Post #12by chris » 05.06.2009, 22:43

jogad wrote:Chuft-Captain,
Chuft-Captain wrote:I wasn't suggesting your conclusion was wrong
Even if I am wrong, it is not a problem, and you are welcome to say so. :wink:

Chuft-Captain wrote:my guess is that it's because most of the time stars are rendered just as colored dots of only a few pixels at most (a very cheap and fast operation), whereas galaxies have a more complex extended structure which is more "expensive" and time-consuming to render
Yes. But I chose as an example a field of stars where no galaxies need to be rendered. :!:
So:
- On one side, we only have to calculate whether the galaxies are visible or not, and this must be done for ~10000 galaxies. No rendering to do.
- On the other side, we have ~100000 stars to test, and some of them must be rendered.

And surprisingly, the second process is much faster than the first one.

I am not at all comfortable with Visual Studio and not able to investigate the code for that. :oops:
I am just curious and interested to know why, and whether an improvement might be possible.

Jogad,

Your curiosity about the performance of galaxy rendering is reasonable, and I think that it's likely that some improvement is possible. The basic algorithm is an appropriate one, I think, but there are some parameters that could be tuned. For example, the maximum number of galaxies per octree node could be adjusted. I know that Fridger and Toti have already done some tuning of the octree traversals, so there might not be much extra performance to be obtained by simple parameter tweaking.

The data set itself may be partly at fault for the deep sky object octree not operating as efficiently as the star octree. The size of galaxies relative to the typical intergalactic distance is much larger than the star size relative to interstellar distances. This will tend to make the octree less efficient as galaxies that straddle more than one octree node have to be placed in the larger parent node. The closer a node is to the tree root, the less likely it is to get culled during octree traversal.
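A simplified sketch of that effect follows (this is not Celestia's actual octree code; the structure and names are illustrative): any object whose radius straddles a child-node boundary is kept at the current level, so physically large galaxies accumulate near the root, where nodes are almost never culled.

Code: Select all

// Simplified sketch (not Celestia's actual octree; names are illustrative) of
// why large objects hurt culling: anything whose radius straddles a child-node
// boundary is kept at the current level, so big galaxies pile up near the
// root, where nodes are almost never culled for an observer inside the data set.
#include <array>
#include <cmath>
#include <memory>
#include <vector>

struct Object { float x, y, z, radius; };

struct OctreeNode
{
    float cx, cy, cz, halfSize;                      // cube center and half-width
    std::vector<Object> objects;                     // objects stored at this level
    std::array<std::unique_ptr<OctreeNode>, 8> children;

    void insert(const Object& obj, int maxDepth)
    {
        // The object fits entirely inside one child octant only if its bounding
        // sphere does not cross any of the three splitting planes.
        bool fitsInChild =
            maxDepth > 0 &&
            std::fabs(obj.x - cx) > obj.radius &&
            std::fabs(obj.y - cy) > obj.radius &&
            std::fabs(obj.z - cz) > obj.radius;

        if (!fitsInChild)
        {
            objects.push_back(obj);                  // straddlers stay near the root
            return;
        }

        int idx = (obj.x > cx ? 1 : 0) | (obj.y > cy ? 2 : 0) | (obj.z > cz ? 4 : 0);
        if (!children[idx])
        {
            float h = halfSize * 0.5f;
            children[idx].reset(new OctreeNode{
                cx + (obj.x > cx ? h : -h),
                cy + (obj.y > cy ? h : -h),
                cz + (obj.z > cz ? h : -h),
                h, {}, {} });
        }
        children[idx]->insert(obj, maxDepth - 1);
    }
};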

--Chris

chris
Site Admin
Posts: 4211
Joined: 28.01.2002
With us: 22 years 9 months
Location: Seattle, Washington, USA

Re: Dramatic slowdown to draw invisible galaxies

Post #13by chris » 06.06.2009, 06:45

I did some tests and found that the deep sky object octree traversal is only a small part of the slowdown. The main cause is the fact that the Milky Way is being drawn even when the brightest parts of it are out of view. This is not a bug--it's expected behavior when the viewer is inside a galaxy's bounding sphere.

However, the galaxy drawing code is not optimized. Galaxies are composed of thousands of 'sprites', and quite a few calculations are performed in order to draw each one of them. I optimized a few things and was able to boost the frame rate of a 'no visible galaxies' scene from 157 to 185 fps. There are some more involved optimizations that could be used to get more performance, but ultimately we'd be best off writing a custom vertex shader to offload the per-sprite calculations to the GPU.

Fridger: I can send you my performance tweaks if you're interested. Nothing too exciting--just moving some calculations out of the inner loop, and using the blob's camera space z-coordinate instead of the distance to the viewer.
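As an illustration only (this is not Chris's actual patch, and the structures are invented), the kind of tweak described above might look like this: loop-invariant factors are hoisted out of the per-sprite loop, and the blob's camera-space depth replaces a per-blob square root.

Code: Select all

// Illustration only (not the actual patch; structures are invented) of the
// two kinds of tweak mentioned above: hoisting loop-invariant factors out of
// the per-sprite loop, and using the blob's camera-space depth instead of its
// true distance to the viewer (so no square root per blob).
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };                    // camera space, z along the view direction

struct Blob { Vec3 camPos; float brightness; };

// Before: invariant math repeated and a sqrt evaluated for every blob.
void computeSpriteSizesSlow(const std::vector<Blob>& blobs,
                            float minSize, float galaxyBrightness,
                            std::vector<float>& sizes)
{
    for (const Blob& b : blobs)
    {
        float dist = std::sqrt(b.camPos.x * b.camPos.x +
                               b.camPos.y * b.camPos.y +
                               b.camPos.z * b.camPos.z);           // per-blob sqrt
        sizes.push_back(minSize * galaxyBrightness * 0.5f * dist); // invariants recomputed
    }
}

// After: invariants hoisted, camera-space z used as a cheap stand-in for the
// distance (a good approximation for blobs near the view axis).
void computeSpriteSizesFast(const std::vector<Blob>& blobs,
                            float minSize, float galaxyBrightness,
                            std::vector<float>& sizes)
{
    const float sizeFactor = minSize * galaxyBrightness * 0.5f;    // hoisted out of the loop
    for (const Blob& b : blobs)
        sizes.push_back(sizeFactor * b.camPos.z);                  // no sqrt needed
}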

--Chris

t00fri
Developer
Posts: 8772
Joined: 29.03.2002
Age: 22
With us: 22 years 7 months
Location: Hamburg, Germany

Re: Dramatic slowdown to draw invisible galaxies

Post #14by t00fri » 06.06.2009, 12:34

Chris,

chris wrote:I did some tests and found that the deep sky object octree traversal is only a small part of the slowdown. The main cause is the fact that the Milky Way is being drawn even when the brightest parts of it are out of view. This is not a bug--it's expected behavior when the viewer is inside a galaxy's bounding sphere.

I have actually been aware for quite some time of the special culling problem in the case of the Milky Way (relative to galaxy rendering = OFF), along with the frequent case of the observer being located inside the MW boundary. A few days ago I checked explicitly that the difference between galaxy rendering being ON and OFF almost vanishes after just commenting out the MilkyWay entry in galaxies.dsc. I wanted to write about this, but finally didn't, since I was VERY busy throughout the past days (house renovations ;-) etc). Anyway, we certainly agree here.

Due to the huge apparent size of the Milky Way boundary for a typical "inside" observer near Earth, our galaxy is rendered practically always, unless galaxy rendering is explicitly turned off. The rendering and culling issues for faraway galaxies are of a quite different nature, however.
chris wrote:However, the galaxy drawing code is not optimized. Galaxies are composed of thousands of 'sprites', and quite a few calculations are performed in order to draw each one of them. I optimized a few things and was able to boost the frame rate of a 'no visible galaxies' scene from 157 to 185 fps. There are some more involved optimizations that could be used to get more performance, but ultimately we'd be best off writing a custom vertex shader to offload the per-sprite calculations to the GPU.

Your ~ 15% performance improvement does not look too impressive, I would say.
I could tell you about various critical parameters where relatively small changes would amount to quite a bit more (for the worse ;-) ). With Toti, I systematically examined and logged such "sensitivities"...

Incidentally, I noted some time ago that the overall rendering performance for galaxies has decreased substantially relative to the benchmark rates I logged years ago, when our galaxy package was integrated into Celestia. Several people have made small changes to galaxy.cpp, render.cpp and the octree parameters over time, possibly without doing sufficient benchmarking afterwards and without knowing about all the delicate performance balancing that I had done. I am now unable to trace the precise origin of the deterioration. I already wrote about this some time ago.

chris wrote:Fridger: I can send you my performance tweaks if you're interested. Nothing too exciting--just moving some calculations out of the inner loop, and using the blob's camera space z-coordinate instead of the distance to the viewer.

--Chris

Yes, please send me your tweaks. There are various connected and not very conspicuous issues that need to be carefully examined, even in the case of small changes to the galaxy code.

I think it's fair to say that, altogether, the performance of our rendering code for the 10000+ galaxies is pretty good despite the specific Milky Way issue.

If sufficiently motivated, one could always add further, more involved Milky Way-specific constraints to the code in order to skip Milky Way rendering when, for example, the MW core remains out of view... Since Earth is pretty close to the galactic plane, simply flattening the bounding ellipsoid wouldn't be very effective, though.
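Purely as a speculative sketch of such a constraint (nothing like this exists in the code; all names and thresholds are illustrative), one could test whether the bright core's angular disc overlaps the field of view before drawing the Milky Way for an observer inside its bounding sphere:

Code: Select all

// Speculative sketch of such a constraint (nothing like this exists in the
// code; all names and thresholds are illustrative): for an observer inside the
// Milky Way's bounding sphere, draw it only if the bright core's angular disc
// overlaps the field of view.
#include <cmath>

struct Vec3 { double x, y, z; };

static double length(const Vec3& v)
{
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

static double angleBetween(const Vec3& a, const Vec3& b)
{
    double dot = a.x * b.x + a.y * b.y + a.z * b.z;
    return std::acos(dot / (length(a) * length(b)));
}

bool shouldDrawMilkyWay(const Vec3& observer, const Vec3& viewDir,
                        const Vec3& coreCenter, double coreRadius, double halfFov)
{
    Vec3 toCore { coreCenter.x - observer.x,
                  coreCenter.y - observer.y,
                  coreCenter.z - observer.z };
    double dist = length(toCore);
    if (dist <= coreRadius)
        return true;                                   // observer inside the core itself
    double coreAngularRadius = std::asin(coreRadius / dist);
    // Draw only if the core's angular disc overlaps the view cone.
    return angleBetween(viewDir, toCore) < halfFov + coreAngularRadius;
}

Of course, such a check would only be an approximation, since the fainter outer disc can still contribute to the view; whether that trade-off is acceptable is exactly the kind of balancing the galaxy code needs anyway.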

Fridger

