Here's a little proof-of-concept I whipped up, which generates a sine wave and then tries to detect its pitch! It works "sorta" well!
Since it only tests 128 different offsets over a 1-octave range (wavelengths from 127 to 255 pixels), its pitch resolution isn't very good, so it's probably not quite up to snuff as a musical tuner yet. You might improve on my methods by:
gluing together several successive audio frames into a much longer audio frame so you can crop and compare bigger windows
checking much larger ranges of offsets, maybe refining your pitch estimates by checking the correlations at 2x, 3x, etc. the best offset
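To make those two suggestions a bit more concrete, here is a rough Python sketch (plain code rather than a node composition, so treat it as an illustration only). It glues four short frames into one longer buffer, finds a coarse offset by comparing the buffer against shifted copies of itself, then looks near 4x that offset and divides by 4 to get a finer wavelength estimate. The function names, the 512-sample frame size and the 50.25-sample test wavelength are all made up for the example.

```python
import math

def mean_abs_difference(samples, lag):
    """Average |x[i] - x[i + lag]| over the region where both copies overlap."""
    n = len(samples) - lag
    return sum(abs(samples[i] - samples[i + lag]) for i in range(n)) / n

def refine_period(samples, coarse_lag, multiple):
    """Search near multiple * coarse_lag, then divide the best lag by the multiple."""
    center = coarse_lag * multiple
    candidates = range(center - multiple, center + multiple + 1)
    best = min(candidates, key=lambda lag: mean_abs_difference(samples, lag))
    return best / multiple

# Hypothetical input: four successive 512-sample frames of a sine wave whose
# wavelength (50.25 samples) falls between two whole-sample offsets.
frames = [
    [math.sin(2 * math.pi * (i + 512 * f) / 50.25) for i in range(512)]
    for f in range(4)
]

# "Gluing" the frames together is just concatenation, in order.
buffer = [sample for frame in frames for sample in frame]

# Coarse estimate: whole-sample offsets only.
coarse = min(range(40, 61), key=lambda lag: mean_abs_difference(buffer, lag))

# Finer estimate: the 4x offset resolves quarter-sample differences.
refined = refine_period(buffer, coarse, 4)
print(coarse, refined)
```

The longer buffer is what makes the 4x check possible at all: a single 512-sample frame leaves much less overlap once it has been shifted by around 200 samples.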
Here's a weird suggestion which might yield some results:
You can use the Make Image From Channels node and some image processing to implement an auto-correlation algorithm. Here's roughly how that might work.
If the input waveform is very periodic, then its peaks and troughs will be spaced fairly regularly in time. When you feed a periodic waveform into the Make Image From Channels node, the output image will show evenly-spaced bright bars, and it's going to look very similar to a horizontally-shifted copy of itself.
The critical question is, how far horizontally shifted? Well, that may require trial and error. Try shifting the image over by some offset, say 50 pixels, cropping appropriately, and combining it with the unshifted version using a "Difference" blending mode. If the peaks and troughs of this waveform have lined up very well with its shifted copy, then they should cancel out and the output of "Difference" should be a very dark image, which you can test using the Sample Color From Image node. If they don't line up, then the output image will be brighter.
So, you can rig up a "build list" loop which tries this comparison for every offset within some range under consideration (50 pixels, then 51, then 52, up to 100 maybe) and records the brightness of each resulting image. The darkest images indicate the offsets with the highest degree of autocorrelation. A little bit of math will transform those offset distances into wavelengths, which in turn map to audio frequencies. With one pixel per sample at 48000 samples per second, 50 pixels represent 50/48000 of a second, so a 50-pixel correlation corresponds to 48000 / 50 = 960 Hz.
(Of course, if the autocorrelation is strong at an offset which is 1x the wavelength of the musical pitch, it will also be strong at 2x, 3x, 4x, etc. Some work to detect these cases might be necessary.)
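Outside of a node graph, the same shift-and-compare idea can be sketched in a few lines of Python. This is only an illustration under some assumptions: the list of samples stands in for one row of the image, the average absolute difference stands in for the brightness of the "Difference" blend, and the function names and the 40 to 60 search range are invented for the example.

```python
import math

SAMPLE_RATE = 48000  # samples per second; assumes one pixel per audio sample

def make_sine(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Generate a test tone, standing in for the incoming audio frame."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]

def mean_abs_difference(samples, lag):
    """Brightness of the 'Difference' image: average |x[i] - x[i + lag]|
    over the cropped region where the original and shifted copies overlap."""
    n = len(samples) - lag
    return sum(abs(samples[i] - samples[i + lag]) for i in range(n)) / n

def best_lag(samples, min_lag, max_lag):
    """Try every offset in the range and return the darkest one."""
    scores = {lag: mean_abs_difference(samples, lag)
              for lag in range(min_lag, max_lag + 1)}
    return min(scores, key=scores.get)

def lag_to_frequency(lag, sample_rate=SAMPLE_RATE):
    """An offset of `lag` samples is lag / sample_rate seconds per cycle."""
    return sample_rate / lag

wave = make_sine(960.0, 1024)
lag = best_lag(wave, 40, 60)
print(lag, lag_to_frequency(lag))  # prints: 50 960.0
```

Keeping the search range narrower than one octave (here 40 to 60) sidesteps the 2x, 3x matches; a wider range would need an extra step, such as preferring the smallest offset among the near-darkest results.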
One more example of a video-feedback rendering system.
A camera, slowly rotating, is in the center of a cube; the image captured by the camera is pushed through a few different colour transformations, and the results of those are painted onto the walls of the cube in the next frame. Zoomy rainbow trips ensue.
A "Combine image with feedback" node is included in the feedback loop, which blends the new frame with the previous one at a specified opacity. This arrests some intense strobing effects which would otherwise occur, and lends the image a softer, dreamier quality.
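For anyone curious why the blend calms the strobing, here is a tiny Python sketch. It assumes the blend is a plain weighted average, with the opacity applied to the new frame; that is my guess at the behaviour, not a description of the node's internals. Each frame is a one-pixel list just to keep the numbers readable.

```python
def blend(new_frame, previous_output, opacity):
    """Weighted average of the new frame and the previous output.
    Assumption: opacity is the weight given to the new frame."""
    return [opacity * n + (1.0 - opacity) * p
            for n, p in zip(new_frame, previous_output)]

def run_feedback(frames, opacity):
    """Feed each frame through the blend, carrying the output forward."""
    output = frames[0]
    history = [output]
    for frame in frames[1:]:
        output = blend(frame, output, opacity)
        history.append(output)
    return history

# A worst-case strobe: one pixel flipping between white and black every frame.
strobe = [[1.0], [0.0]] * 10
smoothed = run_feedback(strobe, 0.5)

# Frame-to-frame swing once things settle down (the raw input swings by 1.0).
swing = abs(smoothed[-1][0] - smoothed[-2][0])
print(round(swing, 3))
```

At 50% opacity the full black-to-white flip settles into a swing of about a third of the range, which fits the softer look described above.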