-
yanmaani
hyc: I wasn't accusing you of anything, but it wasn't in design.md so...
-
yanmaani
Has anyone tried to run a RandomX program through Clang/GCC, see what they find?
-
Quarky93TongWu[m
So TSMC 5nm is capable of 32Mb/mm^2 SRAM (with control logic); a chip could be made with hundreds of RandomX instances. It would be expensive of course, but Bitmain already uses 5nm and is probably taping out 3nm as we speak (for Bitcoin of course).
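As a rough sanity check of the figures in this claim, a sketch using only the numbers quoted in the chat (32 Mb/mm^2 density, 2 MB scratchpad per RandomX instance); the die size is an assumption for illustration:

```cpp
// Back-of-the-envelope check of the "hundreds of instances" claim,
// using the figures quoted in this chat. The 600 mm^2 die is an assumption.
#include <cstdio>

int main() {
    const double sram_density = 32.0;        // Mbit per mm^2, incl. control logic
    const double scratchpad   = 2.0 * 8.0;   // 2 MB per RandomX instance = 16 Mbit
    const double mm2_per_inst = scratchpad / sram_density;  // ~0.5 mm^2 of SRAM
    const double die_mm2      = 600.0;       // large-GPU-class die (assumption)
    printf("SRAM per instance: %.2f mm^2\n", mm2_per_inst);
    printf("Scratchpads per %.0f mm^2 die: %.0f\n", die_mm2, die_mm2 / mm2_per_inst);
    // ~1200 scratchpads of raw SRAM; after cores, interconnect and memory
    // controllers, "hundreds of instances" per die is the right order of magnitude.
    return 0;
}
```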
-
Quarky93TongWu[m
A Zen 2 core is about the same size as 4MB of cache. You could slim it down drastically by getting rid of branch prediction and unnecessary instruction extensions.
-
Quarky93TongWu[m
I wonder if a sea-of-processors approach with many small simple cores + SRAM (plain SRAM is much smaller than an associative cache), something like a modern Xeon Phi, would be a viable ASIC.
-
Quarky93TongWu[m
Even though each instance of RandomX is single-threaded, the nature of mining/PoW itself is embarrassingly parallel.
-
Quarky93TongWu[m
Seymour Cray famously said "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?". Looking at the history of parallel computing it's pretty clear the chickens have won.
-
Quarky93TongWu[m
Even so, it would be hard to see a several-orders-of-magnitude efficiency increase with ASICs, so I guess RandomX is still the best PoW in that regard.
-
Quarky93TongWu[m
but scaled to 5nm, with a higher SRAM-to-core ratio
-
Quarky93TongWu[m
I'm not criticising RandomX to be clear... Just vomiting my thoughts from a hardware perspective
-
sech1
You're talking about chip size savings, but efficiency per watt will be approximately the same, at most 2-3x better. Many small cores will also suffer from DRAM access penalties - it's very hard to create an efficient memory controller that can handle thousands of parallel accesses
-
Quarky93TongWu[m
yeah, I agree, the efficiency will not be orders of magnitude more than consumer CPUs, especially when accessing off-chip memory (e.g. DDR4).
-
Quarky93TongWu[m
Although it should be noted that HBM is several times more efficient than DDR4. There is also research into bringing compute to the memory chips themselves.
-
Quarky93TongWu[m
Intel Sapphire Rapids will bring HBM to server CPUs tho
-
Quarky93TongWu[m
Also, if you have many slow cores, that leaves more room for re-ordering memory transactions (within the memory controller), as each program iteration will take more time
-
sech1
The thing is, chips with HBM are prohibitively expensive, even AMD/NVIDIA do this only for their datacenter GPUs
-
sech1
By the time ASIC manufacturers are able to use all this magic tech, AMD/Intel/NVIDIA will already have it in consumer chips
-
Quarky93TongWu[m
it will be more common very soon; fundamentally it doesn't cost any more to produce than DDR, it's just a question of scale.
-
sech1
Also, in RandomX something like 90% of the energy is spent by the CPU
-
hyc
you're talking about flocks of tiny cores, but none of these can do 64-bit arithmetic
-
sech1
it also doesn't make sense to make a core much smaller than the 2 MB of SRAM it needs to use
-
sech1
and that is not a small core already
-
Quarky93TongWu[m
Well, the Epiphany-V does do 64-bit floating point, it just doesn't have enough cache to run RandomX on every core
-
hyc
I'm surprised Apple didn't use HBM for the M1's on-chip memory
-
hyc
of all companies, they could afford to make such a decision, at this point in history
-
sech1
something like a Cortex-A55 modification + 2 MB SRAM instead of data caches, all of that on 5nm, would be optimal
-
Quarky93TongWu[m
<sech1 "it also doesn't make sense to ma"> true, logic transistors are cheaper than memory transistors. however you could still strip out alot of cruft from the typical x86 core and just put in more ALUs
-
hyc
more ALUs doesn't help without more bandwidth
-
sech1
more ALUs is not even needed, just put more cores instead
-
hyc
Intel x86 has always been bandwidth-starved
-
sech1
that moves the problem to a super-efficient memory controller, and those IPs are very expensive and/or well guarded
-
Quarky93TongWu[m
and get rid of the cache tagging, snooping and victim cache logic. still pretty significant savings
-
sech1
Unless they do something like distributed on-chip memory + fast interconnect
-
hyc
getting rid of the cache logic would just make software harder to write. it's been done before.
-
hyc
remember the Cell engine
-
sech1
2-3x overall efficiency improvement over Ryzen or Apple M1 is possible
-
sech1
it'll all be limited by how efficient ALUs they can make
-
Quarky93TongWu[m
<hyc "getting rid of the cache logic w"> yea but if you're just interested in running randomX, it would be worth it
-
sech1
not by scratchpad/dataset access
-
sech1
Apple M1 is already 5nm though
-
sech1
so maybe only 2x over that
-
hyc
no way will any ASIC get to more advanced processes before the big CPU houses do
-
Quarky93TongWu[m
<sech1 "Apple M1 is already 5nm though"> yea but also 80% of the die size are not cores
-
Quarky93TongWu[m
<hyc "no way will any ASIC get to more"> but Bitmain is constantly ahead on nodes, their current S9/Pro is 5nm right?
-
hyc
is it?
-
sech1
7nm
-
Quarky93TongWu[m
the cost structure is simplified a lot when you can just produce and use it yourself: no need to market, ship, pay tariffs...
-
hyc
they are lower priority customers than Apple
-
sech1
"On 27 February, Bitmain officially announced the Antminer S19 and S19 Pro. Equipped with a custom-built 7nm chip from Bitmain"
-
sech1
And even when Bitmain gets hands on 5nm, it'll be used for Bitcoin ASICs
-
Quarky93TongWu[m
<sech1 ""On 27 February, Bitmain officia"> fair enough
-
sech1
5nm is already fully booked for more than a year ahead
-
sech1
currently Apple, soon AMD with NVIDIA
-
Quarky93TongWu[m
<sech1 "And even when Bitmain gets hands"> agreed, unless monero flips Bitcoin haha
-
sech1
by the time Monero flips Bitcoin, it'll be mined by hundreds of millions of CPUs, many of them with "free" electricity
-
sech1
a few tens of thousands of ASICs would be a drop in the ocean
-
sech1
at worst it'll be like Ethereum where ASICs exist but mining stays very profitable for everyone
-
Quarky93TongWu[m
I have a feeling GPU eth miners will be pushed out soon when gas fees settle down
-
sech1
latest RTX GPUs have very high ETH efficiency, almost the same as ASICs
-
Quarky93TongWu[m
Bitmain likes to embellish, but if this is ballpark correct, it's still several times more efficient than RTX 3000
-
Quarky93TongWu[m
3GH/s @ 2556W
-
sech1
I haven't seen official numbers yet
-
hyc
Why would Nvidia bother to create a mining-only product line
-
hyc
they know years in advance what the tech landscape will be
-
Quarky93TongWu[m
<hyc "Why would Nvidia bother to creat"> to avoid used cards flooding the market, affecting sales of their next-gen
-
Quarky93TongWu[m
that's one of the reasons Turing didn't do that well
-
Quarky93TongWu[m
The 1080 Ti and Pascal in general were cheap and available, in part because they were good cards, but also due to the mining boom in 2017 and crash in 2018
-
Quarky93TongWu[m
anyway, I don't want to come off as critical of RandomX, you guys really came up with something cool here, and it seems to be the best ASIC-resistant solution so far.
-
hyc
it's important to keep watching the tech space to see where challenges could arise
-
hyc
but we already do that...
-
Quarky93TongWu[m
<hyc "it's important to keep watching "> agreed 🙂
-
Quarky93TongWu[m
if we get a RISC-V JIT compiler, I'd like to do some simulations in QEMU for fun :P
-
sech1
it's in the plans
-
sech1
can QEMU run some RISC-V?
-
Quarky93TongWu[m
yep
-
sech1
someone promised to send me RISC-V board:
xmrig/xmrig #1924
-
sech1
let's see how it goes, I haven't heard anything yet
-
Quarky93TongWu[m
RISC-V support is in upstream QEMU now.
-
Quarky93TongWu[m
it's also possible to put a Linux-bootable RISC-V core on an FPGA
-
Quarky93TongWu[m
tho it's not optimized for it
-
yanmaani
Has anyone tried to run a RandomX program through Clang or GCC and see what they find?
-
yanmaani
Or do conventional implementations already optimize it as best they can?
-
yanmaani
I see there's code to generate C code in tevador/RandomX
-
hyc
not relevant
-
yanmaani
???
-
yanmaani
For example, a C compiler might be able to do smart register scheduling, or elimination of dead code, or similar, better than a naïve jit compiler could
-
moneromooo
Fast enough to be of any use ?
-
yanmaani
moneromooo: well that's my question - anyone tried it?
-
hyc
and the answer is no.
-
hyc
this is already explained in the Design doc.
-
hyc
there isn't enough time for complex optimizations
-
hyc
have you ever timed the startup time of clang? just to do nothing at all but to page the executables into memory and start running?
-
yanmaani
hyc: Right, but if clang is able to find optimizations, then it seems easy-ish to make a tiny "compiler" with only the applicable optimizations too.
-
hyc
lol no
-
hyc
none of the optimizations you mention are easy. they all require multiple-pass analysis
-
yanmaani
not necessarily, of course, but if the optimizations turn out to be simple ones...?
-
hyc
then they're of low impact
-
hyc
dead code elimination, seriously?
-
hyc
that requires complete control-flow analysis of the code start-to-finish
-
hyc
by the time you did that, the entire code could have executed in a JIT already
-
moneromooo
peephole is most likely single pass.
-
Quarky93TongWu[m
Does the JIT currently do any optimisations at all or is it essentially a direct translation?
-
hyc
very little
-
hyc
peephole could do some instruction fusion, yeah
-
Quarky93TongWu[m
Most uArch front ends do fusion anyways
-
hyc
but that's not single-pass. Once you've identified points to optimize, you must go through and relocate all affected address references
-
hyc
right - so it's usually wasted effort
-
moneromooo
Oh, good point.
-
hyc
remember, I was a gcc maintainer for ~10 years. I've been there and done that, many times.
-
hyc
you will always lose more time in code translation than you will regain in code execution time
-
hyc
AOT optimization is cost-effective when the resulting code is reused a lot.
-
hyc
a RandomX program for a given nonce only gets used once, then it's discarded forever
-
sech1
yanmaani hyc speaking of micro-optimizations, there are some in xmrig JIT compiler
-
sech1
like removing redundant CFROUND instructions
-
sech1
but they are all single-pass O(1) complexity code, otherwise it's too slow for JIT
-
hyc
right
-
sech1
removing CFROUND gave only 0.05% speedup
-
sech1
because not many programs have more than 1 CFROUND without FP instructions between them
-
hyc
yeah, talk about edge cases
-
hyc
we don't have a NOP opcode do we
-
sech1
we do, but it has frequency 0
-
sech1
even removing CFROUND is done by overwriting it with NOPs because moving generated code and fixing offsets is too slow
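To make the optimization concrete, here is a minimal single-pass sketch of the idea described above. It is not xmrig's actual code: decoding of the 8-byte RandomX instruction word is simplified to an assumed pre-decoded type field.

```cpp
// Sketch of the single-pass "redundant CFROUND" peephole (not xmrig's code).
// A CFROUND followed by another CFROUND with no floating-point instruction
// in between never has its rounding mode observed, so the JIT can emit NOPs
// for it (overwriting with NOPs avoids having to fix up jump offsets).
#include <cstddef>
#include <cstdint>
#include <vector>

enum class OpType : uint8_t { CFROUND, FloatOp, IntOp, Branch, Other };

struct DecodedInsn {
    OpType   type;   // pre-decoded from the opcode byte (assumption)
    uint32_t imm32;  // rest of the 8-byte instruction word omitted for brevity
};

std::vector<bool> findRedundantCfround(const std::vector<DecodedInsn>& prog) {
    std::vector<bool> redundant(prog.size(), false);
    ptrdiff_t last = -1;                        // latest CFROUND not yet "consumed"
    for (size_t i = 0; i < prog.size(); ++i) {  // one forward pass, O(n)
        if (prog[i].type == OpType::CFROUND) {
            if (last >= 0) redundant[last] = true;  // its rounding mode was never used
            last = static_cast<ptrdiff_t>(i);
        } else if (prog[i].type == OpType::FloatOp) {
            last = -1;                          // rounding mode was consumed
        }
    }
    return redundant;
}
```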
-
sech1
JIT compiler time constraints are very tight
-
hyc
yeah as I'd expect
-
sech1
of course, if there's some peephole optimization that gives +10% in the main loop, the JIT can spend a lot of time doing it. But I'm not aware of any such optimization and I spent 17 months tinkering with RandomX :D
-
sech1
actually more than 17 months, 2 years already
-
yanmaani
So what speedup does a modern compiler get if you allow it to optimize the main loop for 2 years straight on a modern supercomputer?
-
sech1
I haven't seen a compiler that can beat asm optimization by hand, given the programmer is experienced enough
-
sech1
and I can't see many places in the generated code that can be optimized
-
yanmaani
what would hand-optimizing the asm get you then?
-
sech1
Maybe +1-2% if you can spot places where instructions cancel each other
-
sech1
RandomX programs have a lot of branch instructions which makes them a "train" of short pieces of code divided by branches
-
sech1
not much can be done in each piece
-
sech1
this is all valid for modern superscalar out-of-order CPUs
-
sech1
there are a lot of things that can be done for in-order CPUs like Cortex A53
-
sech1
RandomX JIT doesn't reorder instructions at all and relies on CPU to do that
-
yanmaani
sech1: Does the optimization for A53 improve things, or just bring it back to the level of an OOE CPU?
-
sech1
so I estimate it's possible to get +50% by hand-optimizing generated code for A53
-
sech1
it doesn't matter much because 5 h/s vs 7.5 h/s is laughable anyway
-
yanmaani
aren't A53s much cheaper though?
-
hyc
not enough cheaper to be worth that
-
sech1
up to 8 MB cache per cluster, so a cluster of 4 cores + 8 MB cache would be perfect
-
Quarky93[m]
<sech1 "RandomX JIT doesn't reorder inst"> So in theory the Apple M1 should do an amazing job. If it wasn’t on macOS >.>
-
sech1
plus wide out of order design
-
sech1
Cortex-X1, Apple M1 are the most efficient per watt on RandomX right now
-
sech1
mostly thanks to 5nm node they use
-
yanmaani
does anyone sell boards with like a trillion ARM processors on them
-
Quarky93[m]
Extremely deep reorder buffer plus very wide decode stage
-
yanmaani
kind of like xeon phi but chinese
-
Quarky93[m]
<yanmaani "kind of like xeon phi but chines"> Mmm the Japanese industry always make interesting stuff
-
hyc
the closest would be the Phytium FT-2000, a 64-core chip
-
Quarky93[m]
The A64FX comes to mind, but it's very expensive - a low-volume part only for their Fugaku supercomputer
-
hyc
but production is canceled now due to US trade sanctions
-
Quarky93[m]
There’s also Xilinx Versal ACAP. With 400 VLIW cores + FPGA but it suffers from only 100Mb of cache for the VLIW cores
-
Quarky93[m]
191Mb if you count the FPGA block RAMs, but those are a few clock cycles away
-
hyc
FPGAs won't cut it
-
Quarky93[m]
I know but these are hardened cores
-
hyc
we've discussed Xilinx Versal in here before
-
hyc
the cores aren't adequate, even if you interfaced to enough external RAM
-
yanmaani
as monero gets bigger, would it eventually spur cheap bulk CPUs specifically optimized for randomx?
-
hyc
maybe.
-
hyc
I suppose if you treat the majority of cores as co-processors instead of peer CPUs, the control network would be simpler
-
hyc
that's still the major limiting factor in putting more cores on a chip
-
hyc
these cores don't need to run any OS, just point them at a chunk of dataset RAM and a chunk of randomx code and let them run
-
hyc
so they don't need a lot of the multi-processor comms support that would usually be needed
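Purely as an illustration of the "no OS, just point them at a chunk of dataset RAM and a chunk of RandomX code" idea, a hypothetical per-core work descriptor; nothing like this exists, and every name in the sketch is invented:

```cpp
// Hypothetical work descriptor for the OS-less co-processor scheme sketched
// above. Entirely illustrative: no such hardware or ABI exists.
#include <cstdint>

struct alignas(64) WorkerDescriptor {
    const uint8_t* dataset;         // shared read-only RandomX dataset (~2 GB today)
    uint8_t*       scratchpad;      // this core's private 2 MB scratchpad (local SRAM)
    const uint8_t* code;            // translated RandomX program for the current nonce
    uint64_t       nonce;           // nonce assigned to this core
    volatile uint32_t start;        // host sets to 1 to launch the core
    volatile uint32_t done;         // core sets to 1 when result_hash is valid
    uint8_t        result_hash[32]; // final hash written back for the host to check
};
```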
-
Quarky93[m]
I guess a many-core CPU designed for RandomX could also be binned extremely well, as really no individual core is critical
-
Quarky93[m]
memory controllers also don't need to feed cores on the other side of the chip, just the ones in their local area, as long as there is at least 2GB per memory controller
-
hyc
we will probably up the RAM requirement to 4GB within a year or two
-
Quarky93[m]
ah ok
-
Quarky93[m]
seems reasonable, is there any reason not to go 8GB?
-
hyc
I wouldn't do it yet. 4GB is still feasible for a lot of smartphones. 8GB is pushing it.
-
yanmaani
8GB is also infeasible for most computers running an OS
-
hyc
true
-
Quarky93[m]
does anyone really still run 8GB with a CPU fast enough to be profitable with mining?
-
Quarky93[m]
aside from base model macbook >.>
-
hyc
I ordered my M1 macbook with 16GB RAM, but the base model is only 8GB
-
hyc
and just booting up to the desktop, 11GB of RAM is in use
-
hyc
I suppose a lot of that is swapped out on an 8GB machine
-
hyc
would not be a pleasant experience
-
Quarky93[m]
fair enough, if you were actively using the computer while mining it'd be pretty terrible, even for a 16GB machine
-
gingeropolous
could you make a randomx coprocessor that the primary CPU could use effectively?
-
gingeropolous
like, where else in contemporary computing do you have a series of random functions that need to be executed as fast as possible?
-
gingeropolous
i guess if you used machine learning to optimize some signal to output relationship
-
gingeropolous
but is that what you were talking about in different words?
-
gingeropolous
probably a torus
-
yanmaani
gingeropolous: isn't a "randomx coprocessor" just a processor?
-
gingeropolous
well, like y'all were saying, you rip out all the stuff that ends up useless, like the branch prediction. I remember seeing images of CPU dies, and the branch predictors eat up so much area
-
yanmaani
branch prediction is sorta needed though?
-
hyc
yes
-
hyc
the stuff I would rip out is everything that supports an OS - interrupt handlers, permission levels
-
yanmaani
does that even use a lot of die space?
-
hyc
if you look at all of the security bugs Intel has had to deal with in the past year or two
-
hyc
spectre etc... yes.
-
yanmaani
I'm sort of wondering why supercomputers don't do that. I mean, they don't have to bother with permissions right?
-
hyc
the reason those bugs existed is because Intel tried to cut corners where they shouldn't have
-
hyc
supercomputers still tend to run multiuser OSs so they still have to deal with that
-
hyc
Crays ran Unix. modern supercomputers run linux
-
yanmaani
hyc: sure, cause it's a lower cost
-
yanmaani
develop new OS + develop interrupt-less CPU < buy normal CPUs and use normal Linux
-
yanmaani
cause I think Linux can run on e.g. MMU-less systems, so it seems like cutting protection would be a no-brainer if it had significant savings
-
hyc
hm. it's been a long time since I tried to run linux on a machine with no MMU, dunno if that's still supported
-
yanmaani
it's supported by linux, but not by glibc
-
yanmaani
so you have to run uclibc, and you have certain limitations on mmap etc
-
hyc
ah
-
hyc
cool
-
hyc
glibc is such a pig
-
hyc
the other day I wrote a little toy that was 300 bytes long. 16K when linked with glibc
-
hyc
I rewrote it to use syscall() instead of the usual library calls, got it back down to 500 bytes
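As a sketch of that kind of toy (not hyc's actual program), assuming Linux: write and exit via raw syscalls instead of stdio. Getting down to a few hundred bytes would additionally need -nostdlib, a hand-written _start and a trimmed ELF layout, which is beyond this snippet.

```cpp
// Minimal "syscall() instead of the usual library calls" toy, Linux only.
// Avoids pulling in stdio; shrinking the final binary to a few hundred
// bytes additionally requires linking without the usual C runtime.
#include <unistd.h>
#include <sys/syscall.h>

int main() {
    static const char msg[] = "hello\n";
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);  // write(2) invoked directly
    syscall(SYS_exit, 0);                         // exit(2) invoked directly
}
```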
-
yanmaani
why not just use musl?