-
yanmaani
hyc: I wasn't accusing you of anything, but it wasn't in design.md so...
-
yanmaani
Has anyone tried to run a RandomX program through Clang/GCC, see what they find?
-
Quarky93TongWu[m
So TSMC 5nm is capable of 32Mb/mm^2 SRAM (with control logic); a chip could be made with hundreds of RandomX instances. It would be expensive of course, but Bitmain already uses 5nm and is probably taping out 3nm as we speak (for Bitcoin of course).
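As a rough sanity check of the figures in this claim, a sketch using only the numbers quoted in the chat (32 Mb/mm^2 density, 2 MB scratchpad per RandomX instance); the die size is an assumption for illustration:

```cpp
// Back-of-the-envelope check of the "hundreds of instances" claim,
// using the figures quoted in this chat. The 600 mm^2 die is an assumption.
#include <cstdio>

int main() {
    const double sram_density = 32.0;        // Mbit per mm^2, incl. control logic
    const double scratchpad   = 2.0 * 8.0;   // 2 MB per RandomX instance = 16 Mbit
    const double mm2_per_inst = scratchpad / sram_density;  // ~0.5 mm^2 of SRAM
    const double die_mm2      = 600.0;       // large-GPU-class die (assumption)
    printf("SRAM per instance: %.2f mm^2\n", mm2_per_inst);
    printf("Scratchpads per %.0f mm^2 die: %.0f\n", die_mm2, die_mm2 / mm2_per_inst);
    // ~1200 scratchpads of raw SRAM; after cores, interconnect and memory
    // controllers, "hundreds of instances" per die is the right order of magnitude.
    return 0;
}
```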
-
Quarky93TongWu[m
A Zen 2 core is about the same size as 4MB of cache. You could slim it down drastically by getting rid of branch prediction and unnecessary instruction extensions.
-
Quarky93TongWu[m
I wonder if a sea-of-processors approach with many small simple cores + SRAM (plain SRAM is much smaller than an associative cache), something like a modern Xeon Phi, would be a viable ASIC.
-
Quarky93TongWu[m
Even though each instance of RandomX is single-threaded, the nature of mining/PoW itself is embarrassingly parallel.
-
Quarky93TongWu[m
Seymour Cray famously said "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?". Looking at the history of parallel computing it's pretty clear the chickens have won.
-
Quarky93TongWu[m
Even so, it would be hard to see a several-orders-of-magnitude efficiency increase with ASICs, so I guess RandomX is still the best PoW in that regard.
-
Quarky93TongWu[m
but scaled to 5nm, with a higher SRAM-to-core ratio
-
Quarky93TongWu[m
I'm not criticising RandomX to be clear... Just vomiting my thoughts from a hardware perspective
-
sech1
You're talking about chip size savings, but efficiency per watt will be approximately the same, at most 2-3x better. Many small cores will also suffer from DRAM access penalties - it's very hard to create an efficient memory controller that can handle thousands of parallel accesses
-
Quarky93TongWu[m
yeah, I agree, the efficiency will not be orders of magnitude more than consumer CPUs, especially when accessing off-chip memory (e.g. DDR4).
-
Quarky93TongWu[m
Although it should be noted that HBM is several times more efficient than DDR4. There is also research into bringing compute to the memory chips themselves.
-
Quarky93TongWu[m
Intel Sapphire Rapids will bring HBM to server CPUs tho
-
Quarky93TongWu[m
Also, if you have many slow cores, that leaves more room for re-ordering memory transactions (within the memory controller), as each program iteration will take more time
-
sech1
The thing is, chips with HBM are prohibitively expensive, even AMD/NVIDIA do this only for their datacenter GPUs
-
sech1
By the time ASIC manufacturers are able to use all this magic tech, AMD/Intel/NVIDIA will already have it in consumer chips
-
Quarky93TongWu[m
it will be more common very soon; fundamentally it doesn't cost any more to produce than DDR, it's just a question of scale.
-
sech1
Also, in RandomX something like 90% of the energy is spent by the CPU
-
hyc
you're talking about flocks of tiny cores, but none of these can do 64-bit arithmetic
-
sech1
it also doesn't make sense to make a core much smaller than the 2 MB of SRAM it needs to use
-
sech1
and that is not a small core already
-
Quarky93TongWu[m
Well, the Epiphany-V does do 64-bit floating point, it just doesn't have enough cache to run RandomX on every core
-
hyc
I'm surprised Apple didn't use HBM for the M1's on-chip memory
-
hyc
of all companies, they could afford to make such a decision, at this point in history
-
sech1
something like a Cortex-A55 modification + 2 MB SRAM instead of data caches, all of that on 5nm, would be optimal
-
Quarky93TongWu[m
<sech1 "it also doesn't make sense to ma"> true, logic transistors are cheaper than memory transistors. however you could still strip out alot of cruft from the typical x86 core and just put in more ALUs
-
hyc
more ALUs doesn't help without more bandwidth
-
sech1
more ALUs is not even needed, just put more cores instead
-
hyc
Intel x86 has always been bandwidth-starved
-
sech1
that moves the problem to a super-efficient memory controller, and those IPs are very expensive and/or well guarded
-
Quarky93TongWu[m
and get rid of the cache tagging, snooping and victim cache logic. still pretty significant savings
-
sech1
Unless they do something like distributed on-chip memory + fast interconnect
-
hyc
getting rid of the cache logic would just make software harder to write. it's been done before.
-
hyc
remember the Cell engine
-
sech1
2-3x overall efficiency improvement over Ryzen or Apple M1 is possible
-
sech1
it'll all be limited by how efficient ALUs they can make
-
Quarky93TongWu[m
<hyc "getting rid of the cache logic w"> yea but if you're just interested in running randomX, it would be worth it
-
sech1
not by scratchpad/dataset access
-
sech1
Apple M1 is already 5nm though
-
sech1
so maybe only 2x over that
-
hyc
no way will any ASIC get to more advanced processes before the big CPU houses do
-
Quarky93TongWu[m
<sech1 "Apple M1 is already 5nm though"> yea but also 80% of the die size are not cores
-
Quarky93TongWu[m
<hyc "no way will any ASIC get to more"> but Bitmain is constantly ahead on nodes, their current S9/Pro is 5nm right?
-
hyc
is it?
-
sech1
7nm
-
Quarky93TongWu[m
the cost structure is simplified a lot when you can just produce and use it yourself: no need to market, ship, pay tariffs...
-
hyc
they are lower priority customers than Apple
-
sech1
"On 27 February, Bitmain officially announced the Antminer S19 and S19 Pro. Equipped with a custom-built 7nm chip from Bitmain"
-
sech1
And even when Bitmain gets hands on 5nm, it'll be used for Bitcoin ASICs
-
Quarky93TongWu[m
<sech1 ""On 27 February, Bitmain officia"> fair enough
-
sech1
5nm is already fully booked for more than a year ahead
-
sech1
currently Apple, soon AMD with NVIDIA
-
Quarky93TongWu[m
<sech1 "And even when Bitmain gets hands"> agreed, unless monero flips Bitcoin haha
-
sech1
by the time Monero flips Bitcoin, it'll be mined by hundreds of millions of CPUs, many of them with "free" electricity
-
sech1
a few tens of thousands of ASICs would be a drop in the ocean
-
sech1
at worst it'll be like Ethereum where ASICs exist but mining stays very profitable for everyone
-
Quarky93TongWu[m
I have a feeling GPU eth miners will be pushed out soon when gas fees settle down
-
sech1
latest RTX GPUs have very high ETH efficiency, almost the same as ASICs
-
Quarky93TongWu[m
Bitmain likes to embellish, but if this is ballpark correct, it's still several times more efficient than RTX 3000
-
Quarky93TongWu[m
3GH/s @ 2556W
-
sech1
I haven't seen official numbers yet
-
hyc
Why would Nvidia bother to create a mining-only product line
-
hyc
they know years in advance what the tech landscape will be
-
Quarky93TongWu[m
<hyc "Why would Nvidia bother to creat"> to avoid used cards flooding the market, affecting sales of their next-gen
-
Quarky93TongWu[m
that's one of the reasons Turing didn't do that well
-
Quarky93TongWu[m
The 1080 Ti and Pascal in general were cheap and available, in part because they were good cards, but also due to the mining boom in 2017 and crash in 2018
-
Quarky93TongWu[m
anyway, I don't want to come off as critical of RandomX, you guys really came up with something cool here, and it seems to be the best ASIC-resistant solution so far.
-
hyc
it's important to keep watching the tech space to see where challenges could arise
-
hyc
but we already do that...
-
Quarky93TongWu[m
<hyc "it's important to keep watching "> agreed 🙂
-
Quarky93TongWu[m
if we get a RISC-V JIT compiler, I'd like to do some simulations in QEMU for fun :P
-
sech1
it's in the plans
-
sech1
can QEMU run some RISC-V?
-
Quarky93TongWu[m
yep
-
sech1
someone promised to send me RISC-V board:
xmrig/xmrig #1924
-
sech1
let's see how it goes, I haven't heard anything yet
-
Quarky93TongWu[m
RISC-V support is in upstream QEMU now.
-
Quarky93TongWu[m
it's also possible to put a Linux-bootable RISC-V core on an FPGA
-
Quarky93TongWu[m
tho it's not optimized for it
-
yanmaani
Has anyone tried to run a RandomX program through Clang or GCC and see what they find?
-
yanmaani
Or do conventional implementations already optimize it as best they can?
-
yanmaani
I see there's code to generate C code in tevador/RandomX
-
hyc
not relevant
-
yanmaani
???
-
yanmaani
For example, a C compiler might be able to do smart register scheduling, or elimination of dead code, or similar, better than a naïve jit compiler could
-
moneromooo
Fast enough to be of any use ?
-
yanmaani
moneromooo: well that's my question - anyone tried it?
-
hyc
and the answer is no.
-
hyc
this is already explained in the Design doc.
-
hyc
there isn't enough time for complex optimizations
-
hyc
have you ever timed the startup time of clang? just to do nothing at all but to page the executables into memory and start running?
-
yanmaani
hyc: Right, but if clang is able to find optimizations, then it seems easy-ish to make a tiny "compiler" with only the applicable optimizations too.
-
hyc
lol no
-
hyc
none of the optimizations you mention are easy. they all require multiple-pass analysis
-
yanmaani
not necessarily, of course, but if the optimizations turn out to be simple ones...?
-
hyc
then they're of low impact
-
hyc
dead code elimination, seriously?
-
hyc
that requires complete control-flow analysis of the code start-to-finish
-
hyc
by the time you did that, the entire code could have executed in a JIT already
-
moneromooo
peephole is most likely single pass.
-
Quarky93TongWu[m
Does the JIT currently do any optimisations at all or is it essentially a direct translation?
-
hyc
very little
-
hyc
peephole could do some instruction fusion, yeah
-
Quarky93TongWu[m
Most uArch front ends do fusion anyways
-
hyc
but that's not single-pass. Once you've identified points to optimize, you must go through and relocate all affected address references
-
hyc
right - so it's usually wasted effort
-
moneromooo
Oh, good point.
-
hyc
remember, I was a gcc maintainer for ~10 years. I've been there and done that, many times.
-
hyc
you will always lose more time in code translation than you will regain in code execution time
-
hyc
AOT optimization is cost-effective when the resulting code is reused a lot.
-
hyc
a RandomX program for a given nonce only gets used once, then it's discarded forever
-
sech1
yanmaani hyc speaking of micro-optimizations, there are some in xmrig JIT compiler
-
sech1
like removing redundant CFROUND instructions
-
sech1
but they are all single-pass O(1) complexity code, otherwise it's too slow for JIT
-
hyc
right
-
sech1
removing CFROUND gave only 0.05% speedup
-
sech1
because not many programs have more than 1 CFROUND without FP instructions between them
-
hyc
yeah, talk about edge cases
-
hyc
we don't have a NOP opcode do we
-
sech1
we do, but it has frequency 0
-
sech1
even removing CFROUND is done by overwriting it with NOPs because moving generated code and fixing offsets is too slow
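To make the optimization concrete, here is a minimal single-pass sketch of the idea described above. It is not xmrig's actual code: decoding of the 8-byte RandomX instruction word is simplified to an assumed pre-decoded type field.

```cpp
// Sketch of the single-pass "redundant CFROUND" peephole (not xmrig's code).
// A CFROUND followed by another CFROUND with no floating-point instruction
// in between never has its rounding mode observed, so the JIT can emit NOPs
// for it (overwriting with NOPs avoids having to fix up jump offsets).
#include <cstddef>
#include <cstdint>
#include <vector>

enum class OpType : uint8_t { CFROUND, FloatOp, IntOp, Branch, Other };

struct DecodedInsn {
    OpType   type;   // pre-decoded from the opcode byte (assumption)
    uint32_t imm32;  // rest of the 8-byte instruction word omitted for brevity
};

std::vector<bool> findRedundantCfround(const std::vector<DecodedInsn>& prog) {
    std::vector<bool> redundant(prog.size(), false);
    ptrdiff_t last = -1;                        // latest CFROUND not yet "consumed"
    for (size_t i = 0; i < prog.size(); ++i) {  // one forward pass, O(n)
        if (prog[i].type == OpType::CFROUND) {
            if (last >= 0) redundant[last] = true;  // its rounding mode was never used
            last = static_cast<ptrdiff_t>(i);
        } else if (prog[i].type == OpType::FloatOp) {
            last = -1;                          // rounding mode was consumed
        }
    }
    return redundant;
}
```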
-
sech1
JIT compiler time constraints are very tight
-
hyc
yeah as I'd expect
-
sech1
of course, if there's some peephole optimization that gives +10% in the main loop, the JIT can spend a lot of time doing it. But I'm not aware of any such optimization and I spent 17 months tinkering with RandomX :D
-
sech1
actually more than 17 months, 2 years already
-
yanmaani
So what speedup does a modern compiler get if you allow it to optimize the main loop for 2 years straight on a modern supercomputer?
-
sech1
I haven't seen a compiler that can beat asm optimization by hand, given the programmer is experienced enough
-
sech1
and I can't see many places in the generated code that can be optimized
-
yanmaani
what would hand-optimizing the asm get you then?
-
sech1
Maybe +1-2% if you can spot places where instructions cancel each other
-
sech1
RandomX programs have a lot of branch instructions which makes them a "train" of short pieces of code divided by branches
-
sech1
not much can be done in each piece
-
sech1
this is all valid for modern superscalar out-of-order CPUs
-
sech1
there are a lot of things that can be done for in-order CPUs like Cortex A53
-
sech1
RandomX JIT doesn't reorder instructions at all and relies on CPU to do that
-
yanmaani
sech1: Does the optimization for A53 improve things, or just bring it back to the level of an OOE CPU?
-
sech1
so I estimate it's possible to get +50% by hand-optimizing generated code for A53
-
sech1
it doesn't matter much because 5 h/s vs 7.5 h/s is laughable anyway
-
yanmaani
aren't A53s much cheaper though?
-
hyc
not enough cheaper to be worth that
-
sech1
up to 8 MB cache per cluster, so a cluster of 4 cores + 8 MB cache would be perfect
-
Quarky93[m]
<sech1 "RandomX JIT doesn't reorder inst"> So in theory the Apple M1 should do an amazing job. If it wasn’t on macOS >.>
-
sech1
plus wide out of order design
-
sech1
Cortex-X1, Apple M1 are the most efficient per watt on RandomX right now
-
sech1
mostly thanks to 5nm node they use
-
yanmaani
does anyone sell boards with like a trillion ARM processors on them
-
Quarky93[m]
Extremely deep reorder buffer plus very wide decode stage
-
yanmaani
kind of like xeon phi but chinese
-
Quarky93[m]
<yanmaani "kind of like xeon phi but chines"> Mmm the Japanese industry always make interesting stuff
-
hyc
the closest would be the Phytium FT-2000, a 64-core chip
-
Quarky93[m]
The A64FX comes to mind, but it's very expensive - a low-volume part only for their Fugaku supercomputer
-
hyc
but production is canceled now due to US trade sanctions
-
Quarky93[m]
There’s also Xilinx Versal ACAP. With 400 VLIW cores + FPGA but it suffers from only 100Mb of cache for the VLIW cores
-
Quarky93[m]
191Mb if you count the FPGA block RAMs, but those are a few clock cycles away
-
hyc
FPGAs won't cut it
-
Quarky93[m]
I know but these are hardened cores
-
hyc
we've discussed Xilinx Versal in here before
-
hyc
the cores aren't adequate, even if you interfaced to enough external RAM
-
yanmaani
as monero gets bigger, would it eventually spur cheap bulk CPUs specifically optimized for randomx?
-
hyc
maybe.
-
hyc
I suppose if you treat the majority of cores as co-processors instead of peer CPUs, the control network would be simpler
-
hyc
that's still the major limiting factor in putting more cores on a chip
-
hyc
these cores don't need to run any OS, just point them at a chunk of dataset RAM and a chunk of randomx code and let them run
-
hyc
so they don't need a lot of the multi-processor comms support that would usually be needed
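Purely as an illustration of the "no OS, just point them at a chunk of dataset RAM and a chunk of RandomX code" idea, a hypothetical per-core work descriptor; nothing like this exists, and every name in the sketch is invented:

```cpp
// Hypothetical work descriptor for the OS-less co-processor scheme sketched
// above. Entirely illustrative: no such hardware or ABI exists.
#include <cstdint>

struct alignas(64) WorkerDescriptor {
    const uint8_t* dataset;         // shared read-only RandomX dataset (~2 GB today)
    uint8_t*       scratchpad;      // this core's private 2 MB scratchpad (local SRAM)
    const uint8_t* code;            // translated RandomX program for the current nonce
    uint64_t       nonce;           // nonce assigned to this core
    volatile uint32_t start;        // host sets to 1 to launch the core
    volatile uint32_t done;         // core sets to 1 when result_hash is valid
    uint8_t        result_hash[32]; // final hash written back for the host to check
};
```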
-
Quarky93[m]
I guess a many-core CPU designed for RandomX could also be binned extremely well, as really no individual core is critical
-
Quarky93[m]
memory controllers also don't need to feed cores on the other side of the chip, just the ones in their local area, as long as there is at least 2GB per memory controller
-
hyc
we will probably up the RAM requirement to 4GB within a year or two
-
Quarky93[m]
ah ok
-
Quarky93[m]
seems reasonable, is there any reason not to go 8GB?
-
hyc
I wouldn't do it yet. 4GB is still feasible for a lot of smartphones. 8GB is pushing it.
-
yanmaani
8GB is also infeasible for most computers running an OS
-
hyc
true
-
Quarky93[m]
does anyone really still run 8GB with a CPU fast enough to be profitable with mining?
-
Quarky93[m]
aside from base model macbook >.>
-
hyc
I ordered my M1 macbook with 16GB RAM, but the base model is only 8GB
-
hyc
and just booting up to the desktop, 11GB of RAM is in use
-
hyc
I suppose a lot of that is swapped out on an 8GB machine
-
hyc
would not be a pleasant experience
-
Quarky93[m]
fair enough, if you were actively using the computer while mining it'd be pretty terrible, even for a 16GB machine
-
gingeropolous
could you make a randomx coprocessor that the primary CPU could use effectively?
-
gingeropolous
like, where else in contemporary computing do you have a series of random functions that need to be executed as fast as possible?
-
gingeropolous
i guess if you used machine learning to optimize some signal to output relationship
-
gingeropolous
but is that what you were talking about in different words?
-
gingeropolous
probably a torus
-
yanmaani
gingeropolous: isn't a "randomx coprocessor" just a processor?
-
gingeropolous
well, like y'all were saying, you rip out all the stuff that ends up useless, like the branch prediction. I remember seeing images of CPU dies, and the branch predictors eat up so much area
-
yanmaani
branch prediction is sorta needed though?
-
hyc
yes
-
hyc
the stuff I would rip out is everything that supports an OS - interrupt handlers, permission levels
-
yanmaani
does that even use a lot of die space?
-
hyc
if you look at all of the security bugs Intel has had to deal with in the past year or two
-
hyc
spectre etc... yes.
-
yanmaani
I'm sort of wondering why supercomputers don't do that. I mean, they don't have to bother with permissions right?
-
hyc
the reason those bugs existed is because Intel tried to cut corners where they shouldn't have
-
hyc
supercomputers still tend to run multiuser OSs so they still have to deal with that
-
hyc
Crays ran Unix. modern supercomputers run linux
-
yanmaani
hyc: sure, cause it's a lower cost
-
yanmaani
develop new OS + develop interrupt-less CPU < buy normal CPUs and use normal Linux
-
yanmaani
cause I think Linux can run on e.g. MMU-less systems, so it seems like cutting protection would be a no-brainer if it had significant savings
-
hyc
hm. it's been a long time since I tried to run linux on a machine with no MMU, dunno if that's still supported
-
yanmaani
it's supported by linux, but not by glibc
-
yanmaani
so you have to run uclibc, and you have certain limitations on mmap etc
-
hyc
ah
-
hyc
cool
-
hyc
glibc is such a pig
-
hyc
the other day I wrote a little toy that was 300 bytes long. 16K when linked with glibc
-
hyc
I rewrote it to use syscall() instead of the usual library calls, got it back down to 500 bytes
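As a sketch of that kind of toy (not hyc's actual program), assuming Linux: write and exit via raw syscalls instead of stdio. Getting down to a few hundred bytes would additionally need -nostdlib, a hand-written _start and a trimmed ELF layout, which is beyond this snippet.

```cpp
// Minimal "syscall() instead of the usual library calls" toy, Linux only.
// Avoids pulling in stdio; shrinking the final binary to a few hundred
// bytes additionally requires linking without the usual C runtime.
#include <unistd.h>
#include <sys/syscall.h>

int main() {
    static const char msg[] = "hello\n";
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);  // write(2) invoked directly
    syscall(SYS_exit, 0);                         // exit(2) invoked directly
}
```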
-
yanmaani
why not just use musl?