00:34:09 hyc: I wasn't accusing you of anything, but it wasn't in design.md so...
00:34:29 Has anyone tried to run a RandomX program through Clang/GCC, see what they find?
04:17:15 So TSMC 5nm is capable of 32Mb/mm^2 SRAM (with control logic); a chip could be made with hundreds of RandomX instances. It would be expensive of course, but Bitmain already uses 5nm and is probably taping out 3nm as we speak (for Bitcoin of course).
04:20:14 A Zen 2 core is about the same size as 4MB of cache. You could slim it down drastically by getting rid of branch prediction and unnecessary instruction extensions.
04:33:49 I wonder if a sea-of-processors approach with many small, simple cores + SRAM (SRAM is much smaller than associative cache), something like a modern Xeon Phi, would be a viable ASIC.
04:33:49 Even though each instance of RandomX is single-threaded, the nature of mining/PoW itself is embarrassingly parallel.
04:35:39 Seymour Cray famously said "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?". Looking at the history of parallel computing, it's pretty clear the chickens have won.
04:40:16 Even so, it would be hard to see a several-orders-of-magnitude efficiency increase with ASICs, so I guess RandomX is still the best PoW in that regard.
04:48:03 im thinking something like this https://www.parallella.org/docs/e5_1024core_soc.pdf
04:48:04 but scaled to 5nm, with a higher SRAM-to-core ratio
04:55:01 I'm not criticising RandomX, to be clear... just vomiting my thoughts from a hardware perspective
07:50:33 You're talking about chip size savings, but efficiency per watt will be approximately the same, at most 2-3x better. Many small cores will also suffer from DRAM access penalties - it's very hard to create an efficient memory controller that can handle thousands of parallel accesses.
07:58:37 yea, i agree, the efficiency will not be orders of magnitude more than consumer CPUs. Especially when accessing off-chip memory (e.g. DDR4).
07:58:37 Although it should be noted that HBM is several times more efficient than DDR4. There is also research into bringing compute to the memory chips themselves.
07:58:41 https://news.samsung.com/global/samsung-develops-industrys-first-high-bandwidth-memory-with-ai-processing-power
07:59:06 Intel Sapphire Rapids will bring HBM to server CPUs tho
08:01:03 Also, if you have many slow cores, that leaves more room for re-ordering memory transactions (within the memory controller), as each program iteration will take more time
08:02:45 The thing is, chips with HBM are prohibitively expensive; even AMD/NVIDIA do this only for their datacenter GPUs
08:03:29 By the time ASIC manufacturers are able to use all this magic tech, AMD/Intel/NVIDIA will already have it in consumer chips
08:03:55 it will be more common very soon; fundamentally it's not any more costly to produce than DDR, it's just a question of scale.
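[Editor's note: the SRAM-density argument above lends itself to a quick sanity check. This is a back-of-envelope sketch using only the figures quoted in the chat (32 Mb/mm^2 at TSMC 5nm, 2 MB of scratchpad SRAM per RandomX instance); the per-core area is a placeholder assumption, not a measured number.]

```python
# Back-of-envelope die-area estimate for the "hundreds of RandomX
# instances in SRAM" idea. Density and scratchpad size are the figures
# quoted in the discussion; CORE_AREA_MM2 is an illustrative guess.
SRAM_MBIT_PER_MM2 = 32             # quoted 5nm density, with control logic
SCRATCHPAD_MB = 2                  # RandomX per-program scratchpad
CORE_AREA_MM2 = 0.5                # assumed area of one stripped-down core

sram_mb_per_mm2 = SRAM_MBIT_PER_MM2 / 8            # 4 MB per mm^2
scratchpad_area = SCRATCHPAD_MB / sram_mb_per_mm2  # mm^2 of SRAM per instance
instance_area = scratchpad_area + CORE_AREA_MM2    # total mm^2 per instance

for die_mm2 in (100, 400, 800):
    print(die_mm2, "mm^2 die ->", int(die_mm2 // instance_area), "instances")
```

Even with a generously small core, the scratchpad SRAM alone is a fixed 0.5 mm^2 per instance at the quoted density, which is why "hundreds of instances" implies a large (and expensive) die.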
08:05:28 Also, in RandomX something like 90% of energy is spent by the CPU
08:05:37 processor-in-memory is pretty well studied https://twitter.com/hyc_symas/status/1171800102471532544
08:05:57 you're talking about flocks of tiny cores, but none of these can do 64-bit arithmetic
08:07:39 it also doesn't make sense to make a core much smaller than the 2 MB SRAM it needs to use
08:07:45 and that is not a small core already
08:08:20 Well, the Epiphany-V does do 64-bit floating point, just not enough cache to run RandomX for every core
08:09:05 I'm surprised Apple didn't use HBM for the M1's on-chip memory
08:09:44 of all companies, they could afford to make such a decision, at this point in history
08:10:03 something like a Cortex-A55 modification + 2 MB SRAM instead of data caches, all that on 5nm, would be optimal
08:10:08 true, logic transistors are cheaper than memory transistors. however you could still strip out a lot of cruft from the typical x86 core and just put in more ALUs
08:10:33 more ALUs doesn't help without more bandwidth
08:10:46 more ALUs is not even needed, just put in more cores instead
08:10:51 Intel x86 has always been bandwidth-starved
08:11:14 that moves the problem to a super-efficient memory controller, and those IPs are very expensive and/or well guarded
08:11:45 and get rid of the cache tagging, snooping and victim cache logic. still pretty significant savings
08:11:47 Unless they do something like distributed on-chip memory + fast interconnect
08:12:52 getting rid of the cache logic would just make software harder to write. it's been done before.
08:13:04 remember the Cell engine
08:13:15 2-3x overall efficiency improvement over Ryzen or Apple M1 is possible
08:13:33 it'll all be limited by how efficient the ALUs they can make are
08:13:35 yea, but if you're just interested in running RandomX, it would be worth it
08:13:42 not by scratchpad/dataset access
08:14:38 Apple M1 is already 5nm though
08:14:49 so maybe only 2x over that
08:15:19 no way will any ASIC get to more advanced processes before the big CPU houses do
08:15:27 yea, but also 80% of the die size is not cores
08:16:17 but Bitmain is constantly ahead on nodes; their current S19/Pro is 5nm, right?
08:16:47 is it?
08:17:09 7nm
08:17:13 cost structure is simplified a lot when you can just produce and use it yourself. no need to market, ship, pay tariffs..
08:17:37 https://www.bitmain.com/news-detail/forge-ahead-with-determination
08:17:43 they are lower-priority customers than Apple
08:17:45 "On 27 February, Bitmain officially announced the Antminer S19 and S19 Pro. Equipped with a custom-built 7nm chip from Bitmain"
08:18:08 And even when Bitmain gets its hands on 5nm, it'll be used for Bitcoin ASICs
08:18:18 fair enough
08:18:47 5nm is already fully booked for more than a year ahead
08:18:54 currently Apple, soon AMD with NVIDIA
08:18:59 agreed, unless monero flips Bitcoin haha
08:19:49 by the time Monero flips Bitcoin, it'll be mined by hundreds of millions of CPUs, many of them with "free" electricity
08:20:03 a few tens of thousands of ASICs would be a drop in the ocean
08:20:23 at worst it'll be like Ethereum, where ASICs exist but mining stays very profitable for everyone
08:21:03 I have a feeling GPU ETH miners will be pushed out soon when gas fees settle down
08:21:42 latest RTX GPUs have very high ETH efficiency, almost the same as ASICs
08:22:45 https://twitter.com/Antminer_main/status/1383698802872184839/photo/1
08:23:41 Bitmain likes to embellish, but if this is ballpark correct, it's still several times more efficient than RTX 3000
08:24:29 3GH/s @ 2556W
08:25:29 I haven't seen official numbers yet
08:26:35 Why would NVIDIA bother to create a mining-only product line?
08:26:54 they know years in advance what the tech landscape will be
08:27:12 to avoid used cards flooding the market, affecting sales of their next gen
08:28:07 that's one of the reasons Turing didn't do that well
08:28:53 1080 Ti and Pascal in general were cheap and available in part because they were good cards, but also due to the mining boom in 2017 and crash in 2018
08:32:00 anyway, i don't want to come off as critical of RandomX, you guys really came up with something cool here, and it seems to be the best ASIC-resistant solution so far.
08:42:31 it's important to keep watching the tech space to see where challenges could arise
08:43:52 but we already do that...
08:44:10 agreed 🙂
08:54:50 if we get a RISC-V JIT compiler, I'd like to do some simulations in QEMU for fun :P
08:55:13 it's in the plans
08:55:23 can QEMU run some RISC-V?
08:55:33 yep
08:57:42 someone promised to send me a RISC-V board: https://github.com/xmrig/xmrig/issues/1924
08:57:53 let's see how it goes, I haven't heard anything yet
08:58:27 RISC-V support is in upstream QEMU now.
08:58:28 https://risc-v-getting-started-guide.readthedocs.io/en/latest/linux-qemu.html
08:59:53 it's also possible to put a Linux-bootable RISC-V on an FPGA
08:59:53 https://github.com/black-parrot/black-parrot
09:00:21 tho it's not optimized for it
16:48:11 Has anyone tried to run a RandomX program through Clang or GCC, see what they find?
16:49:51 Or do conventional implementations already optimize it as best they can?
17:07:27 I see there's code to generate C code in tevador/RandomX
17:12:00 not relevant
17:12:14 ???
17:12:44 For example, a C compiler might be able to do smart register scheduling, or elimination of dead code, or similar, better than a naïve JIT compiler could
17:14:38 Fast enough to be of any use?
17:16:24 moneromooo: well that's my question - has anyone tried it?
17:16:42 and the answer is no.
17:17:06 this is already explained in the design doc.
17:17:17 there isn't enough time for complex optimizations
17:18:05 have you ever timed the startup time of clang? just to do nothing at all but page the executable into memory and start running?
17:18:42 hyc: Right, but if clang is able to find optimizations, then it seems easy-ish to make a tiny "compiler" with only the applicable optimizations too.
17:18:51 lol no
17:19:09 none of the optimizations you mention are easy. they all require multi-pass analysis
17:19:20 not necessarily, of course, but if the optimizations turn out to be simple ones...?
17:19:30 then they're of low impact
17:19:55 dead code elimination, seriously?
17:20:06 that requires complete control-flow analysis of the code, start to finish
17:20:19 by the time you did that, the entire code could have executed in a JIT already
17:20:42 peephole is most likely single-pass.
17:20:45 Does the JIT currently do any optimisations at all, or is it essentially a direct translation?
17:21:00 very little
17:21:16 peephole could do some instruction fusion, yeah
17:21:55 Most uArch front ends do fusion anyway
17:22:12 but that's not single-pass. once you've identified points to optimize, you must go through and relocate all affected address references
17:22:26 right - so it's usually wasted effort
17:22:27 Oh, good point.
17:23:00 remember, I was a gcc maintainer for ~10 years. I've been there and done that, many times.
17:24:48 you will always lose more time in code translation than you will regain in code execution time
17:25:16 AOT optimization is cost-effective when the resulting code is reused a lot.
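[Editor's note: the reuse argument can be put in rough numbers. All figures below are illustrative assumptions, none are measured in the chat; the point is only the shape of the trade-off, namely that a single-use program must recoup the optimizer's cost within one run.]

```python
# Break-even sketch for JIT vs external AOT compilation of a RandomX
# program. All timings are assumed placeholders, chosen only to show
# why single-use code cannot amortize a heavyweight optimizer.
jit_compile_us = 10        # assumed: direct-translation JIT cost per program
program_run_us = 50        # assumed: one execution of the program
aot_compile_us = 50_000    # assumed: spawning clang + optimizing
speedup = 1.10             # assumed: 10% faster code from the optimizer

jit_total = jit_compile_us + program_run_us
aot_total = aot_compile_us + program_run_us / speedup

print("JIT path:", jit_total, "us")
print("AOT path:", round(aot_total, 1), "us")
print("AOT pays off only if the program runs",
      round(aot_compile_us / (program_run_us - program_run_us / speedup)),
      "+ times")
```

Under these (assumed) numbers, the optimized program would need thousands of executions to amortize the compile cost, while each RandomX program runs exactly once.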
17:25:34 a RandomX program for a given nonce only gets used once, then it's discarded forever
17:28:34 yanmaani, hyc: speaking of micro-optimizations, there are some in the xmrig JIT compiler
17:28:42 like removing redundant CFROUND instructions
17:29:08 but they are all single-pass, O(1)-complexity code, otherwise it's too slow for JIT
17:29:17 right
17:30:02 removing CFROUND gave only a 0.05% speedup
17:30:15 because not many programs have more than 1 CFROUND without FP instructions between them
17:30:26 yeah, talk about edge cases
17:32:11 we don't have a NOP opcode, do we
17:32:53 we do, but it has frequency 0
17:33:15 even removing CFROUND is done by overwriting it with NOPs, because moving generated code and fixing offsets is too slow
17:33:42 JIT compiler time constraints are very tight
17:34:24 yeah, as I'd expect
17:35:08 of course if there's some peephole optimization that gives +10% in the main loop, the JIT can spend a lot of time to do it. But I'm not aware of any such optimization, and I spent 17 months tinkering with RandomX :D
17:35:18 actually more than 17 months, 2 years already
17:37:12 So what speedup does a modern compiler get if you allow it to optimize the main loop for 2 years straight on a modern supercomputer?
17:37:56 I haven't seen a compiler that can beat asm optimization by hand, given the programmer is experienced enough
17:38:12 and I can't see many places in the generated code that can be optimized
17:38:42 what would hand-optimizing the asm get you then?
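[Editor's note: the NOP-overwrite trick described above can be sketched in a few lines. This is an illustrative Python model, not the real xmrig code: instruction names and the program representation are simplified stand-ins. The rule it implements is the one stated in the chat: a CFROUND is dead if another CFROUND follows with no floating-point instruction in between, and it is "removed" by overwriting it with NOP so no jump offsets need fixing.]

```python
# Single-pass, in-place elimination of redundant CFROUND instructions.
# FP_OPS is a simplified stand-in for the RandomX FP instruction set.
FP_OPS = {"FADD_R", "FSUB_R", "FMUL_R", "FDIV_M", "FSQRT_R"}

def kill_redundant_cfround(program):
    """One pass over the program: NOP out any CFROUND that is
    superseded by a later CFROUND before any FP op uses its mode."""
    last_cfround = None              # index of the most recent unused CFROUND
    for i, op in enumerate(program):
        if op == "CFROUND":
            if last_cfround is not None:
                program[last_cfround] = "NOP"   # overwrite, don't relocate
            last_cfround = i
        elif op in FP_OPS:
            last_cfround = None      # that CFROUND's rounding mode was used

prog = ["CFROUND", "IADD_RS", "CFROUND", "FMUL_R", "CFROUND"]
kill_redundant_cfround(prog)
print(prog)
```

Only the first CFROUND is NOPed out here: no FP instruction reads its rounding mode before the next CFROUND, while the second one is kept because FMUL_R consumes it. Overwriting in place keeps the pass O(n) with no offset fixups, matching the time budget described above.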
17:39:05 Maybe +1-2% if you can spot places where instructions cancel each other
17:39:55 RandomX programs have a lot of branch instructions, which makes them a "train" of short pieces of code divided by branches
17:40:01 not much can be done in each piece
17:40:49 this is all valid for modern superscalar out-of-order CPUs
17:41:13 there are a lot of things that can be done for in-order CPUs like the Cortex-A53
17:41:59 the RandomX JIT doesn't reorder instructions at all and relies on the CPU to do that
17:42:27 sech1: Does the optimization for the A53 improve things, or just bring it back to the level of an OoO CPU?
17:42:36 so I estimate it's possible to get +50% by hand-optimizing generated code for the A53
17:43:41 it doesn't matter much because 5 h/s vs 7.5 h/s is laughable anyway
17:44:56 aren't A53s much cheaper though?
17:48:32 not enough cheaper to be worth that
17:48:46 https://en.wikichip.org/wiki/arm_holdings/microarchitectures/cortex-x1 looks much more interesting
17:48:59 up to 8 MB cache per cluster, so a cluster of 4 cores + 8 MB cache would be perfect
17:49:08 So in theory the Apple M1 should do an amazing job. If it wasn't on macOS >.>
17:49:11 plus a wide out-of-order design
17:49:50 Cortex-X1 and Apple M1 are the most efficient per watt on RandomX right now
17:49:56 mostly thanks to the 5nm node they use
17:50:10 does anyone sell boards with like a trillion ARM processors on them
17:50:11 Extremely deep reorder buffer plus very wide decode stage
17:50:14 kind of like Xeon Phi but Chinese
17:51:25 Mmm, the Japanese industry always makes interesting stuff
17:51:52 the closest would be the Phytium FT-2000 64-core chip
17:52:29 The A64FX comes to mind, but it's very expensive. Low-volume part, only for their Fugaku supercomputer
17:53:27 128 cores in the MT2000+ https://appsbuilders.org/news/tsmc-to-stop-release-of-arm-processors-phytium-the-fate-of-the-chinese-exascale-supercomputer-tianhe-3-in-question-servernews/
17:53:43 but production is canceled now due to US trade sanctions
17:55:30 didn't realize their quad-core chip had gone into production https://www.tomshardware.com/news/arm-phytium-ft-2000-cpu-chinese-gaming-pc
17:55:55 There's also the Xilinx Versal ACAP, with 400 VLIW cores + FPGA, but it suffers from only 100Mb of cache for the VLIW cores
17:56:53 191Mb if you count the FPGA block RAMs, but those are a few clock cycles away
17:57:19 FPGAs won't cut it
17:58:01 I know, but these are hardened cores
17:58:29 we've discussed Xilinx Versal in here before
17:58:45 the cores aren't adequate, even if you interfaced to enough external RAM
18:03:46 as monero gets bigger, would it eventually spur cheap bulk CPUs specifically optimized for randomx?
18:04:04 maybe.
18:05:32 I suppose if you treat the majority of cores as co-processors instead of peer CPUs, the control network would be simpler
18:05:59 that's still the major limiting factor in putting more cores on a chip
18:06:40 these cores don't need to run any OS, just point them at a chunk of dataset RAM and a chunk of randomx code and let them run
18:07:05 so they don't need a lot of the multi-processor comms support that would usually be needed
18:07:27 i guess a many-core CPU designed for randomx could also be binned extremely well, as really no individual core is critical
18:08:43 memory controllers also don't need to feed cores on the other side of the chip, just ones in their local area, as long as there is at least 2GB per memory controller
18:09:48 we will probably up the RAM requirement to 4GB within a year or two
18:09:57 ah ok
18:10:33 seems reasonable, is there any reason not to go 8GB?
18:10:59 I wouldn't do it yet. 4GB is still feasible for a lot of smartphones. 8GB is pushing it.
18:11:17 8GB is also infeasible for most computers running an OS
18:11:51 true
18:12:21 does anyone really still run 8GB with a CPU fast enough to be profitable for mining?
18:13:16 aside from the base model MacBook >.>
18:14:03 I ordered my M1 MacBook with 16GB RAM, but the base model is only 8GB
18:14:14 and just booting up to the desktop, 11GB of RAM is in use
18:14:34 I suppose a lot of that is swapped out on an 8GB machine
18:14:40 would not be a pleasant experience
18:16:13 fair enough, if you were actively using the computer while mining it'd be pretty terrible, even for a 16GB machine
21:17:24 could you make a randomx coprocessor that the primary CPU could use effectively?
21:17:52 like, where else in contemporary computing do you have a series of random functions that need to be executed as fast as possible?
21:18:44 i guess if you used machine learning to optimize some signal-to-output relationship
21:19:22 but is that what you were talking about in different words?
21:20:08 probably a torus
21:37:27 gingeropolous: isn't a "randomx coprocessor" just a processor?
21:41:38 well, like y'all were saying, you rip out all the stuff that ends up useless. like the branch prediction. i remember seeing images of CPUs and the branch predictors eat so much die
21:44:16 branch prediction is sorta needed though?
21:46:52 yes
21:47:28 the stuff I would rip out is everything that supports an OS - interrupt handlers, permission levels
21:48:29 does that even use a lot of die space?
21:49:02 if you look at all of the security bugs Intel has had to deal with in the past year or two
21:49:09 spectre etc... yes.
21:49:49 I'm sort of wondering why supercomputers don't do that. I mean, they don't have to bother with permissions, right?
21:50:03 the reason those bugs existed is because Intel tried to cut corners where they shouldn't have
21:51:00 supercomputers still tend to run multiuser OSs, so they still have to deal with that
21:51:47 Crays ran Unix. modern supercomputers run Linux
21:52:26 hyc: sure, cause it's a lower cost
21:53:03 develop new OS + develop interrupt-less CPU < buy normal CPUs and use normal Linux
21:56:29 cause I think Linux can run on e.g. MMU-less systems, so it seems like cutting protection would be a no-brainer if it had significant savings
21:57:07 hm. it's been a long time since I tried to run Linux on a machine with no MMU, dunno if that's still supported
21:57:31 it's supported by Linux, but not by glibc
21:57:46 so you have to run uClibc, and you have certain limitations on mmap etc.
21:57:47 ah
21:57:51 cool
21:58:05 glibc is such a pig
21:58:35 the other day I wrote a little toy that was 300 bytes long. 16K when linked with glibc
21:59:22 I rewrote it to use syscall() instead of the usual library calls, got it back down to 500 bytes
23:25:28 why not just use musl?