03:38:46 which MSR for opteron?
03:46:26 I see servethehome already has some randomx discussion https://forums.servethehome.com/index.php?threads/monero-changed-to-the-randomx-algo.26589/
03:47:01 gingeropolous: who knows if the opteron even had an aggressive prefetcher to disable in the first place
03:47:12 AMD tended to lag behind on that compared to Intel
03:48:12 in either case the prefetchers are designed to recognize repetitive access patterns. a simple prefetcher may only recognize purely sequential accesses.
03:48:25 if it doesn't recognize the access pattern, it goes inactive
03:49:15 for a highly random access pattern like randomX, it ought to deactivate itself
03:50:43 I guess Bulldozer generation was improved https://www.realworldtech.com/bulldozer/9/
03:53:45 oh here you go https://www.amd.com/en/support/tech-docs
03:53:54 Volume 2: System Programming, Appendix A
03:53:59 has listing of all MSRs
03:57:07 well, not all. nothing about prefetchers
04:00:17 Maybe you'll find it in the BIOS and Kernel Developer's Guide for your processor family
05:04:12 this bios is useless.
05:04:32 i'll be happy with 2.5 kh/s per 10 year old processor
05:34:03 total miners 28316
06:23:27 oh look. MoneroV lives! In a Frankenstein-meets-living-dead sort of way. And guess what, they are moving to RandomV - 'a derivative' of RandomX - https://monerov.online/MoneroV-Swap.pdf
06:24:18 RandomV = RandomX, I checked their changes
06:24:45 They only changed the ArgonSalt parameter
06:25:19 That would be my assumption. You get a notification when it gets forked on github?
06:25:52 No, I found them on twitter when I looked for the RandomX hashtag
06:26:26 Lol
06:27:10 I'm still waiting for them to blow me away with their mimblewimble implementation
06:36:42 Hmm, people are reporting both increased hashrate _and_ lower power after the MSR mod, nice!
09:44:53 I wonder if Arweave will work your optimizations in
10:42:01 lol, it is seb heading up MoneroV Reloaded
10:54:07 who is seb?
10:54:17 Antony is the most active TG user in their group
11:00:24 a shady multi-shitcoin pool operator
11:03:25 I'm torn. I kind of feel pity for people having put money into XMV, but on the other hand it is such an obvious scam that I also kind of don't ...
17:03:00 ugh. how do we do that power calculation for nethash
17:04:00 850 million hashes per second * 50 watts per hash per second / 60 minutes = MWh?
17:07:08 50W per hash is insanely bad
17:07:35 my 3700x is around 100h/w
17:07:58 7700h/s @ 77W total package draw
17:07:59 stock Ryzen is 50 H/J, I wouldn't say insanely bad
17:08:18 please read again.
17:08:30 50W per hash, not 50 hash per W
17:09:21 yeah, gingeropolous now you know the problem with your formula
17:09:40 fort3hlulz: package power is a useless metric
17:10:05 What's a more useful metric? W at the wall?
17:10:09 Haven't captured that yet
17:10:11 yes
17:11:02 I've just been looking at tuning CPU/RAM variables to up efficiency (but haven't implemented any changes yet)
17:11:14 So just using package power as a guide to efficiency of the CPU itself
17:13:35 ok, so 850e6 hashes/second * 1 joule / 50 hashes = 17 megajoules/sec = 17 MW
17:14:24 yes, sounds about right
17:15:49 was thinking earlier today, if we wrap randomx-benchmark with perf counter monitoring, it would be quite an excellent CPU+memory benchmark
17:16:17 e.g. a simple wrapper script that runs a couple times while monitoring cache hit rates, branch rates, etc.
17:16:51 should be simple with linux `perf`
17:17:50 and send shares to the dev fund? :P
17:18:06 heh
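A minimal sketch of the wrapper idea above: re-run the RandomX benchmark a few times under `perf stat`, collecting cache and branch counters. The benchmark path and its flags are assumptions here, adjust them for the actual build.

```c
/* Hypothetical wrapper: runs the RandomX benchmark several times under
 * `perf stat`, collecting cache and branch counters.  The binary name
 * "./randomx-benchmark" and its flags are assumptions, not the exact
 * names from the RandomX repo. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const int runs = 3;
    for (int i = 0; i < runs; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            execlp("perf", "perf", "stat",
                   "-e", "cache-references,cache-misses,branches,branch-misses",
                   "./randomx-benchmark", "--mine", "--threads", "4",
                   (char *)NULL);
            perror("execlp");   /* only reached if perf is not installed */
            _exit(1);
        }
        int status;
        waitpid(pid, &status, 0);
        printf("run %d finished with status %d\n", i + 1, WEXITSTATUS(status));
    }
    return 0;
}
```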
17:22:27 17.9 kh/s on 3950X https://www.reddit.com/r/MoneroMining/comments/e9tuvd/randomx_boost_guide_for_ryzen_on_windows_9100_hs/fao4fyb/
17:22:52 So it's almost 9000 h/s per memory channel
17:24:09 no mention of memory timings?
17:25:21 https://imgur.com/a/yDwlNvn
17:25:37 from his earlier posts
17:25:49 Ryzen 3950x @4GHz with 32GB memory @ 3600MHz timings 15-14-14-14-32-2T (as tight as I could get)
17:26:10 and he was at 15.45 kh/s when he posted it 12 days ago
17:26:26 4.4GHz CPU sheesh
17:26:39 y'think he could sustain that for over 24 hours?
17:26:44 +15.8% on his system (same 4 GHz clock)
17:26:48 no, he runs at 4 GHz now
17:27:05 ah ok
17:27:27 and the most important thing: power?
17:28:48 it shouldn't be a lot at 4 GHz
17:29:46 he posted https://imgur.com/a/9nHQDkO yesterday - it shows PPT "61% of 200W", so package power 122W
17:30:03 but it was 3.875 GHz
17:30:40 hello, I have been lurking the channel using monerologs.net but I'm the guy who posted to reddit with the nickname mmrdx
17:32:26 have you got measurements of power at the wall?
17:32:27 my memory timings have been improved from the previous post to 14-14-14-14-28, the single rank b-die 3200 fast preset from dram calculator but applied to dual rank 3600
17:32:32 hello mortti
17:32:56 dual rank 3600 @ 14-14-14-28? This is insane
17:33:02 very nice memory timings indeed
17:33:10 yeah, it took 1.45v to get it stable
17:33:50 Next thing is to lurk SuperPi benchers' forums and get an idea how they run 4000 @ 12-12-12-... :D
17:34:26 4000 @ 12 now *that* sounds insane
17:34:31 I have a lot of stuff drawing power on this workstation but at 3.85GHz I measured 270W at the socket
17:36:17 puts it close to around 60H/W. sacrificed a lot of efficiency for that rate
17:36:58 there are two Radeon Vegas, 8 fans, a water cooling pump, etc. attached to this rig
17:37:21 but I agree, efficiency is best at 3.6GHz or so
17:43:32 hyc https://www.techpowerup.com/forums/threads/share-your-aida-64-cache-and-memory-benchmark-here.186338/page-25#post-3515562
17:43:46 3750 MHz 12-11-11-28 1.8V
17:45:03 I was also able to squeeze 5200H/s from a sandy bridge xeon 1680v2 @4GHz with ddr3 :)
17:46:10 total miners 28601
17:49:13 15 kh/s at < 200W: https://bitcointalk.org/index.php?topic=5203616.msg53337979#msg53337979
19:16:09 What is the source of these numbers in RandomX/README.md: "* DDR4 memory is limited to about 4000-6000 H/s per channel"? How can I reproduce it?
19:20:38 These are old numbers. It's over 9000 h/s per channel now.
19:22:13 The question is about the source and how to reproduce them.
19:22:47 I suppose it wasn't just measurement in real life experiments.
19:24:43 https://i.imgur.com/DFeEqbP.jpg
19:25:02 https://www.reddit.com/r/MoneroMining/comments/e7dm1p/amd_3950x_final_results_of_hashrate_after/fagozzk/
19:47:24 I saw it somewhere. Finally, found it in RandomX/doc/design.md.
19:49:29 The formula is: 1 second / (DRAM single bank read latency, 40 nanoseconds) / (16384 dataset reads per hash) * (8 banks per RAM module) = 1 / 40e-9 / 16384 * 8 ~= 12207 H/s
19:50:25 So it isn't 6000 even theoretically.
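Written out, the bound being computed here is the one below; note the 8-banks-per-module factor is exactly what gets challenged in the next messages, so treat the 12207 figure as an estimate under that assumption.

```latex
\text{H/s per channel} \approx \frac{N_{\text{banks}}}{t_{\text{read}} \times 16384}
                             = \frac{8}{40\,\text{ns} \times 16384} \approx 12207
```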
19:51:35 cohcho: you are wrong about 8 banks per module
19:51:54 Let me check the docs for real-world RAM again
19:52:15 it's bank GROUPS, not banks
19:52:28 DDR3 has only 1 bank group per channel
19:52:32 DDR4 has 2-4
19:54:57 it doesn't matter how many banks the group has, because each group shares the same control circuitry, so there can be only 1 pending read per group
19:56:40 "The DDR4 architecture is an 8n prefetch with two or four selectable bank groups. This design will permit the DDR4 memory devices to have separate activation, read, write or refresh operations underway in each unique bank group."
20:17:40 I still need some proof that this fundamental access latency limit of 50ns is unbreakable for real applications, rereading this paper: http://prof.icc.skku.ac.kr/~jaewlee/pubs/isca13_charm.pdf
20:23:15 as you can see, it's not a hard limit, it can be improved with tighter memory timings
20:23:21 but there IS a limit
20:27:03 Intel CPUs can have DRAM latency as low as 40 ns
21:10:09 One thing keeps me wondering... With the MSR mod everyone gets a 5% or more speed increase AND a power reduction of ~5 watts
21:10:33 So what's the actual power spent on calculations and what's the power spent on memory access/unrelated stuff?
21:11:32 Did you unset the "NSA_BACKGROUND_MONITORING" bit ? :)
21:12:09 maybe, all changed bits are undocumented for Ryzens :D
21:13:17 or was it just hardware prefetchers that worked all the time, consuming power to reduce hashrate :D
21:13:26 IIRC I measured something like 10+ watts just from memory accesses
21:13:39 you can benchmark it by removing dataset accesses
21:14:03 now the next thing is to find the MSR to turn off branch prediction, right?
21:14:18 I'm sure they keep it hidden somewhere...
21:14:30 it's not necessary to disable it
21:14:55 it would make sense only if we had many 50/50 branches
21:15:55 anyways, hopefully the next gen CPUs will have a "RandomX" MSR
21:16:01 :P
21:16:23 Well, BIOS manufacturers can add it to their "Performance bias" options list
21:16:38 They have it for Geekbench/Cinebench
21:16:53 and they have NDA docs to optimize it further
21:17:05 I can already see "mining BIOS" beta versions
21:18:08 Hmm. That'd be an idea. Ask them "what is the set of MSR values that maximizes hash/watt". They don't have to document stuff, and their CPUs look better. Though I suppose the values could be pretty exact-cpu-version dependent.
21:18:56 as far as I can see, AMD tries to keep their MSR internals consistent within the same CPU family
21:19:07 the description of the prefetchers says it tries to detect strided access patterns
21:19:13 the big change was between the Bulldozer and Ryzen families
21:19:29 would assume when it detects no such pattern, it does nothing.
21:20:01 a working prefetcher consumes power and triggers sometimes, doing unneeded memory requests
21:20:03 but if toggling the MSR affects it so much, that implies either that it's seeing false access patterns in RandomX's accesses, or it's prefetching all the time even when it doesn't know what to prefetch
21:20:08 so lower hashrate and higher power
21:20:25 seems like quite a high power cost.
21:20:45 for what is essentially just maintaining a list of addresses
21:20:48 address offsets.
21:23:04 I think the prefetchers are pretty dumb, they probably load at least one cache line directly after the one read from the dataset
21:23:39 it's even called "stream prefetcher"
21:23:58 hm yeah that'd make more sense
21:24:55 that would have the effect of doubling the bandwidth usage and slightly increasing the latency, which could account for the 5% perf and 5W power diff
21:25:00 Overall it makes sense. I have the analogous setting in LMDB to disable OS prefetching when datasets are larger than RAM
21:26:34 by default, when the kernel is forced to read a page of data, it prefetches 16 pages at a time. if RAM is full, the extra pages may evict useful data from cache and bring in useless data.
21:27:13 so as a rule, prefetch is a good idea for sequential access patterns, and horrible otherwise.
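For reference, this kind of MSR poking is usually done through the Linux msr driver, the same mechanism msr-tools' rdmsr/wrmsr use. A minimal sketch, with the register address left as a placeholder since finding the right register and bit per CPU family is the whole point of the docs hunt above:

```c
/* Sketch of reading (and optionally writing) one MSR via /dev/cpu/N/msr.
 * PREFETCH_MSR is a placeholder: the real register and bit layout are
 * family-specific and have to come from AMD's docs or the undocumented
 * values the miner patches use.  Requires the msr kernel module and root. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PREFETCH_MSR 0x0 /* placeholder: look up the real address per family */

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t val;
    /* the msr driver uses the file offset as the register number */
    if (pread(fd, &val, sizeof val, PREFETCH_MSR) != sizeof val) {
        perror("rdmsr"); return 1;
    }
    printf("MSR 0x%x = 0x%016llx\n", PREFETCH_MSR, (unsigned long long)val);

    /* flipping a bit would look like this; which bit is the open question,
     * so it is left commented out and symbolic here:
     * val |= 1ULL << PREFETCH_DISABLE_BIT;
     * pwrite(fd, &val, sizeof val, PREFETCH_MSR);
     */

    close(fd);
    return 0;
}
```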
21:28:16 Is it possible to write the randomx dataset to a filesystem file and open a read-only shared mmap restricted to 2GiB?
21:28:27 for devices that don't have 2GiB + 80MiB of free RAM?
21:28:42 possible, yes. performance will suck.
21:29:10 I need exact numbers compared to cache-only
21:29:18 It should be better, but how much
21:29:21 I don't know
21:29:35 you should be able to find a suitable device for testing on
21:29:58 page faults will take at least 30ms for a typical HDD
21:29:59 How to limit mmap size in linux?
21:30:42 you don't need to limit mmap size, you're going to mmap exactly the amount of RAM you would normally require.
21:31:09 the trick is to simply eat up the rest of RAM until you've shrunk free RAM down to the desired size.
21:31:18 I asked because the process gets killed after the first try to initialize the dataset.
21:31:29 you can do this just by writing a program that mmaps the rest of RAM privately, and mlocks it all
21:31:56 do you have any swap space configured?
21:32:03 No
21:32:13 note that adding sufficient swap should give you the same performance as a read-only mmap'd file in this use case
21:32:24 With the C++ based implementation of randomx it is hard to free RAM for anything except the dataset
21:32:47 I agree with you about not using c++
21:32:51 you don't need to alter the randomx code at all
21:33:19 anyway, I have tested on my ARM64 boxes with only 2GB of RAM, and swap enabled on a microSD card.
21:33:24 it is quite painfully slow.
21:33:36 There is no way to set up swap
21:34:23 yeah, you will get about 0.01 H/s
21:34:41 I think gingeropolous tested it before
21:35:08 it's much faster to use the 256 MB mode
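A rough sketch of the RAM-balloon program described above: mmap anonymous memory and mlock it until only the desired amount of free RAM remains, so the miner's dataset spills to swap or a file-backed mapping. The 1 GiB target is only an example; locking this much memory needs root or a raised RLIMIT_MEMLOCK.

```c
/* Sketch: pin ("balloon") anonymous memory until only `leave_free` bytes
 * of RAM remain, then sleep.  Run it alongside the miner to force part of
 * the dataset out of RAM.  Error handling is minimal and the numbers are
 * only illustrative. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/sysinfo.h>
#include <unistd.h>

int main(void)
{
    const unsigned long long leave_free = 1ULL << 30; /* e.g. leave 1 GiB free */

    struct sysinfo si;
    if (sysinfo(&si) != 0) { perror("sysinfo"); return 1; }

    unsigned long long free_bytes = (unsigned long long)si.freeram * si.mem_unit;
    if (free_bytes <= leave_free) { puts("nothing to do"); return 0; }

    size_t balloon = free_bytes - leave_free;
    void *p = mmap(NULL, balloon, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* mlock faults the pages in and keeps them resident, shrinking what is
     * left over for everything else on the machine */
    if (mlock(p, balloon) != 0) { perror("mlock"); return 1; }

    printf("locked %zu MiB, sleeping\n", balloon >> 20);
    pause(); /* keep holding the memory until killed */
    return 0;
}
```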
21:35:12 Another implementation is required in order to have a hybrid dataset: partially in-memory, partially recomputed on the fly
21:35:24 But it requires some time to implement
21:35:58 Yeah I guess that could be worthwhile. For 50% of the dataset, rest using the 256MB cache.
21:36:09 for the accesses that land in the dataset memory, you win.
21:36:09 the benefit of this will be smaller than you think
21:36:25 50% dataset will be not even 2 times faster than the light mode
21:36:27 50% dataset still means a 4x slowdown
21:37:51 replace 50% with a configurable parameter that should be set at runtime based on free RAM
21:37:57 It will in practice be close to 2GiB
21:37:59 not 50%
21:38:11 a 4x slowdown is still faster than light mode's 6x slowdown
21:44:34 I've tested single core performance with the MSR mod and got 1287 h/s, so 10296 h/s is possible without the memory bottleneck
21:45:03 which means the real-life 9670 h/s is still 6% slower than it can do in theory
21:51:01 I don't like the fact that these upper-bound limits are changing faster than I expect
21:54:29 why does it matter to you?
21:54:37 it's only a theoretical limit, I don't think everything will scale linearly to all cores
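For what it's worth, a rough sketch of the hybrid dataset read path debated above. The helper names are hypothetical stand-ins, not the actual RandomX API, and the in-RAM fraction would be a runtime parameter chosen from available free RAM rather than a fixed 50%.

```c
/* Hypothetical hybrid dataset: the first `stored_items` entries live in RAM,
 * everything past that is recomputed from the 256 MB cache, light-mode style.
 * dataset_base and compute_item_from_cache() are stand-ins, not real RandomX
 * identifiers. */
#include <stdint.h>
#include <string.h>

#define ITEM_SIZE 64  /* one dataset item is a 64-byte cache line */

extern const uint8_t *dataset_base;  /* partially populated dataset in RAM */
extern uint64_t stored_items;        /* how many items actually fit in RAM */
extern void compute_item_from_cache(uint64_t item, uint8_t out[ITEM_SIZE]);

void read_dataset_item(uint64_t item, uint8_t out[ITEM_SIZE])
{
    if (item < stored_items) {
        /* fast path: plain memory read, same cost as the full ("fast") mode */
        memcpy(out, dataset_base + item * ITEM_SIZE, ITEM_SIZE);
    } else {
        /* slow path: recompute the item from the cache, as light mode does */
        compute_item_from_cache(item, out);
    }
}
```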