03:38:46 which MSR for opteron?
03:46:26 I see servethehome already has some randomx discussion https://forums.servethehome.com/index.php?threads/monero-changed-to-the-randomx-algo.26589/
03:47:01 gingeropolous: who knows if the opteron even had an aggressive prefetcher to disable in the first place
03:47:12 AMD tended to lag behind on that compared to Intel
03:48:12 in either case the prefetchers are designed to recognize repetitive access patterns. a simple prefetcher may only recognize purely sequential accesses.
03:48:25 if it doesn't recognize the access pattern, it goes inactive
03:49:15 for a highly random access pattern like randomX, it ought to deactivate itself
03:50:43 I guess Bulldozer generation was improved https://www.realworldtech.com/bulldozer/9/
03:53:45 oh here you go https://www.amd.com/en/support/tech-docs
03:53:54 Volume 2: System Programming, Appendix A
03:53:59 has listing of all MSRs
03:57:07 well, not all. nothing about prefetchers
04:00:17 Maybe you'll find it in the BIOS and Kernel Developer's Guide for your processor family
05:04:12 this bios is useless.
05:04:32 i'll be happy with 2.5 kh/s per 10 year old processor
05:34:03 total miners 28316
06:23:27 oh look. MoneroV lives! In a Frankenstein-meets-living-dead sort of way. And guess what, they are moving to RandomV - 'a derivative' of RandomX - https://monerov.online/MoneroV-Swap.pdf
06:24:18 RandomV = RandomX, I checked their changes
06:24:45 They only changed the ArgonSalt parameter
06:25:19 That would be my assumption. You get a notification when it gets forked on github?
06:25:52 No, I found them on twitter when I looked for the RandomX hashtag
06:26:26 Lol
06:27:10 I'm still waiting for them to blow me away with their mimblewimble implementation
06:36:42 Hmm, people are reporting both increased hashrate _and_ lower power after the MSR mod, nice!
09:44:53 I wonder if Arweave will work your optimizations in
10:42:01 lol, it is seb heading up MoneroV Reloaded
10:54:07 who is seb?
10:54:17 Antony is the most active TG user in their group
11:00:24 a shady multi-shitcoin pool operator
11:03:25 I'm torn. I kind of feel pity for people having put money into XMV, but on the other hand it is such an obvious scam that I also kind of don't ...
17:03:00 ugh. how do we do that power calculation for nethash
17:04:00 850 million hashes per second * 50 watts per hash per second / 60 minutes = MWh?
17:07:08 50W per hash is insanely bad
17:07:35 my 3700x is around 100h/w
17:07:58 7700h/s @ 77W total package draw
17:07:59 stock Ryzen is 50 H/J, I wouldn't say insanely bad
17:08:18 please read again.
17:08:30 50W per hash, not 50 hash per W
17:09:21 yeah, gingeropolous now you know the problem with your formula
17:09:40 fort3hlulz: package power is a useless metric
17:10:05 What's a more useful metric? W at the wall?
17:10:09 Haven't captured that yet
17:10:11 yes
17:11:02 I've just been looking at tuning CPU/RAM variables to up efficiency (but haven't implemented any changes yet)
17:11:14 So just using package power as a guide to efficiency of the CPU itself
17:13:35 ok, so 850e6 hashes/second * 1 joule / 50 hashes = 17 megajoules/sec = 17 MW
17:14:24 yes, sounds about right
17:15:49 was thinking earlier today, if we wrap randomx-benchmark with perf counter monitoring, it would be quite an excellent CPU+memory benchmark
17:16:17 e.g. a simple wrapper script that runs a couple times while monitoring cache hit rates, branch rates, etc.
17:16:51 should be simple with linux `perf`
17:17:50 and send shares to the dev fund? :P
17:18:06 heh
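A minimal sketch of the wrapper idea above: re-run the RandomX benchmark a few times under `perf stat`, collecting cache and branch counters. The benchmark path and its flags are assumptions here, adjust them for the actual build.

```c
/* Hypothetical wrapper: runs the RandomX benchmark several times under
 * `perf stat`, collecting cache and branch counters.  The binary name
 * "./randomx-benchmark" and its flags are assumptions, not the exact
 * names from the RandomX repo. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const int runs = 3;
    for (int i = 0; i < runs; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            execlp("perf", "perf", "stat",
                   "-e", "cache-references,cache-misses,branches,branch-misses",
                   "./randomx-benchmark", "--mine", "--threads", "4",
                   (char *)NULL);
            perror("execlp");   /* only reached if perf is not installed */
            _exit(1);
        }
        int status;
        waitpid(pid, &status, 0);
        printf("run %d finished with status %d\n", i + 1, WEXITSTATUS(status));
    }
    return 0;
}
```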
17:22:27 17.9 kh/s on 3950X https://www.reddit.com/r/MoneroMining/comments/e9tuvd/randomx_boost_guide_for_ryzen_on_windows_9100_hs/fao4fyb/
17:22:52 So it's almost 9000 h/s per memory channel
17:24:09 no mention of memory timings?
17:25:21 https://imgur.com/a/yDwlNvn
17:25:37 from his earlier posts
17:25:49 Ryzen 3950x @4GHz with 32GB memory @ 3600MHz timings 15-14-14-14-32-2T (as tight as I could get)
17:26:10 and he was at 15.45 kh/s when he posted it 12 days ago
17:26:26 4.4GHz CPU sheesh
17:26:39 y'think he could sustain that for over 24 hours?
17:26:44 +15.8% on his system (same 4 GHz clock)
17:26:48 no, he runs at 4 GHz now
17:27:05 ah ok
17:27:27 and the most important thing: power?
17:28:48 it shouldn't be a lot at 4 GHz
17:29:46 he posted https://imgur.com/a/9nHQDkO yesterday - it shows PPT "61% of 200W", so package power 122W
17:30:03 but it was 3.875 GHz
17:30:40 hello, I have been lurking the channel using monerologs.net but I'm the guy who posted to reddit with the nickname mmrdx
17:32:26 have you got measurements of power at the wall?
17:32:27 my memory timings have been improved from the previous post to 14-14-14-14-28, the single rank b-die 3200 fast preset from dram calculator but applied to dual rank 3600
17:32:32 hello mortti
17:32:56 dual rank 3600 @ 14-14-14-28? This is insane
17:33:02 very nice memory timings indeed
17:33:10 yeah, it took 1.45v to get it stable
17:33:50 Next thing is to lurk SuperPi benchers' forums and get an idea how they run 4000 @ 12-12-12-... :D
17:34:26 4000 @ 12 now *that* sounds insane
17:34:31 I have a lot of stuff drawing power on this workstation but at 3.85GHz I measured 270W at the socket
17:36:17 puts it close to around 60H/W. sacrificed a lot of efficiency for that rate
17:36:58 there are two Radeon Vegas, 8 fans, a water cooling pump, etc. attached to this rig
17:37:21 but I agree, efficiency is best at 3.6GHz or so
17:43:32 hyc https://www.techpowerup.com/forums/threads/share-your-aida-64-cache-and-memory-benchmark-here.186338/page-25#post-3515562
17:43:46 3750 MHz 12-11-11-28 1.8V
17:45:03 I was also able to squeeze 5200H/s from a sandy bridge xeon 1680v2 @4GHz with ddr3 :)
17:46:10 total miners 28601
17:49:13 15 kh/s at < 200W: https://bitcointalk.org/index.php?topic=5203616.msg53337979#msg53337979
19:16:09 What is the source of these numbers in RandomX/README.md: "* DDR4 memory is limited to about 4000-6000 H/s per channel"? How can I reproduce it?
19:20:38 These are old numbers. It's over 9000 h/s per channel now.
19:22:13 The question is about the source and how to reproduce them.
19:22:47 I suppose it wasn't just measurement in real life experiments.
19:24:43 https://i.imgur.com/DFeEqbP.jpg
19:25:02 https://www.reddit.com/r/MoneroMining/comments/e7dm1p/amd_3950x_final_results_of_hashrate_after/fagozzk/
19:47:24 I saw it somewhere. Finally, found it in RandomX/doc/design.md.
19:49:29 The formula is: 1 second / (DRAM single bank read latency, 40 nanoseconds) / (16384 dataset reads per hash) * (8 banks per RAM module) = 1 / 40e-9 / 16384 * 8 ~= 12207 H/s
19:50:25 So it isn't 6000 even theoretically.
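Written out, the bound being computed here is the one below; note the 8-banks-per-module factor is exactly what gets challenged in the next messages, so treat the 12207 figure as an estimate under that assumption.

```latex
\text{H/s per channel} \approx \frac{N_{\text{banks}}}{t_{\text{read}} \times 16384}
                             = \frac{8}{40\,\text{ns} \times 16384} \approx 12207
```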
19:51:35 cohcho: you are wrong about 8 banks per module
19:51:54 Let me check the docs for real-world RAM again
19:52:15 it's bank GROUPS, not banks
19:52:28 DDR3 has only 1 bank group per channel
19:52:32 DDR4 has 2-4
19:54:57 it doesn't matter how many banks the group has, because each group shares the same control circuitry, so there can be only 1 pending read per group
19:56:40 "The DDR4 architecture is an 8n prefetch with two or four selectable bank groups. This design will permit the DDR4 memory devices to have separate activation, read, write or refresh operations underway in each unique bank group."
20:17:40 I still need some proof that this fundamental access latency limit of 50ns is unbreakable for real applications, rereading this paper: http://prof.icc.skku.ac.kr/~jaewlee/pubs/isca13_charm.pdf
20:23:15 as you can see, it's not a hard limit, it can be improved with tighter memory timings
20:23:21 but there IS a limit
20:27:03 Intel CPUs can have DRAM latency as low as 40 ns
21:10:09 One thing keeps me wondering... With the MSR mod everyone gets a 5% or more speed increase AND a power reduction of ~5 watts
21:10:33 So what's the actual power spent on calculations and what's the power spent on memory access/unrelated stuff?
21:11:32 Did you unset the "NSA_BACKGROUND_MONITORING" bit ? :)
21:12:09 maybe, all changed bits are undocumented for Ryzens :D
21:13:17 or was it just hardware prefetchers that worked all the time, consuming power to reduce hashrate :D
21:13:26 IIRC I measured something like 10+ watts just from memory accesses
21:13:39 you can benchmark it by removing dataset accesses
21:14:03 now the next thing is to find the MSR to turn off branch prediction, right?
21:14:18 I'm sure they keep it hidden somewhere...
21:14:30 it's not necessary to disable it
21:14:55 it would make sense only if we had many 50/50 branches
21:15:55 anyways, hopefully the next gen CPUs will have a "RandomX" MSR
21:16:01 :P
21:16:23 Well, BIOS manufacturers can add it to their "Performance bias" options list
21:16:38 They have it for Geekbench/Cinebench
21:16:53 and they have NDA docs to optimize it further
21:17:05 I can already see "mining BIOS" beta versions
21:18:08 Hmm. That'd be an idea. Ask them "what is the set of MSR values that maximizes hash/watt". They don't have to document stuff, and their CPUs look better. Though I suppose the values could be pretty exact-cpu-version dependent.
21:18:56 as far as I can see, AMD tries to keep their MSR internals consistent within the same CPU family
21:19:07 the description of the prefetchers says it tries to detect strided access patterns
21:19:13 the big change was between the Bulldozer and Ryzen families
21:19:29 would assume when it detects no such pattern, it does nothing.
21:20:01 a working prefetcher consumes power and triggers sometimes, doing unneeded memory requests
21:20:03 but if toggling the MSR affects it so much, that implies either that it's seeing false access patterns in RandomX's accesses, or it's prefetching all the time even when it doesn't know what to prefetch
21:20:08 so lower hashrate and higher power
21:20:25 seems like quite a high power cost.
21:20:45 for what is essentially just maintaining a list of addresses
21:20:48 address offsets.
21:23:04 I think the prefetchers are pretty dumb, they probably load at least one cache line directly after the one read from the dataset
21:23:39 it's even called "stream prefetcher"
21:23:58 hm yeah that'd make more sense
21:24:55 that would have the effect of doubling the bandwidth usage and slightly increasing the latency, which could account for the 5% perf and 5W power diff
21:25:00 Overall it makes sense. I have the analogous setting in LMDB to disable OS prefetching when datasets are larger than RAM
21:26:34 by default, when the kernel is forced to read a page of data, it prefetches 16 pages at a time. if RAM is full, the extra pages may evict useful data from cache and bring in useless data.
21:27:13 so as a rule, prefetch is a good idea for sequential access patterns, and horrible otherwise.
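For reference, this kind of MSR poking is usually done through the Linux msr driver, the same mechanism msr-tools' rdmsr/wrmsr use. A minimal sketch, with the register address left as a placeholder since finding the right register and bit per CPU family is the whole point of the docs hunt above:

```c
/* Sketch of reading (and optionally writing) one MSR via /dev/cpu/N/msr.
 * PREFETCH_MSR is a placeholder: the real register and bit layout are
 * family-specific and have to come from AMD's docs or the undocumented
 * values the miner patches use.  Requires the msr kernel module and root. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PREFETCH_MSR 0x0 /* placeholder: look up the real address per family */

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t val;
    /* the msr driver uses the file offset as the register number */
    if (pread(fd, &val, sizeof val, PREFETCH_MSR) != sizeof val) {
        perror("rdmsr"); return 1;
    }
    printf("MSR 0x%x = 0x%016llx\n", PREFETCH_MSR, (unsigned long long)val);

    /* flipping a bit would look like this; which bit is the open question,
     * so it is left commented out and symbolic here:
     * val |= 1ULL << PREFETCH_DISABLE_BIT;
     * pwrite(fd, &val, sizeof val, PREFETCH_MSR);
     */

    close(fd);
    return 0;
}
```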
21:28:16 Is it possible to write the randomx dataset to a filesystem file and open a read-only shared mmap restricted to 2GiB?
21:28:27 for devices that don't have 2GiB + 80MiB of free RAM?
21:28:42 possible, yes. performance will suck.
21:29:10 I need exact numbers compared to cache-only
21:29:18 It should be better, but how much
21:29:21 I don't know
21:29:35 you should be able to find a suitable device for testing on
21:29:58 page faults will take at least 30ms for a typical HDD
21:29:59 How to limit mmap size in linux?
21:30:42 you don't need to limit mmap size, you're going to mmap exactly the amount of RAM you would normally require.
21:31:09 the trick is to simply eat up the rest of RAM until you've shrunk free RAM down to the desired size.
21:31:18 I asked because the process gets killed after the first try to initialize the dataset.
21:31:29 you can do this just by writing a program that mmaps the rest of RAM privately, and mlocks it all
21:31:56 do you have any swap space configured?
21:32:03 No
21:32:13 note that adding sufficient swap should give you the same performance as a read-only mmap'd file in this use case
21:32:24 With the C++ based implementation of randomx it is hard to free RAM for anything except the dataset
21:32:47 I agree with you about not using c++
21:32:51 you don't need to alter the randomx code at all
21:33:19 anyway, I have tested on my ARM64 boxes with only 2GB of RAM, and swap enabled on a microSD card.
21:33:24 it is quite painfully slow.
21:33:36 There is no way to set up swap
21:34:23 yeah, you will get about 0.01 H/s
21:34:41 I think gingeropolous tested it before
21:35:08 it's much faster to use the 256 MB mode
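A rough sketch of the RAM-balloon program described above: mmap anonymous memory and mlock it until only the desired amount of free RAM remains, so the miner's dataset spills to swap or a file-backed mapping. The 1 GiB target is only an example; locking this much memory needs root or a raised RLIMIT_MEMLOCK.

```c
/* Sketch: pin ("balloon") anonymous memory until only `leave_free` bytes
 * of RAM remain, then sleep.  Run it alongside the miner to force part of
 * the dataset out of RAM.  Error handling is minimal and the numbers are
 * only illustrative. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/sysinfo.h>
#include <unistd.h>

int main(void)
{
    const unsigned long long leave_free = 1ULL << 30; /* e.g. leave 1 GiB free */

    struct sysinfo si;
    if (sysinfo(&si) != 0) { perror("sysinfo"); return 1; }

    unsigned long long free_bytes = (unsigned long long)si.freeram * si.mem_unit;
    if (free_bytes <= leave_free) { puts("nothing to do"); return 0; }

    size_t balloon = free_bytes - leave_free;
    void *p = mmap(NULL, balloon, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* mlock faults the pages in and keeps them resident, shrinking what is
     * left over for everything else on the machine */
    if (mlock(p, balloon) != 0) { perror("mlock"); return 1; }

    printf("locked %zu MiB, sleeping\n", balloon >> 20);
    pause(); /* keep holding the memory until killed */
    return 0;
}
```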
21:35:12 Another implementation is required in order to have a hybrid dataset: partially in-memory, partially recomputed on the fly
21:35:24 But it requires some time to implement
21:35:58 Yeah I guess that could be worthwhile. For 50% of the dataset, rest using the 256MB cache.
21:36:09 for the accesses that land in the dataset memory, you win.
21:36:09 the benefit of this will be smaller than you think
21:36:25 50% dataset will be not even 2 times faster than the light mode
21:36:27 50% dataset still means a 4x slowdown
21:37:51 replace 50% with a configurable parameter that should be set at runtime based on free RAM
21:37:57 It will in practice be close to 2GiB
21:37:59 not 50%
21:38:11 a 4x slowdown is still faster than light mode's 6x slowdown
21:44:34 I've tested single core performance with the MSR mod and got 1287 h/s, so 10296 h/s is possible without the memory bottleneck
21:45:03 which means the real-life 9670 h/s is still 6% slower than it can do in theory
21:51:01 I don't like the fact that these upper-bound limits are changing faster than I expect
21:54:29 why does it matter to you?
21:54:37 it's only a theoretical limit, I don't think everything will scale linearly to all cores
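For what it's worth, a rough sketch of the hybrid dataset read path debated above. The helper names are hypothetical stand-ins, not the actual RandomX API, and the in-RAM fraction would be a runtime parameter chosen from available free RAM rather than a fixed 50%.

```c
/* Hypothetical hybrid dataset: the first `stored_items` entries live in RAM,
 * everything past that is recomputed from the 256 MB cache, light-mode style.
 * dataset_base and compute_item_from_cache() are stand-ins, not real RandomX
 * identifiers. */
#include <stdint.h>
#include <string.h>

#define ITEM_SIZE 64  /* one dataset item is a 64-byte cache line */

extern const uint8_t *dataset_base;  /* partially populated dataset in RAM */
extern uint64_t stored_items;        /* how many items actually fit in RAM */
extern void compute_item_from_cache(uint64_t item, uint8_t out[ITEM_SIZE]);

void read_dataset_item(uint64_t item, uint8_t out[ITEM_SIZE])
{
    if (item < stored_items) {
        /* fast path: plain memory read, same cost as the full ("fast") mode */
        memcpy(out, dataset_base + item * ITEM_SIZE, ITEM_SIZE);
    } else {
        /* slow path: recompute the item from the cache, as light mode does */
        compute_item_from_cache(item, out);
    }
}
```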