-
gingeropolous
which MSR for opteron?
-
hyc
gingeropolous: who knows if the opteron even had an aggressive prefetcher to disable in the first place
-
hyc
AMD tended to lag behind on that compared to Intel
-
hyc
in either case the prefetchers are designed to recognize repetitive access patterns. a simple prefetcher may only recognize purely sequential accesses.
-
hyc
if it doesn't recognize the access pattern, it goes inactive
-
hyc
for a highly random access pattern like randomX, it ought to deactivate itself
-
hyc
I guess Bulldozer generation was improved
realworldtech.com/bulldozer/9
-
hyc
Volume 2: System Programming, Appendix A
-
hyc
has listing of all MSRs
-
hyc
well, not all. nothing about prefetchers
-
hyc
Maybe you'll find it in the BIOS and Kernel Dev's Guide for your processor family
-
gingeropolous
this bios is useless.
-
gingeropolous
i'll be happy with 2.5 kh/s per 10 year old processor
-
gingeropolous
total miners 28316
-
Inge-
oh look. MoneroV lives! In a Frankenstein-meets-living-dead sort of way. And guess what, they are moving to RandomV - 'a derivative' of RandomX -
monerov.online/MoneroV-Swap.pdf
-
sech1
RandomV = RandomX, I checked their changes
-
sech1
They only changed ArgonSalt parameter
-
Inge-
That would be my assumption. You get a notification when it gets forked on github?
-
sech1
No, I found them in twitter when I looked for RandomX hashtag
-
Inge-
Lol
-
Inge-
I'm still waiting for them to blow me away with their mimblewimble implementation
-
sech1
Hmm, people are reporting both increased hashrate _and_ lower power after MSR mode, nice!
-
Inge-
I wonder if Arweave will work your optimizations in
-
wow-discord
<wowario> lol, it is seb heading up MoneroV Reloaded
-
Inge-
who is seb?
-
Inge-
Antony is the most active TG user in their group
-
wow-discord
<wowario> a shady multi-shitcoin pool operator
-
Inge-
I'm torn. I kind of feel pity for people having put money into XMV, but on the other hand it is such an obvious scam that I also kind of don't ...
-
gingeropolous
ugh. how do we do that power calculation for nethash
-
gingeropolous
850 million hashes per second * 50 watts per hash per second / 60 minutes = MWh?
-
hyc
50W per hash is insanely bad
-
fort3hlulz
my 3700x is around 100h/w
-
fort3hlulz
7700h/s @ 77W total package draw
-
tevador
stock Ryzen is 50 H/J, I wouldn't say insanely bad
-
hyc
please read again.
-
hyc
50W per hash, not 50 hash per W
-
tevador
yeah, gingeropolous now you know the problem with your formula
-
tevador
fort3hlulz: package power is a useless metric
-
fort3hlulz
What's a more useful metric? W at the wall?
-
fort3hlulz
Haven't captured that yet
-
tevador
yes
-
fort3hlulz
I've just been looking at tuning CPU/RAM variables to up efficiency (but haven't implemented any changes yet)
-
fort3hlulz
So just using package power as a guide to efficiency of the CPU itself
-
gingeropolous
ok, so 850e6 hashes/second * 1 joule / 50 hashes = 17 mega joule/sec = 17 MW
-
tevador
yes, sounds about right
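
The corrected calculation above can be written out as a short Python sketch. The inputs are the figures quoted in the chat (850e6 H/s network hashrate, 50 H/J for a stock Ryzen); the function name and the per-day energy figure are illustrative additions, not something from the discussion.

```python
def network_power_watts(nethash_hps, efficiency_h_per_j):
    """(hashes/second) / (hashes/joule) = joules/second = watts."""
    return nethash_hps / efficiency_h_per_j

nethash = 850e6      # H/s, network hashrate quoted in the chat
efficiency = 50.0    # H/J, stock Ryzen figure quoted in the chat

power_w = network_power_watts(nethash, efficiency)
energy_mwh_per_day = power_w * 24 / 1e6   # W * hours -> Wh, then -> MWh

print(f"network power: {power_w / 1e6:.1f} MW")
print(f"energy per day: {energy_mwh_per_day:.0f} MWh")
```

Dividing by the efficiency in H/J (rather than multiplying by "watts per hash") is what makes the units come out as watts.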
-
hyc
was thinking earlier today, if we wrap randomx-benchmark with perf counter monitoring, it would be quite an excellent CPU+memory benchmark
-
hyc
e.g. a simple wrapper script that runs a couple times while monitoring cache hit rates, branch rates, etc.
-
hyc
should be simple with linux `perf`
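
A minimal sketch of such a wrapper in Python, assuming Linux `perf` is installed and that the benchmark command line is supplied by the caller. The event list and the helper names (`build_perf_command`, `parse_perf_csv`, `run_once`) are illustrative choices, and which events are actually available depends on the CPU and kernel.

```python
import subprocess

# Generic perf event names; availability varies by CPU and kernel.
EVENTS = ["cache-references", "cache-misses", "branches", "branch-misses"]

def build_perf_command(benchmark_cmd, events=EVENTS):
    """Wrap a benchmark command line in `perf stat` with CSV output (-x ,)."""
    return ["perf", "stat", "-x", ",", "-e", ",".join(events), "--"] + list(benchmark_cmd)

def parse_perf_csv(text):
    """Parse `perf stat -x ,` output, whose lines start with value,unit,event."""
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) < 3:
            continue
        value, event = fields[0], fields[2]
        if value.isdigit():          # skips "<not supported>" / "<not counted>"
            counts[event] = int(value)
    return counts

def run_once(benchmark_cmd):
    """Run the benchmark under perf once and return the counter values."""
    proc = subprocess.run(build_perf_command(benchmark_cmd),
                          capture_output=True, text=True, check=True)
    return parse_perf_csv(proc.stderr)   # perf stat reports on stderr

def ratio(counts, part, whole):
    """E.g. ratio(counts, "cache-misses", "cache-references"); None if missing."""
    if part in counts and counts.get(whole):
        return counts[part] / counts[whole]
    return None
```

Calling `run_once([...])` a few times with the benchmark's own command line and comparing the ratios between runs would give the "couple of runs" behaviour described above.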
-
gingeropolous
and send shares to the dev fund? :P
-
hyc
heh
-
sech1
So it's almost 9000 h/s per memory channel
-
hyc
no mention of memory timings?
-
sech1
from his earlier posts
-
sech1
Ryzen 3950x @4GHz with 32GB memory @ 3600MHz timings 15-14-14-14-32-2T (as tight as I could get)
-
sech1
and he was at 15.45 kh/s when he posted it 12 days ago
-
hyc
4.4GHz CPU sheesh
-
hyc
y'think he could sustain that for over 24 hours?
-
sech1
+15.8% on his system (same 4 GHz clock)
-
sech1
no, he runs at 4 GHz now
-
hyc
ah ok
-
tevador
and the most important thing: power?
-
sech1
it shouldn't be a lot at 4 GHz
-
sech1
he posted
imgur.com/a/9nHQDkO yesterday - it shows PPT "61% of 200W", so package power 122W
-
sech1
but it was 3.875 GHz
-
mortti
hello, I have been lurking the channel using monerologs.net but I'm the guy who posted to reddit with a nickname of mmrdx
-
hyc
have you got measurements of power at the wall?
-
mortti
my memory timings have been improved from the previous post to 14-14-14-14-28, the single rank b-die 3200 fast preset using dram calculator but applied to dual rank 3600
-
sech1
hello mortti
-
sech1
dual rank 3600 @ 14-14-14-28? This is insane
-
hyc
very nice memory timings indeed
-
mortti
yeah, it took 1.45v to get it stable
-
sech1
Next thing is to lurk SuperPi benchers' forums and get an idea how they run 4000 @ 12-12-12-... :D
-
hyc
4000 @ 12 now *that* sounds insane
-
mortti
I have a lot of stuff drawing power on this workstation but at 3.85GHz I measured 270W at the socket
-
hyc
puts it close to around 60H/W. sacrificed a lot of efficiency for that rate
-
mortti
there are two radeon vegas, 8 fans, water cooling pump, etc. attached to this rig
-
mortti
but I agree, efficiency is best at 3.6GHz or so
-
sech1
3750 MHz 12-11-11-28 1.8V
-
mortti
I was also able to squeeze 5200H/s from sandy bridge xeon 1680v2 @4GHz with ddr3 :)
-
gingeropolous
total miners 28601
-
cohcho
What is the source of these numbers in RandomX/README.md: "* DDR4 memory is limited to about 4000-6000 H/s per channel"? How can i reproduce it?
-
sech1
These are old numbers. It's over 9000 h/s per channel now.
-
cohcho
The question is about the source and how to reproduce them?
-
cohcho
I suppose it wasn't just measurement in real life experiments.
-
cohcho
I saw it somewhere. Finally, found it in RandomX/doc/design.md.
-
cohcho
The formula is : 1 second / ( DRAM single bank read latency 40 nanosecond) / (16384 reads from dataset / 1 hash) * (banks in 1 raw module 8) = 1 / 40e-9 / 16384 * 8 ~= 12207 H/s
-
cohcho
banks in 1 RAM module *
-
cohcho
So it isn't 6000 even theoretically.
-
tevador
cohcho: you are wrong about 8 banks per module
-
cohcho
Let me check doc about any real world ram again
-
tevador
it's bank GROUPS, not banks
-
tevador
DDR3 has only 1 bank group per channel
-
tevador
DDR4 has 2-4
-
tevador
it doesn't matter how many banks the group has because each group shares the same control circuitry, so there can be only 1 pending read per group
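
The latency-bound estimate being discussed can be sketched in Python as follows. It uses cohcho's formula with the number of independent units as a parameter, so it can be evaluated with 8 banks (cohcho's reading) or with 1 to 4 bank groups (tevador's correction). This is only a rough model under the assumption of one pending read per bank group; the function name is made up for illustration.

```python
READS_PER_HASH = 16384   # dataset reads per hash, from cohcho's formula

def max_hashrate_per_channel(latency_ns, parallel_reads):
    """Latency-bound H/s if `parallel_reads` reads can be in flight at once."""
    reads_per_second = parallel_reads / (latency_ns * 1e-9)
    return reads_per_second / READS_PER_HASH

# cohcho's original reading: 8 independent banks at 40 ns
print(round(max_hashrate_per_channel(40, 8)))    # 12207

# tevador's correction: only bank groups count (DDR3: 1, DDR4: 2-4)
for groups in (1, 2, 4):
    for latency in (40, 50, 60):
        rate = max_hashrate_per_channel(latency, groups)
        print(f"{groups} group(s), {latency} ns: {rate:.0f} H/s")
```

With 4 bank groups and 40 to 60 ns latency this gives roughly 4000 to 6100 H/s, which lines up with the "about 4000-6000 H/s per channel" range quoted from the README. The measured "over 9000 h/s per channel" mentioned above is higher, so the model should be taken as a simplification rather than a hard bound.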
-
tevador
"The DDR4 architecture is an 8n prefetch with two or four selectable bank groups. This design will permit the DDR4 memory devices to have separate activation, read, write or refresh operations underway in each unique bank group."
-
cohcho
I still need some proof that this fundamental access latency limit 50ns is unbreakable for real applications, rereading this paper :
prof.icc.skku.ac.kr/~jaewlee/pubs/isca13_charm.pdf
-
tevador
as you can see, it's not a hard limit, it can be improved with tighter memory timings
-
tevador
but there IS a limit
-
sech1
Intel CPUs can have DRAM latency as low as 40 ns
-
sech1
One thing keeps me wondering... With MSR mod everyone gets 5% or more speed increase AND power reduction of ~5 watts
-
sech1
So what's the actual power spent on calculations and what's the power spent on memory access/unrelated stuff?
-
moneromooo
Did you unset the "NSA_BACKGROUND_MONITORING" bit ? :)
-
sech1
maybe, all changed bits are undocumented for Ryzens :D
-
sech1
or was it just hardware prefetchers that worked all the time, consuming power to reduce hashrate :D
-
tevador
IIRC I measured something like 10+ watts just from memory accesses
-
tevador
you can benchmark it by removing dataset accesses
-
sech1
now the next thing is to find MSR to turn off branch prediction, right?
-
sech1
I'm sure they keep it hidden somewhere...
-
tevador
it's not necessary to disable it
-
tevador
it would make sense only if we had many 50/50 branches
-
tevador
anyways, hopefully the next gen CPUs will have a "RandomX" MSR
-
tevador
:P
-
sech1
Well, BIOS manufacturers can add it to their "Performance bias" options list
-
sech1
They have it for Geekbench/Cinebench
-
sech1
and they have NDA docs to optimize it further
-
sech1
I can already see "mining BIOS" beta versions
-
moneromooo
Hmm. That'd be an idea. Ask them "what is the set of MSR values that maximizes hash/watt". They don't have to document stuff, and their CPUs look better. Though I suppose the values could be pretty exact-cpu-version dependent.
-
sech1
as far as I can see, AMD tries to keep their MSR internals consistent within the same CPU family
-
hyc
the description of the prefetchers says it tries to detect strided access patterns
-
sech1
the big change was between Bulldozer and Ryzen family
-
hyc
would assume when it detects no such pattern, it does nothing.
-
sech1
a working prefetcher consumes power and triggers sometimes, making unneeded memory requests
-
hyc
but if toggling the MSR affects it so much, that implies either that it's seeing false access patterns in RandomX's accesses, or it's prefetching all the time even when it doesn't know what to prefetch
-
sech1
so lower hashrate and higher power
-
hyc
seems like quite a high power cost.
-
hyc
for what is essentially just maintaining a list of addresses
-
hyc
address offsets.
-
tevador
I think the prefetchers are pretty dumb, they probably load at least one cache line directly after the one read from the dataset
-
tevador
it's even called "stream prefetcher"
-
hyc
hm yeah that'd make more sense
-
tevador
that would have the effect of doubling the bandwidth usage and slightly increasing the latency, which could account for the 5% perf and 5W power diff
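
To put rough numbers on the bandwidth side of this, here is a small Python sketch. It assumes 16384 dataset reads per hash (the figure used elsewhere in this chat), 64-byte cache lines, and, following tevador's guess, exactly one extra adjacent line fetched per read when the prefetcher is active. The 9000 H/s input is the per-channel rate mentioned earlier; everything else is an illustrative assumption.

```python
READS_PER_HASH = 16384   # dataset reads per hash
CACHE_LINE_BYTES = 64    # assumed size of each dataset read

def dataset_bandwidth(hashrate_hps, extra_lines_per_read=0):
    """Bytes/second pulled from DRAM for dataset reads at a given hashrate."""
    lines_per_read = 1 + extra_lines_per_read
    return hashrate_hps * READS_PER_HASH * CACHE_LINE_BYTES * lines_per_read

hashrate = 9000  # H/s, roughly the per-channel rate quoted above

useful = dataset_bandwidth(hashrate)
with_prefetch = dataset_bandwidth(hashrate, extra_lines_per_read=1)

print(f"useful reads:       {useful / 1e9:.2f} GB/s")
print(f"with 1 extra line:  {with_prefetch / 1e9:.2f} GB/s")
```

Under these assumptions each hash reads 1 MiB of dataset, so one wasted adjacent line per read would double the DRAM traffic without contributing anything to the hashrate.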
-
hyc
Overall it makes sense. I have an analogous setting in LMDB to disable OS prefetching when datasets are larger than RAM
-
hyc
by default, when the kernel is forced to read a page of data, it prefetches 16 pages at a time. if RAM is full, the extra pages may evict useful data from cache and bring in useless data.
-
hyc
so as a rule, prefetch is a good idea for sequential access patterns, and horrible otherwise.
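
As an illustration of the same idea at the OS level (not LMDB's actual code), a Python sketch can map a file read-only and hint to the kernel that accesses will be random, which on Linux reduces readahead for that mapping. The helper name is hypothetical; `mmap.madvise` and `mmap.MADV_RANDOM` need Python 3.8+ and a platform that provides them, so the call is guarded.

```python
import mmap
import os
import tempfile

def open_random_access_map(path):
    """Map a file read-only and advise the kernel that access will be random."""
    with open(path, "rb") as f:
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Guarded: madvise/MADV_RANDOM are not available on every platform.
    if hasattr(m, "madvise") and hasattr(mmap, "MADV_RANDOM"):
        m.madvise(mmap.MADV_RANDOM)
    return m

# Small demo on a temporary two-page file.
page = mmap.PAGESIZE
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"A" * page + b"B" * page)
    demo_path = tmp.name

demo_map = open_random_access_map(demo_path)
first, last = demo_map[0:1], demo_map[2 * page - 1:2 * page]
demo_map.close()
os.unlink(demo_path)
print(first, last)
```

The hint only changes how much the kernel reads ahead on a page fault; the data returned is the same either way.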
-
cohcho
Is it possible to write randomx dataset into filesystem file and open read only shared mmap with restriction 2GiB only?
-
cohcho
for devices that don't have 2GiB + 80MiB of free RAM?
-
hyc
possible, yes. performance will suck.
-
cohcho
I need exact numbers comparing to cache only
-
cohcho
It should be better but how much
-
cohcho
I don't know
-
hyc
you should be able to find a suitable device for testing on
-
hyc
page faults will take at least 30ms for a typical HDD
-
cohcho
How to limit mmap size in linux?
-
hyc
you don't need to limit mmap size, you're going to mmap exactly the amount of RAM you would normally require.
-
hyc
the trick is to simply eat up the rest of RAM until you've shrunk free RAM down to the desired size.
-
cohcho
I asked because the process gets killed after the first try to initialize the dataset.
-
hyc
you can do this just by writing a program that mmaps the rest of RAM privately, and mlocks it all
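
A rough Python sketch of such a "RAM eater", assuming Linux, a libc that exposes `mlock`, and a memlock limit (`ulimit -l`) high enough for the requested size. The helper names and the use of `MemAvailable` from `/proc/meminfo` to size the allocation are illustrative assumptions, not something specified in the chat.

```python
import ctypes
import mmap
import os

def parse_mem_available_kib(meminfo_text):
    """Extract the MemAvailable value (in KiB) from /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")

def bytes_to_lock(mem_available_kib, target_free_bytes):
    """How much to lock so that roughly target_free_bytes stays available."""
    return max(0, mem_available_kib * 1024 - target_free_bytes)

def eat_ram(n_bytes):
    """Privately mmap n_bytes of anonymous memory and mlock it into RAM."""
    buf = mmap.mmap(-1, n_bytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    libc = ctypes.CDLL(None, use_errno=True)
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(n_bytes)) != 0:
        err = ctypes.get_errno()
        buf.close()
        raise OSError(err, os.strerror(err))
    return buf   # keep a reference for as long as the memory should stay locked
```

Usage would be along the lines of reading `/proc/meminfo`, calling `bytes_to_lock(...)` with the amount of free RAM to simulate, passing the result to `eat_ram(...)`, and then running the unmodified miner while the returned buffer is still alive.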
-
hyc
do you have any swap space configured?
-
cohcho
No
-
hyc
note that adding sufficient swap should give you the same performance as a read-only mmap'd file in this use case
-
cohcho
With the C++ based implementation of randomx it is hard to free RAM for other things except the dataset
-
cohcho
I agree with you about not using c++
-
hyc
you don't need to alter the randomx code at all
-
hyc
anyway, I have tested on my ARM64 boxes with only 2GB of RAM, and swap enabled on a microSD card.
-
hyc
it is quite painfully slow.
-
cohcho
There is no way to setup swap
-
tevador
yeah, you will get about 0.01 H/s
-
tevador
I think gingeropolous tested it before
-
tevador
it's much faster to use the 256 MB mode
-
cohcho
Another implementation is required in order to have a hybrid dataset : partially in-memory, partially recomputed on the fly
-
cohcho
But it requires some time to implement
-
hyc
Yeah I guess that could be worthwhile. for 50% of dataset, rest using 256MB cache.
-
hyc
for the accesses that land in the dataset memory, you win.
-
tevador
the benefit of this will be smaller than you think
-
sech1
50% dataset will not even be 2 times faster than the light mode
-
tevador
50% dataset still means a 4x slowdown
-
cohcho
replace 50% with configurable parameter that should be set in runtime to free ram
-
cohcho
It will be in practise close to 2GiB
-
cohcho
not 50%
-
hyc
4x slowdown is still faster than light mode 6x slowdown
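
A deliberately naive Python model of the hybrid idea, to show why the gain is limited. It assumes the cost per hash is a simple weighted average: accesses that hit the resident part of the dataset cost 1 unit, and the rest cost as much as light mode, taken here as 6x from hyc's figure. Both the linear model and the function name are assumptions made for illustration.

```python
def hybrid_slowdown(resident_fraction, light_slowdown=6.0):
    """Slowdown vs. full-dataset mode when only part of the dataset is in RAM."""
    miss_fraction = 1.0 - resident_fraction
    return resident_fraction * 1.0 + miss_fraction * light_slowdown

for fraction in (0.0, 0.5, 0.75, 0.9, 1.0):
    slow = hybrid_slowdown(fraction)
    gain_vs_light = hybrid_slowdown(0.0) / slow
    print(f"{fraction:.0%} resident: {slow:.2f}x slower than full mode, "
          f"{gain_vs_light:.2f}x faster than light mode")
```

With half of the dataset resident this gives 3.5x, close to tevador's 4x and consistent with sech1's "not even 2 times faster than the light mode". The model ignores any extra overhead a real hybrid implementation would add, so it is probably optimistic.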
-
sech1
I've tested single core performance with MSR mod and got 1287 h/s, so 10296 h/s is possible without memory bottleneck
-
sech1
which means real-life 9670 h/s is still 6% slower than it can do in theory
-
cohcho
I don't like the fact that these upper bounds are changing faster than I expected
-
hyc
why does it matter to you?
-
tevador
it's only a theoretical limit, I don't think everything will scale linearly to all cores