#monero-pow

03:38

gingeropolous

which wrmsr for opteron?
03:46

hyc

I see servethehome already has some randomx discussion forums.servethehome.com/index.php?t…o-changed-to-the-randomx-algo.26589
03:47

hyc

gingeropolous: who knows if the opteron even had an aggressive prefetcher to disable in the first place
03:47

hyc

AMD tended to lag behind on that compared to Intel
03:48

hyc

in either case the prefetchers are designed to recognize repetitive access patterns. a simple prefetcher may only recognize purely sequential accesses.
03:48

hyc

if it doesn't recognize the access pattern, it goes inactive
03:49

hyc

for a highly random access pattern like randomX, it ought to deactivate itself
03:50

hyc

I guess Bulldozer generation was improved realworldtech.com/bulldozer/9
03:53

hyc

oh here you go amd.com/en/support/tech-docs
03:53

hyc

Volume 2: System Programming, Appendix A
03:53

hyc

has listing of all MSRs
03:57

hyc

well, not all. nothing about prefetchers
04:00

hyc

Maybe you'll find it in the BIOS and Kernel Developer's Guide for your processor family
05:04

gingeropolous

this bios is useless.
05:04

gingeropolous

i'll be happy with 2.5 kh/s per 10 year old processor
05:34

gingeropolous

total miners 28316
06:23

Inge-

oh look. MoneroV lives! In a Frankenstein-meets-living-dead sort of way. And guess what, they are moving to RandomV - 'a derivative' of RandomX - monerov.online/MoneroV-Swap.pdf
06:24

sech1

RandomV = RandomX, I checked their changes
06:24

sech1

They only changed ArgonSalt parameter
06:25

Inge-

That would be my assumption. You get a notification when it gets forked on github?
06:25

sech1

No, I found them in twitter when I looked for RandomX hashtag
06:26

Inge-

Lol
06:27

Inge-

I'm still waiting for them to blow me away with their mimblewimble implementation
06:36

sech1

Hmm, people are reporting both increased hashrate _and_ lower power after MSR mode, nice!
09:44

Inge-

I wonder if Arweave will work your optimizations in
10:42

wow-discord

<wowario> lol, it is seb heading up MoneroV Reloaded
10:54

Inge-

who is seb?
10:54

Inge-

Antony is the most active TG user in their group
11:00

wow-discord

<wowario> a shady multi-shitcoin pool operator
11:03

Inge-

I'm torn. I kind of feel pity for people having put money into XMV, but on the other hand it is such an obvious scam that I also kind of don't ...
17:03

gingeropolous

ugh. how do we do that power calculation for nethash
17:04

gingeropolous

850 million hashes per second * 50 watts per hash per second / 60 minutes = MWh?
17:07

hyc

50W per hash is insanely bad
17:07

fort3hlulz

my 3700x is around 100h/w
17:07

fort3hlulz

7700h/s @ 77W total package draw
17:07

tevador

stock Ryzen is 50 H/J, I wouldn't say insanely bad
17:08

hyc

please read again.
17:08

hyc

50W per hash, not 50 hash per W
17:09

tevador

yeah, gingeropolous now you know the problem with your formula
17:09

tevador

fort3hlulz: package power is a useless metric
17:10

fort3hlulz

What's a more useful metric? W at the wall?
17:10

fort3hlulz

Haven't captured that yet
17:10

tevador

yes
17:11

fort3hlulz

I've just been looking at tuning CPU/RAM variables to up efficiency (but haven't implemented any changes yet)
17:11

fort3hlulz

So just using package power as a guide to efficiency of the CPU itself
17:13

gingeropolous

ok, so 850e6 hashes/second * 1 joule / 50 hashes = 17 mega joule/sec = 17 MW
17:14

tevador

yes, sounds about right
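
The unit conversion agreed on above can be written out as a minimal sketch. The function name is hypothetical; the 850 MH/s network hashrate and the 50 H/J efficiency are the figures quoted in the chat, and real miners will vary widely around that efficiency.

```python
def network_power_watts(nethash_hs: float, efficiency_h_per_j: float) -> float:
    """Estimate total power draw from network hashrate and miner efficiency.

    (hashes / second) divided by (hashes / joule) gives joules / second,
    which is watts. Dividing by 60 minutes, as in the first attempt, is
    not needed: power is already a rate.
    """
    if efficiency_h_per_j <= 0:
        raise ValueError("efficiency must be positive")
    return nethash_hs / efficiency_h_per_j


# Figures from the chat: 850e6 H/s network hashrate, 50 H/J for a stock Ryzen.
power_w = network_power_watts(850e6, 50.0)
print(f"{power_w / 1e6:.1f} MW")  # 17.0 MW

# Energy is power multiplied by time: running at that power for 24 hours.
energy_mwh = power_w / 1e6 * 24
print(f"{energy_mwh:.0f} MWh per day")  # 408 MWh per day
```

Note that H/W and H/J are the same unit here, since one watt is one joule per second.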
17:15

hyc

was thinking earlier today, if we wrap randomx-benchmark with perf counter monitoring, it would be quite an excellent CPU+memory benchmark
17:16

hyc

e.g. a simple wrapper script that runs a couple times while monitoring cache hit rates, branch rates, etc.
17:16

hyc

should be simple with linux `perf`
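
A rough sketch of such a wrapper, assuming Linux with `perf` installed. The benchmark command line is taken from the caller rather than hard-coded, because the chat does not specify it; the event names are the generic ones `perf list` usually reports, and the helper name `build_perf_command` is made up for illustration.

```python
import subprocess
import sys

# Generic hardware events; availability depends on the CPU and kernel.
EVENTS = ["cache-references", "cache-misses", "branches", "branch-misses"]


def build_perf_command(benchmark_cmd, repeats=3, events=EVENTS):
    """Return the argv that runs benchmark_cmd under `perf stat`.

    -r repeats the run and reports mean and variance,
    -e selects the counters, and `--` ends perf's own options.
    """
    if not benchmark_cmd:
        raise ValueError("benchmark command must not be empty")
    return ["perf", "stat", "-r", str(repeats), "-e", ",".join(events),
            "--", *benchmark_cmd]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the benchmark command.
    cmd = build_perf_command(sys.argv[1:])
    print("running:", " ".join(cmd))
    # perf stat writes its counter summary to stderr.
    sys.exit(subprocess.run(cmd).returncode)
```

Invoked as `python wrapper.py <benchmark binary and its arguments>`, this runs the benchmark three times and prints cache and branch statistics averaged over the runs.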
17:17

gingeropolous

and send shares to the dev fund? :P
17:18

hyc

heh
17:22

sech1

17.9 kh/s on 3950X reddit.com/r/MoneroMining/comments/…or_ryzen_on_windows_9100_hs/fao4fyb
17:22

sech1

So it's almost 9000 h/s per memory channel
17:24

hyc

no mention of memory timings?
17:25

sech1

imgur.com/a/yDwlNvn
17:25

sech1

from his earlier posts
17:25

sech1

Ryzen 3950x @4GHz with 32GB memory @ 3600MHz timings 15-14-14-14-32-2T (as tight as I could get)
17:26

sech1

and he was at 15.45 kh/s when he posted it 12 days ago
17:26

hyc

4.4GHz CPU sheesh
17:26

hyc

y'think he could sustain that for over 24 hours?
17:26

sech1

+15.8% on his system (same 4 GHz clock)
17:26

sech1

no, he runs at 4 GHz now
17:27

hyc

ah ok
17:27

tevador

and the most important thing: power?
17:28

sech1

it shouldn't be a lot at 4 GHz
17:29

sech1

he posted imgur.com/a/9nHQDkO yesterday - it shows PPT "61% of 200W", so package power 122W
17:30

sech1

but it was 3.875 GHz
17:30

mortti

hello, I have been lurking the channel using monerologs.net but I'm the guy who posted to reddit with a nickname of mmrdx
17:32

hyc

have you got measurements of power at the wall?
17:32

mortti

my memory timings have been improved from the previous post to 14-14-14-14-28, the single rank b-die 3200 fast preset using dram calculator but applied to dual rank 3600
17:32

sech1

hello mortti
17:32

sech1

dual rank 3600 @ 14-14-14-28? This is insane
17:33

hyc

very nice memory timings indeed
17:33

mortti

yeah, it took 1.45v to get it stable
17:33

sech1

Next thing is to lurk SuperPi benchers' forums and get an idea how they run 4000 @ 12-12-12-... :D
17:34

hyc

4000 @ 12 now *that* sounds insane
17:34

mortti

I have a lot of stuff drawing power on this workstation but at 3.85GHz I measured 270W at the socket
17:36

hyc

puts it close to around 60H/W. sacrificed a lot of efficiency for that rate
17:36

mortti

there are two radeon vegas, 8 fans, water cooling pump, etc. attached to this rig
17:37

mortti

but I agree, efficiency is best at 3.6GHz or so
17:43

sech1

hyc techpowerup.com/forums/threads/shar…rk-here.186338/page-25#post-3515562
17:43

sech1

3750 MHz 12-11-11-28 1.8V
17:45

mortti

I was also able to squeeze 5200H/s from sandy bridge xeon 1680v2 @4GHz with ddr3 :)
17:46

gingeropolous

total miners 28601
17:49

sech1

15 kh/s at < 200W: bitcointalk.org/index.php?topic=5203616.msg53337979#msg53337979
19:16

cohcho

What is the source of these numbers in RandomX/README.md: "* DDR4 memory is limited to about 4000-6000 H/s per channel"? How can i reproduce it?
19:20

sech1

These are old numbers. It's over 9000 h/s per channel now.
19:22

cohcho

The question is about the source and how to reproduce them?
19:22

cohcho

I suppose it wasn't just measurement in real life experiments.
19:24

sech1

i.imgur.com/DFeEqbP.jpg
19:25

sech1

reddit.com/r/MoneroMining/comments/…l_results_of_hashrate_after/fagozzk
19:47

cohcho

I saw it somewhere. Finally, found it in RandomX/doc/design.md.
19:49

cohcho

The formula is : 1 second / ( DRAM single bank read latency 40 nanosecond) / (16384 reads from dataset / 1 hash) * (banks in 1 raw module 8) = 1 / 40e-9 / 16384 * 8 ~= 12207 H/s
19:49

cohcho

banks in 1 RAM module *
19:50

cohcho

So it isn't 6000 even theoretically.
19:51

tevador

cohcho: you are wrong about 8 banks per module
19:51

cohcho

Let me check doc about any real world ram again
19:52

tevador

it's bank GROUPS, not banks
19:52

tevador

DDR3 has only 1 bank group per channel
19:52

tevador

DDR4 has 2-4
19:54

tevador

it doesn't matter how many banks the group has because each group shares the same control circuitry, so there can be only 1 pending read per group
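
The estimate under discussion can be reworked with bank groups as the unit of parallelism instead of banks. This is only a sketch of the arithmetic from the chat: the 40 ns latency and 16384 dataset reads per hash are the figures quoted above, the function name is hypothetical, and the model assumes exactly one pending read per bank group with no other overhead.

```python
READS_PER_HASH = 16384  # dataset reads per hash, as quoted in the chat


def hashes_per_second(latency_ns: float, parallel_reads: int) -> float:
    """Latency-bound hashrate for one memory channel.

    Each independent unit (bank group) completes one read per latency
    period, and every hash needs READS_PER_HASH dependent dataset reads.
    """
    reads_per_second = 1e9 / latency_ns * parallel_reads
    return reads_per_second / READS_PER_HASH


# DDR3: 1 bank group per channel; DDR4: 2 or 4 bank groups.
for groups in (1, 2, 4):
    print(groups, round(hashes_per_second(40.0, groups)))
# 1 1526
# 2 3052
# 4 6104

# Counting 8 banks instead of bank groups reproduces the 12207 H/s figure.
print(round(hashes_per_second(40.0, 8)))  # 12207
```

With 4 bank groups the model lands near the upper end of the 4000-6000 H/s range quoted from the README; tighter timings lower the latency term and raise the result.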
19:56

tevador

"The DDR4 architecture is an 8n prefetch with two or four selectable bank groups. This design will permit the DDR4 memory devices to have separate activation, read, write or refresh operations underway in each unique bank group."
20:17

cohcho

I still need some proof that this fundamental access latency limit 50ns is unbreakable for real applications, rereading this paper : prof.icc.skku.ac.kr/~jaewlee/pubs/isca13_charm.pdf
20:23

tevador

as you can see, it's not a hard limit, it can be improved with tighter memory timings
20:23

tevador

but there IS a limit
20:27

sech1

Intel CPUs can have DRAM latency as low as 40 ns
21:10

sech1

One thing keeps me wondering... With MSR mod everyone gets 5% or more speed increase AND power reduction of ~5 watts
21:10

sech1

So what's the actual power spent on calculations and what's the power spent on memory access/unrelated stuff?
21:11

moneromooo

Did you unset the "NSA_BACKGROUND_MONITORING" bit ? :)
21:12

sech1

maybe, all changed bits are undocumented for Ryzens :D
21:13

sech1

or was it just hardware prefetchers that worked all the time, consuming power to reduce hashrate :D
21:13

tevador

IIRC I measured something like 10+ watts just from memory accesses
21:13

tevador

you can benchmark it by removing dataset accesses
21:14

sech1

now the next thing is to find MSR to turn off branch prediction, right?
21:14

sech1

I'm sure they keep it hidden somewhere...
21:14

tevador

it's not necessary to disable it
21:14

tevador

it would make sense only if we had many 50/50 branches
21:15

tevador

anyways, hopefully the next gen CPUs will have a "RandomX" MSR
21:16

tevador

:P
21:16

sech1

Well, BIOS manufacturers can add it to their "Performance bias" options list
21:16

sech1

They have it for Geekbench/Cinebench
21:16

sech1

and they have NDA docs to optimize it further
21:17

sech1

I can already see "mining BIOS" beta versions
21:18

moneromooo

Hmm. That'd be an idea. Ask them "what is the set of MSR values that maximizes hash/watt". They don't have to document stuff, and their CPUs look better. Though I suppose the values could be pretty exact-cpu-version dependent.
21:18

sech1

as far as I can see, AMD tries to keep their MSR internals consistent within the same CPU family
21:19

hyc

the description of the prefetchers says it tries to detect strided access patterns
21:19

sech1

the big change was between Bulldozer and Ryzen family
21:19

hyc

would assume when it detects no such pattern, it does nothing.
21:20

sech1

a working prefetcher consumes power and triggers sometimes, doing unneeded memory requests
21:20

hyc

but if toggling the MSR affects it so much, that implies either that it's seeing false access patterns in RandomX's accesses, or it's prefetching all the time even when it doesn't know what to prefetch
21:20

sech1

so lower hashrate and higher power
21:20

hyc

seems like quite a high power cost.
21:20

hyc

for what is essentially just maintaining a list of addresses
21:20

hyc

address offsets.
21:23

tevador

I think the prefetchers are pretty dumb, they probably load at least one cache line directly after the one read from the dataset
21:23

tevador

it's even called "stream prefetcher"
21:23

hyc

hm yeah that'd make more sense
21:24

tevador

that would have the effect of doubling the bandwidth usage and slightly increasing the latency, which could account for the 5% perf and 5W power diff
21:25

hyc

Overall it makes sense. I have the analogous setting in LMDB to disable OS prefetching when datasets are larger than RAM
21:26

hyc

by default, when the kernel is forced to read a page of data, it prefetches 16 pages at a time. if RAM is full, the extra pages may evict useful data from cache and bring in useless data.
21:27

hyc

so as a rule, prefetch is a good idea for sequential access patterns, and horrible otherwise.
21:28

cohcho

Is it possible to write randomx dataset into filesystem file and open read only shared mmap with restriction 2GiB only?
21:28

cohcho

for devices that don't have 2GiB + 80MiB of free RAM?
21:28

hyc

possible, yes. performance will suck.
21:29

cohcho

I need exact numbers comparing to cache only
21:29

cohcho

It should be better but how much
21:29

cohcho

I don't know
21:29

hyc

you should be able to find a suitable device for testing on
21:29

hyc

page faults will take at least 30ms for a typical HDD
21:29

cohcho

How to limit mmap size in linux?
21:30

hyc

you don't need to limit mmap size, you're going to mmap exactly the amount of RAM you would normally require.
21:31

hyc

the trick is to simply eat up the rest of RAM until you've shrunk free RAM down to the desired size.
21:31

cohcho

I asked because the process gets killed after the first try to initialize the dataset.
21:31

hyc

you can do this just by writing a program that mmaps the rest of RAM privately, and mlocks it all
21:31

hyc

do you have any swap space configured?
21:32

cohcho

No
21:32

hyc

note that adding sufficient swap should give you the same performance as a read-only mmap'd file in this use case
21:32

cohcho

With c++ based implementation of randomx It is hard to free ram for other things except dataset
21:32

cohcho

I agree with you about not using c++
21:32

hyc

you don't need to alter the randomx code at all
21:33

hyc

anyway, I have tested on my ARM64 boxes with only 2GB of RAM, and swap enabled on a microSD card.
21:33

hyc

it is quite painfully slow.
21:33

cohcho

There is no way to setup swap
21:34

tevador

yeah, you will get about 0.01 H/s
21:34

tevador

I think gingeropolous tested it before
21:35

tevador

it's much faster to use the 256 MB mode
21:35

cohcho

Another implementation is required in order to have a hybrid dataset: partially in-memory, partially recomputed on the fly
21:35

cohcho

But it requires some time to implement
21:35

hyc

Yeah I guess that could be worthwhile. for 50% of dataset, rest using 256MB cache.
21:36

hyc

for the accesses that land in the dataset memory, you win.
21:36

tevador

the benefit of this will be smaller than you think
21:36

sech1

50% dataset will be not even 2 times faster than the light mode
21:36

tevador

50% dataset still means a 4x slowdown
21:37

cohcho

replace 50% with configurable parameter that should be set in runtime to free ram
21:37

cohcho

It will be in practise close to 2GiB
21:37

cohcho

not 50%
21:38

hyc

4x slowdown is still faster than light mode 6x slowdown
21:44

sech1

I've tested single core performance with MSR mod and got 1287 h/s, so 10296 h/s is possible without memory bottleneck
21:45

sech1

which means real-life 9670 h/s is still 6% slower than it can do in theory
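
The comparison above can be checked with a few lines. The core count of 8 is an assumption inferred from 10296 / 1287; the chat does not name the CPU, and the function name is made up for illustration.

```python
def memory_bottleneck_gap(single_core_hs: float, cores: int, measured_hs: float):
    """Return (ideal hashrate, fractional shortfall of the measured rate).

    The ideal figure assumes perfect linear scaling from one core to all
    cores, which is an upper bound rather than an expected result.
    """
    ideal = single_core_hs * cores
    return ideal, 1.0 - measured_hs / ideal


# 1287 H/s single core, 9670 H/s measured; 8 cores is assumed.
ideal, gap = memory_bottleneck_gap(1287, 8, 9670)
print(ideal)          # 10296
print(f"{gap:.1%}")   # 6.1%
```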
21:51

cohcho

I don't like the fact that these upper boundary limits are changing faster than I expect
21:54

hyc

why does it matter to you?
21:54

tevador

it's only a theoretical limit, I don't think everything will scale linearly to all cores
