[LA64_DYNAREC] Added more 660F opcodes by xiangzhai · Pull Request #2127 · ptitSeb/box64 · GitHub

[LA64_DYNAREC] Added more 660F opcodes #2127

Merged: 3 commits merged on Dec 10, 2024

xiangzhai
Contributor

Hi,

dav1d benchmark under box64 on LoongArch64, before and after this patch:

./dav1d -i Chimera-AV1-8bit-480x270-552kbps.ivf --muxer null

before:

Decoded 8929/8929 frames (100.0%) - 3.67/23.98 fps (0.15x)

real	40m35.251s
user	40m20.151s

after:

Decoded 8929/8929 frames (100.0%) - 89.06/23.98 fps (3.71x)

real	1m40.299s
user	1m39.578s
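For reference, the overall speedup can be read straight off these numbers. The small sketch below (helper names are made up for illustration; the constants are copied from the two runs) converts the real times to seconds and divides, and also shows that dav1d's (0.15x)/(3.71x) figures appear to be the decode rate relative to the clip's 23.98 fps:

```c
#include <stdio.h>

/* Hypothetical helpers; all constants come from the dav1d runs above. */

/* Convert a "real" time such as 40m35.251s into seconds. */
static double to_seconds(int minutes, double seconds) {
    return minutes * 60.0 + seconds;
}

/* Wall-clock ratio: real time before the patch / real time after it. */
double wall_clock_speedup(void) {
    return to_seconds(40, 35.251) / to_seconds(1, 40.299);
}

/* Ratio of the reported decode rates (frames per second). */
double fps_speedup(void) {
    return 89.06 / 3.67;
}

/* Decode rate divided by the clip's nominal 23.98 fps. */
double realtime_factor(double fps) {
    return fps / 23.98;
}

void print_speedups(void) {
    printf("wall clock: %.2fx, fps: %.2fx\n", wall_clock_speedup(), fps_speedup());
    printf("realtime before: %.2fx, after: %.2fx\n",
           realtime_factor(3.67), realtime_factor(89.06));
}
```

Both ratios come out at roughly 24x, so the wall-clock and fps numbers tell the same story.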

Please review my patch.

Thanks,
Leslie Zhai

@xiangzhai xiangzhai marked this pull request as draft December 9, 2024 06:11
@xiangzhai
Contributor Author

Sorry for introducing a regression: https://github.com/ptitSeb/box64/actions/runs/12229956103/job/34110518062?pr=2127

I am looking into it...

Thanks,
Leslie Zhai

@ksco
Collaborator
ksco commented Dec 9, 2024

Please change the title to "Added more 660F opcodes" or just "Added more opcodes".

@ksco
Collaborator
ksco commented Dec 9, 2024

Note that I won't be able to review this PR today at least, so take your time fixing the bugs ;)

@xiangzhai xiangzhai changed the title [LA64_DYNAREC] Added PMULHRSW, PABSB, PABSW, PACKUSDW, PMINUW, PMAXSD, PMULLD, PBLENDW, PSRLW, PSUBUSB, PMINUB, PADDUSW, PMAXUB, PSRAD, PSUBSB, PSUBSW, PMINSW, PADDSB, PADDSW, PMAXSW and PMADDWD opcodes [LA64_DYNAREC] Added more 660F opcodes Dec 9, 2024
@xiangzhai xiangzhai marked this pull request as ready for review December 9, 2024 06:58
GETEX(q1, 0, 0);
v0 = fpu_get_scratch(dyn);
v1 = fpu_get_scratch(dyn);
VMULWEV_W_H(v0, q0, q1);
Collaborator

To be honest, this seems like too many instructions to implement this; maybe xvexth.w.h would help here?

Collaborator

LASX is available, so it's a waste to leave the upper 128 bits doing nothing :)
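Whichever instruction sequence is chosen, it has to match PMULHRSW lane by lane. A scalar C reference for one lane, written from the operation given in Intel's SDM (temp = ((a * b) >> 14) + 1, result = bits 16..1 of temp), is handy for checking a candidate sequence by hand. The function names are illustrative and not part of box64:

```c
#include <stdint.h>

/* Portable arithmetic shift right (floor division by 2^n). */
static int32_t asr32(int32_t x, unsigned n) {
    return x < 0 ? ~(~x >> n) : x >> n;
}

/*
 * One 16-bit lane of SSSE3 PMULHRSW, following the SDM's description:
 *   temp = ((a * b) >> 14) + 1;  result = temp[16:1]
 * The product of two int16 values always fits in int32. The result is
 * returned as a raw 16-bit pattern because -32768 * -32768 yields 0x8000.
 */
uint16_t pmulhrsw_lane(int16_t a, int16_t b) {
    int32_t prod = (int32_t)a * (int32_t)b;
    int32_t temp = asr32(prod, 14) + 1;
    return (uint16_t)asr32(temp, 1);   /* conversion to unsigned keeps bits 15..0 */
}

/* Apply it to one 128-bit register's worth of lanes (8 x int16). */
void pmulhrsw_ref(uint16_t dst[8], const int16_t a[8], const int16_t b[8]) {
    for (int i = 0; i < 8; i++)
        dst[i] = pmulhrsw_lane(a[i], b[i]);
}
```

Reading the values as Q15 fixed point, 16384 is 0.5, so pmulhrsw_lane(16384, 16384) returns 8192 (0.25).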

Contributor Author

EV in VMULWEV_W_H means even-numbered elements and OD means odd-numbered elements, so there may be no way to move v1 into the upper 128 bits of v0:

VMULWEV_W_H(v0, q0, q1);
VMULWOD_W_H(v1, q0, q1);
XVEXTH_W_H(v0, v1); 
XVSRAI_W(v0, v0, 14); 
XVADDI_WU(v0, v0, 1);
XVSRANI_H_W(q0, v0, 1);

Please point out my mistake.

Thanks,
Leslie Zhai
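To make the even/odd point concrete, here is a scalar model of the two widening multiplies, assuming VMULWEV_W_H multiplies the even-numbered int16 lanes (0, 2, 4, 6) and VMULWOD_W_H the odd-numbered ones (1, 3, 5, 7), each producing four int32 products. The products come out grouped by parity, so an extra interleave is needed before they are back in lane order. The function names are hypothetical:

```c
#include <stdint.h>

/* Assumed semantics: even form -> lanes 0,2,4,6; odd form -> lanes 1,3,5,7. */
void mulw_even_odd(int32_t ev[4], int32_t od[4],
                   const int16_t a[8], const int16_t b[8]) {
    for (int i = 0; i < 4; i++) {
        ev[i] = (int32_t)a[2 * i] * b[2 * i];
        od[i] = (int32_t)a[2 * i + 1] * b[2 * i + 1];
    }
}

/* Restore original lane order: lane 2*i comes from ev[i], lane 2*i+1 from od[i]. */
void interleave_products(int32_t out[8], const int32_t ev[4], const int32_t od[4]) {
    for (int i = 0; i < 4; i++) {
        out[2 * i] = ev[i];
        out[2 * i + 1] = od[i];
    }
}
```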

Collaborator
@ksco ksco Dec 10, 2024

VEXT2XV_W_H(v0, q0);
VEXT2XV_W_H(v1, q1);
XVMUL_W(v0, v0, v1);
XVSRLI_W(v0, v0, 14);
XVADDI_WU(v0, v0, 1);
XVSRLNI_H_W(v0, v0, 0);
XVPICKOD_D(v0, v0, v0);

Not sure if this is correct, but the idea here is to use the upper 128 bits of LASX.
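A scalar model of this widen-first approach, over all 8 lanes at once, makes two details easy to check. First, a logical shift (XVSRLI_W) instead of an arithmetic one is fine: bits 0..16 of the shifted product are identical either way, and nothing above them reaches the 16-bit result. Second, the narrowing step must still drop one more bit; assuming XVSRLNI_H_W shifts each word right by its immediate before keeping the low 16 bits, the shift of 0 in the sketch would skip the final >> 1 that PMULHRSW needs. The function name is hypothetical, and the model ignores how LASX splits 256-bit operations into two 128-bit halves:

```c
#include <stdint.h>

/*
 * Hypothetical scalar model: sign-extend 8 int16 lanes to int32 (what
 * VEXT2XV_W_H does in the sketch), multiply in 32 bits, round, narrow.
 * The register is flattened into a plain array.
 */
void pmulhrsw_widened(uint16_t dst[8], const int16_t a[8], const int16_t b[8]) {
    for (int i = 0; i < 8; i++) {
        /* Widen and multiply; keep the 32-bit pattern as unsigned. */
        uint32_t w = (uint32_t)((int32_t)a[i] * (int32_t)b[i]);
        /* Logical shift by 14, as XVSRLI_W does. Only bits 0..16 feed the
           bits kept below (the +1 carries only move upward), and those
           bits match an arithmetic shift, so the zero fill is harmless. */
        w = (w >> 14) + 1;
        /* Narrow: keep bits 16..1, i.e. shift right by 1 and truncate. */
        dst[i] = (uint16_t)(w >> 1);
    }
}
```

For example, the lane with a = b = 32767 comes out as 32766, and a = b = -32768 gives the 0x8000 pattern, matching PMULHRSW.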

Collaborator
@ksco ksco Dec 10, 2024

Or maybe this, to save one more instruction? As I said, I'm not sure, but I do think it's doable.

VEXT2XV_W_H(v0, q0);
VEXT2XV_W_H(v1, q1);
XVMUL_W(v0, v0, v1);
XVSRLNI_H_W(v0, v0, 14);
XVADDI_HU(v0, v0, 1);
XVPICKOD_D(v0, v0, v0);
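One thing worth checking in this shorter version: XVSRLNI_H_W(v0, v0, 14) narrows to 16 bits right after the shift by 14, and there is no final >> 1. Assuming XVSRLNI_H_W shifts by its immediate and then keeps the low 16 bits of each word, the model below shows that even adding a lane-local >> 1 afterwards cannot rescue this ordering: bit 16 of (prod >> 14), which becomes the top bit of the result, has already been dropped. The function names are hypothetical:

```c
#include <stdint.h>

/* Narrow first: keep bits 15..0 of (prod >> 14), add 1 in 16 bits, then
   (added here, since PMULHRSW needs it) shift right by 1 inside the lane. */
uint16_t round_after_narrow(int16_t a, int16_t b) {
    uint32_t prod = (uint32_t)((int32_t)a * (int32_t)b);
    uint16_t h = (uint16_t)(prod >> 14);   /* bit 16 is lost here */
    h = (uint16_t)(h + 1);
    return (uint16_t)(h >> 1);             /* logical shift within the lane */
}

/* Round in 32 bits and narrow last, so bits 16..1 survive. */
uint16_t round_before_narrow(int16_t a, int16_t b) {
    uint32_t prod = (uint32_t)((int32_t)a * (int32_t)b);
    return (uint16_t)(((prod >> 14) + 1) >> 1);
}
```

Both agree for a = b = 25600 (result 20000), but for a = 25600, b = -25600 the correct answer is -20000 (bit pattern 45536), while the narrow-first ordering returns 12768.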

Contributor Author

Typo? The destination should probably be q0:

-XVPICKOD_D(v0, v0, v0);
+XVPICKOD_D(q0, v0, v0);

The sse_intrinsics test failed on pmulhrsw.
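When a sequence fails the pmulhrsw test, it can help to collapse the rounding into one step: for any 32-bit product p, ((p >> 14) + 1) >> 1 equals (p + 0x4000) >> 15 with arithmetic shifts, i.e. a rounding shift right by 15. If the LSX/LASX rounding narrow-shift instructions (the vsrarni/xvsrarni family) behave as their names suggest, that could replace the separate add and shift, but this is an assumption to verify against the LoongArch manual and box64's emitter macros. The sketch below only checks the integer identity on a grid of int16 inputs; the names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Portable arithmetic shift right (floor division by 2^n). */
static int32_t asr32(int32_t x, unsigned n) {
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* PMULHRSW rounding as written in the SDM: shift by 14, add 1, shift by 1. */
static int32_t round_two_step(int32_t p) {
    return asr32(asr32(p, 14) + 1, 1);
}

/* Same rounding in one step: add 2^14 (half of 2^15), then shift by 15.
   With p = 2^14 * q + r and 0 <= r < 2^14, both equal floor((q + 1) / 2). */
static int32_t round_one_step(int32_t p) {
    return asr32(p + 0x4000, 15);
}

/* Compare both forms on every product of a strided grid of int16 values
   (stride must be positive). A stride of 257 visits -32768 and 32767. */
bool rounding_identity_holds(int32_t stride) {
    for (int32_t a = -32768; a <= 32767; a += stride)
        for (int32_t b = -32768; b <= 32767; b += stride)
            if (round_two_step(a * b) != round_one_step(a * b))
                return false;
    return true;
}
```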

GETEX(q1, 0, 0);
GETGX_empty(q0);
v0 = fpu_get_scratch(dyn);
VREPLGR2VR_D(v0, xZR);
Collaborator

Use VXOR_V(v0, v0, v0); instead: its latency and IPC are well known, and it's already used everywhere.

GETEX(q1, 0, 0);
GETGX_empty(q0);
v0 = fpu_get_scratch(dyn);
VREPLGR2VR_D(v0, xZR);
Collaborator

Ditto.

xiangzhai and others added 2 commits December 10, 2024 14:00
@xiangzhai xiangzhai requested a review from ksco December 10, 2024 09:16
@ksco ksco requested a review from ptitSeb December 10, 2024 09:34
@ptitSeb ptitSeb merged commit f1addb8 into ptitSeb:main Dec 10, 2024
27 checks passed
@xiangzhai
Contributor Author

Thanks 🍻

@xiangzhai xiangzhai deleted the la64_dynarec_660f branch December 12, 2024 01:04