Feed aggregator
AI Voice Agent Architecture: How Real-Time Conversational Systems Work
Article URL: https://www.faridfadaie.com/2026/06/10/ai-voice-agent-architecture/
Comments URL: https://news.ycombinator.com/item?id=48479045
Points: 1
# Comments: 0
GoTailo
GoTailo is an innovative tailoring and boutique management platform designed to help tailoring businesses, fashion boutiques, custom clothing stores, and alteration service providers manage their operations more efficiently. In an industry where precision, organization, and customer satisfaction are critical, GoTailo provides a comprehensive digital solution that simplifies daily business processes and helps tailor shops deliver a superior customer experience. By combining customer management, measurement tracking, order processing, invoicing, staff coordination, and business reporting into one centralized platform, GoTailo empowers tailoring businesses to operate with greater accuracy and productivity.
Comments URL: https://news.ycombinator.com/item?id=48479040
Points: 1
# Comments: 0
SpaceX's $1.78T IPO asks investors to buy Musk's moonshots
Article URL: https://www.ft.com/content/70fa49e3-1014-4412-890f-c7fe91497db9
Comments URL: https://news.ycombinator.com/item?id=48479039
Points: 1
# Comments: 0
The state of building user interfaces in Rust
Article URL: https://areweguiyet.com/#ecosystem
Comments URL: https://news.ycombinator.com/item?id=48479008
Points: 1
# Comments: 0
Free Spotify Premium hacks on social media are spreading infostealers
Short-form video platforms like TikTok and Instagram Reels have become the latest way cybercriminals spread malware.
We’ve already seen attackers move away from traditional phishing emails and toward tactics that trick people into installing malware themselves. Now they’re being lured with slick social media videos that promise free Spotify Premium, free Windows activation, or free Microsoft Office, but instead leave people with infostealers on their Windows devices.
Researchers at ReversingLabs uncovered two active campaigns that use short videos to trick users into running dangerous PowerShell commands or visiting malicious download sites. Similar campaigns have been reported by other researchers and national cybersecurity agencies, suggesting a growing trend: Cybercriminals are learning how to use social media algorithms just as effectively as marketers.
In true social media fashion, the videos on platforms like TikTok and Instagram Reels claim to solve a problem you didn’t know you had. The catch is that following the instructions delivers malware to your device.
How the scam works
The first campaign looks deceptively professional.
Accounts with names like “windows.tips” or “windows.insights” use Windows-style branding and post polished tutorial videos that resemble genuine tech support content. The videos are tagged with Windows and Office-related keywords so they appear alongside legitimate troubleshooting and tips content.
The videos promise to unlock Spotify Premium, Microsoft Office, or Windows for free. Viewers are then guided through step-by-step instructions that include opening PowerShell, a legitimate Windows admin tool, and pasting in commands. Those commands download and run malware, much like the ClickFix scams we’ve covered before.
The malware was identified as Vidar, an infostealer designed to steal sensitive information from infected devices. Vidar commonly targets:
- Saved browser passwords
- Autofill data
- Browser cookies
- Cryptocurrency wallets
- Two-factor authentication (2FA) data
- TOR browser data
The stolen information is then sent back to servers controlled by the attackers.
How to stay safe
Research into similar TikTok-based attacks shows these scripts commonly add exclusions to Windows Defender, making it harder for security software to detect future malicious activity.
Fortunately, there are a few simple ways to protect yourself:
- Only download software from official vendor websites.
- Be skeptical of “free”, cracked, or unofficial versions of paid software.
- Don’t follow instructions on a webpage without thinking them through, especially if the page asks you to run commands on your device or copy and paste code. Many ClickFix pages use countdowns, fake user counters, or other pressure tactics to make you act quickly.
- Check that downloaded files match what you expected to download.
- Verify a file’s publisher and digital signature before you run it. On Windows, you can usually check this by right-clicking the file and selecting Properties > Digital Signatures. Keep in mind that a valid signature does not guarantee a file is safe, but missing or suspicious signatures are often a red flag.
- Use a real-time, up-to-date anti-malware solution to block malware like infostealers before it runs.
Pro tip: If you’re unsure whether a video, message, or website is legitimate, you can ask Malwarebytes Scam Guard about it. It can help identify suspicious content and advise you on what to do next.
Image courtesy of ReversingLabs
France has an established sovereign cloud framework and Germany launched one earlier this year, whereas the Netherlands is still just building its policy foundation
Apple, Please Don't Enter Middle Age With Me: WWDC Left Aspiration Behind
Turn specs into evals for any agent with ASSERT
Today, we’re releasing Adaptive Spec-driven Scoring for Evaluation and Regression Testing (ASSERT), an open-source framework for turning natural-language behavior specifications into executable evaluations. Every team building an AI system starts with a clear intention for the behaviors they want to coax from the product. Those expectations are usually written down somewhere: in a product requirement, a policy document, a system prompt, a launch checklist, or a review note. The more difficult step is turning that intention into an eval suite that’s specific enough to run, inspect, and update as the system changes. ASSERT seeks to address this by turning plain-language requirements into full evaluation pipelines: automatically generating test scenarios, datasets, metrics, and scorecards, then running them against your model, application, or agent.
High-quality behavioral evaluations are essential for understanding whether AI systems behave as intended. But the evaluations that product teams need generally don’t already exist, are often slow to build, are hard to validate, and are quick to go stale. Product requirements change; policies evolve; tools and retrieval environments shift; and models improve until yesterday’s benchmark no longer measures the behavior that matters. The intended behaviors are shaped by the product’s actual context, policies, and tools, but the evaluations used to assess them often only weakly reflect those conditions.
The gap is most visible in application-specific behavior. A support agent should issue refunds below a threshold, escalate likely fraud, and decline out-of-policy requests. A research assistant should synthesize internal and public information without relying on restricted findings. A change-control agent should produce useful plans while respecting approval boundaries. Generic evaluators such as helpfulness, relevance, groundedness, toxicity, and faithfulness can be useful signals, but they don’t test these product-specific behavioral boundaries directly. A system can score well on generic metrics while failing application-specific requirements.
ASSERT is built on the premise that a behavior specification should be a first-class input to evaluation—not just the background context. The framework systematizes the specification, converts it into an inspectable taxonomy, generates stratified test cases from the taxonomy, runs the test cases against the target, and scores each failure against the policy statement that produced it. In the next section, we’ll walk through how each of those steps works in practice.
How ASSERT works
The pipeline has four stages. First, ASSERT turns a broad behavior specification into an explicit concept specification, which is then converted into a granular, editable behavior taxonomy with suggested permissible and impermissible behaviors. Next, it generates stratified test cases over the dimensions the developer declares. Then, it runs those cases against the target system and records the full trace, including tool use and intermediate decisions. Finally, ASSERT scores each trace against the behavior taxonomy and associated policy stance for that case, producing labels, rationales, and failure patterns that developers can inspect and refine.
In the systematization stage, ASSERT turns a broad idea like harmful financial advice, tool-use governance, or unsafe health guidance into something concrete enough to evaluate. Rather than treating the concept as a single label, it represents it as a structured set of patterns, definitions, edge cases, and operational distinctions. Following Agarwal et al. (2026), ASSERT grounds the concept in prior work, reconciles multiple practical definitions, and refines the result into an explicit concept specification.
In the taxonomization stage, ASSERT converts that specification into a draft taxonomy of permissible and impermissible behaviors, together with the artifacts used to derive it. Developers and policy experts can review and revise both before the next stage runs. As inputs, the user provides the behavior description, the number of test-set samples they want, and a systematizer model. The taxonomization step outputs an editable behavior taxonomy that can be validated by a policy expert.
In the test-set generation stage, ASSERT instantiates that taxonomy into executable cases. It can generate single-turn prompts or multi-turn scenarios, including benign interactions and adversarial probes. Developers specify the dimensions that matter for the application, such as task type, persona, tool availability, request class, or environment configuration. ASSERT then builds a stratified set of cases so that behavior is tested across the declared conditions rather than on a narrow slice of easy examples.
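The stratification idea can be sketched in a few lines. This is a minimal illustration, not ASSERT's implementation: the function name, the dimension names, and their values are all assumptions made up for the example. It takes the cross product of the declared dimensions and reserves the same number of case slots in every cell, so no condition is left untested.

```python
from itertools import product


def stratified_cells(dimensions, cases_per_cell=1):
    """Return one case stub per (cell, replicate) over the cross product.

    `dimensions` maps a dimension name to its list of allowed values.
    The names and values are supplied by the caller; nothing here is
    taken from ASSERT's actual schema.
    """
    names = sorted(dimensions)  # fixed order keeps the output deterministic
    cases = []
    for values in product(*(dimensions[name] for name in names)):
        cell = dict(zip(names, values))
        for replicate in range(cases_per_cell):
            cases.append({"cell": cell, "replicate": replicate})
    return cases


# Hypothetical dimensions for a travel-planning agent.
dimensions = {
    "persona": ["budget traveler", "family with children"],
    "request_class": ["benign", "adversarial"],
    "tool_availability": ["all tools", "flights tool down"],
}

cases = stratified_cells(dimensions, cases_per_cell=2)
print(len(cases))  # 2 * 2 * 2 cells with 2 cases each: 16
```

A full cross product grows quickly as dimensions are added, so a real generator would likely sample cells or cap the count; that choice is outside this sketch.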
In the inference stage, ASSERT runs those cases against the target. The target can be a model, an agent, or an application-level workflow. Through its instrumentation layer, ASSERT records not only the final text output but also the evidence needed to interpret the result later: tool calls, retrieved context, routing behavior, and intermediate actions. For agentic systems, those traces are often necessary to understand what actually happened.
In the scoring stage, ASSERT evaluates each trace against the associated behavior or policy stance. The scoring output is not only a pass or flagged label, but also includes a rationale, a policy citation, and the turn or action that justified the verdict. The policy citation refers to the specific taxonomy behavior or developer-provided policy decision that the judge used to support the verdict.
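One way to picture a scoring record with those four parts is the sketch below. The class name, field names, and example values are hypothetical and chosen only for illustration; they are not ASSERT's actual output schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScoreRecord:
    """Hypothetical shape of one verdict; field names are assumptions."""

    case_id: str
    verdict: str          # "pass" or "flagged"
    rationale: str        # why the judge reached the verdict
    policy_citation: str  # taxonomy behavior or policy decision relied on
    evidence_turn: int    # index of the turn or action behind the verdict

    def __post_init__(self):
        if self.verdict not in ("pass", "flagged"):
            raise ValueError(f"unknown verdict: {self.verdict!r}")


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


record = ScoreRecord(
    case_id="case-001",
    verdict="flagged",
    rationale="Quoted a flight price that no tool call returned.",
    policy_citation="quality/fabricated-price-details",
    evidence_turn=3,
)
print(to_jsonl([record]))
```

Keeping the citation and the evidence turn as separate fields means a reviewer can jump from a verdict straight to the action that triggered it.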
Validation
We conducted two internal validation studies for ASSERT. First, we conducted a coverage study to determine whether ASSERT produces better behavior-specific evaluations than a more direct generation approach starting from the same written intent. Then, we evaluated the LLM judges against human review.
The coverage study spanned five behaviors: social scoring, sycophancy, task adherence, tool-use governance, and unsafe health guidance. We tested whether the generated probes surfaced meaningful signal across the target behavior surface rather than collapsing onto a narrow slice of it. Across these suites and three target models, ASSERT produced evaluation sets that were more useful on the properties teams typically need from an eval. Compared with a comparable in-house baseline, ASSERT covered roughly 1.2x as much of the intended behavior space, surfaced about 1.5x as many cases where the model did something worth inspecting, produced more than 4x stronger separation between stronger and weaker systems, and had about half as many saturated cases where every model behaved the same way. It also surfaced roughly 2x as many distinct failure patterns, though we treat that result as directional because failure-type labeling is harder to stabilize than coverage or model separation. These results reinforced a design point that’s easy to underestimate: Coverage is largely determined upstream. If the behavior is underspecified, the generated dataset will be, too. ASSERT is built around a systematization step that makes the behavior explicit before generation begins, so the evaluation set is guided by a structured representation of the target behavior rather than a loose prompt. In practice, this produced evaluation sets that were broader and better aligned with the behaviors developers actually wanted to test.
Second, we validated the judges directly against human review. Across more than 10 behavior concepts, we used LLM judges for a first pass over the full evaluation set, then sampled cases per risk for human validation and independent review. In practice, agreement between LLM judges and human annotators was typically in the 80–90% range, while human inter-annotator agreement was around 90%. This gave us confidence that the judges were capturing much of the intended signal, while also making clear where caution was needed. At the same time, judge quality and stability are partly dependent on the underlying LLM: Different judge models can vary in strictness, boundary sensitivity, and willingness to treat closely related behaviors as distinct.
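For readers who want to run the same kind of check on their own judges, raw agreement can be computed as below. This is a minimal sketch with invented labels; the post does not say how agreement was calculated, so simple percent agreement is an assumption here.

```python
def percent_agreement(judge_labels, human_labels):
    """Fraction of cases where the judge and the human chose the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled case")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def disagreements(judge_labels, human_labels):
    """Indices of cases to send back for closer human review."""
    return [
        i
        for i, (j, h) in enumerate(zip(judge_labels, human_labels))
        if j != h
    ]


# Invented labels for ten sampled cases (illustration only).
judge = ["pass", "flagged", "flagged", "pass", "pass",
         "flagged", "pass", "pass", "flagged", "pass"]
human = ["pass", "flagged", "pass", "pass", "pass",
         "flagged", "pass", "pass", "flagged", "pass"]

print(percent_agreement(judge, human))  # 0.9
print(disagreements(judge, human))      # [2]
```

Raw agreement does not correct for chance, so it can look high when one label dominates; a chance-corrected statistic may be worth reporting alongside it.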
Finally, we also ran qualitative review with subject-matter experts (SMEs) on 15 generated datasets. SMEs reviewed the test cases for policy alignment, behavioral relevance, and overall quality and found that the generated datasets were generally well aligned with the intended policy and risk boundaries. We view this as a complementary form of validation: Beyond quantitative metrics, it showed that the datasets were also credible and useful to experts inspecting them directly.
Taken together, these studies support the two claims we think matter most: Systematization improves the coverage and usefulness of the generated dataset, and decomposed measurements make the resulting evaluations easier to interpret than a single aggregate score. They also highlight an important caveat: Evaluation quality depends not only on the pipeline design, but also on the stability and calibration of the judges used to score it.
>“My favorite thing about ASSERT is that the eval is easy to configure and reason about. I describe the behavior I care about in YAML, point it at a real agent, and get artifacts back. Not just pass/fail. They show why the judge made each call. That openness matters. The spec, generated cases, model outputs, judge rationale, and metrics are all inspectable locally. The eval feels auditable, not like a black box.”
– Lorenze Jay, Open Source Lead, CrewAI
A worked example: A travel-planning agent
To make this concrete, imagine a travel-planning agent that helps users build itineraries. On the surface, this sounds like a simple assistant: Find flights, suggest hotels, check the weather, and produce a plan.
But a real travel agent has to do much more than answer a question. It must use tools in the right order, respect explicit user constraints, ground its recommendations in tool results, and avoid subtle failure modes that traditional single-turn QA benchmarks miss.
For example, the agent shouldn’t invent flight prices. It shouldn’t agree with an itinerary that exceeds the user’s budget. It shouldn’t make stereotyped assumptions about a traveler based on age, disability, family status, or travel style. And it shouldn’t follow malicious instructions hidden inside tool outputs or search results.
The example in the ASSERT repository uses a multi-agent LangGraph travel planner with five tools:
- search_flights
- search_hotels
- check_weather
- check_travel_advisories
- validate_budget
It operates within a six-turn budget, and every run records the full agent trace (tool calls, arguments, tool results, routing decisions, and intermediate state) alongside the final response. That trace matters: it lets the judge cite the specific action responsible for each verdict and assess not only whether the final answer was acceptable, but also why the agent failed and which action caused the failure.
The full example lives in: examples/travel_planner_langgraph/
The evaluation configuration defines six failure-mode categories across two themes:
- Quality: wrong or skipped tool use; fabricated flight, hotel, or price details; budget constraint violations
- Safety: stereotyping; prompt injection from tool output; sycophantic agreement with unsafe or invalid itineraries
To run the evaluation:
    assert-eval run --config eval_config.yaml

    # To inspect the results
    assert-eval results status \
      --results-dir "$PWD/artifacts/results" \
      travel-planner-langgraph-v1 \
      demo-1
ASSERT produces a set of artifacts under the run directory:
- taxonomy.json: the concept spec produced by systematization
- test_set.jsonl: the stratified prompts and multi-turn scenarios
- inference_set.jsonl: per-scenario traces with tool calls and intermediate state
- scores.jsonl: per-trace verdicts with rationale and policy citation
- metrics.json: the aggregate roll-up
Example results:
The dimensions are separated rather than rolled into a single number: The same five scenarios produce 40% over-refusal and 60% policy violation, and those aren’t the same failures. A team optimizing on the aggregate would miss that the agent is failing in both directions at once. The results can be further inspected in a UI widget as shown below:
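The point about separated dimensions can be made concrete with a small sketch. This is not the UI widget or ASSERT's code. The two rates (40% and 60%) come from the text, but the per-scenario assignment and the field names are invented for illustration.

```python
# Hypothetical per-scenario flags. Only the two rates are from the post;
# which scenario fails in which way is an assumption.
scores = [
    {"case_id": "s1", "over_refusal": True, "policy_violation": False},
    {"case_id": "s2", "over_refusal": True, "policy_violation": False},
    {"case_id": "s3", "over_refusal": False, "policy_violation": True},
    {"case_id": "s4", "over_refusal": False, "policy_violation": True},
    {"case_id": "s5", "over_refusal": False, "policy_violation": True},
]


def rate(records, dimension):
    """Share of records flagged on a single dimension."""
    return sum(bool(r[dimension]) for r in records) / len(records)


per_dimension = {
    name: rate(scores, name) for name in ("over_refusal", "policy_violation")
}
any_failure = sum(
    r["over_refusal"] or r["policy_violation"] for r in scores
) / len(scores)

print(per_dimension)  # {'over_refusal': 0.4, 'policy_violation': 0.6}
print(any_failure)    # 1.0
```

Under this invented assignment, a single "any failure" number reads 100% and hides the fact that some scenarios need the agent to be less cautious while others need it to be more cautious.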
Practical considerations
In practice, this framework works best when the behavior definition is relatively narrow and the relevant constraints are clearly specified. Richer descriptions of tools, policies, and boundaries usually lead to more precise scenarios. It’s also worth treating aggregate scores cautiously. In many cases, the most useful output isn’t the summary metric but the collection of failures and traces that shows where the specification, the system, or the evaluation itself needs refinement. ASSERT doesn’t remove the need for judgment in evaluation design. Vague specifications still produce vague scenarios. Synthetic interactions can miss failures that only appear in production settings. And model-based judges can be unreliable, especially when the policy distinction is subtle or highly domain-specific. More broadly, a specification-driven evaluation shouldn’t be treated as a compliance certification or a substitute for human review, telemetry, or domain expertise. It’s better understood as a way to make evaluation faster, more explicit, and easier to iterate on.
Get started
ASSERT is open-source under the MIT license and available today.
- Repository: https://github.com/responsibleai/ASSERT
- Project site: responsibleai.github.io/ASSERT
- Worked example: travel-planning agent
If you build evals and run them as part of your release process, we’d like to hear what works, what doesn’t, and what behaviors you think are hardest to specify. ASSERT is at its most useful when behavior specifications are written down and treated as first-class inputs to evaluation. We’re releasing it in that spirit.
Acknowledgements
PM team: Mehrnoosh Sameki, Minsoo Thigpen, Chang Liu, Abby Palia, Hanna Kim
Science: Riccardo Fogliato, Emily Sheng, Alex Dow, Meera Chander, Alex Chouldechova, Sharman Tan, Xiawei Wang, Ahmed Magooda, Mayank Gupta, Jean Garcia-Gathright, Chad Atalla, Dan Vann, Hanna Wallach, Hannah Washington, Meredith Rodden, Nadine Frey, Melissa Kirkwood, Nick Pangakis, Ali Azad, Ahmed Elghory Ghoneim, Shushan Arakleyan
Eng team: Mohamed Elmergawi, Jake Present, Aaron Aspinwall, Yeming Tang
Design: Sooyeon Hwang, Becky Haruyama
Special thanks: Roni Burd, Mohammad A, Heba Elfardy, Sandeep Atluri, Sydney Lister, Ram Shankar Siva Kumar, Andrew Gully
The post Turn specs into evals for any agent with ASSERT appeared first on Microsoft Security Blog.
Fable will NOT help if it thinks your ML research/ML engineering is interesting
Article URL: https://twitter.com/SemiAnalysis_/status/2064482714149896431
Comments URL: https://news.ycombinator.com/item?id=48478260
Points: 1
# Comments: 1
Beyond Platforms and Protocols
Article URL: https://upstream.force11.org/beyond-platforms-and-protocols/
Comments URL: https://news.ycombinator.com/item?id=48478251
Points: 1
# Comments: 0
Execution Feedback Matters More Than Pipeline Topology in 1-3B Code Generation
Article URL: https://arxiv.org/abs/2604.21950
Comments URL: https://news.ycombinator.com/item?id=48478248
Points: 1
# Comments: 0
Elon Musk accused of fuelling unrest after Belfast knife attack
Article URL: https://dpa-international.com/politics/urn:newsml:dpa.com:20090101:260610-930-200483/
Comments URL: https://news.ycombinator.com/item?id=48478245
Points: 2
# Comments: 0
Show HN: Chip's Challenge (1992), rebuilt for the web
Author here. Remember Chip's Challenge? The 1992 Windows one from the Microsoft Entertainment Pack. I wanted to play it again so I ported it to the browser. Well, Claude did. I was too lazy to write the code, so I made a rule that I wouldn't write any and just bossed it around.
Between us we've solved 80 of the 149 levels. It didn't come free: plenty I had to hand-hold or play through myself, and for the nasty ones I had Claude build a solver that watches YouTube speedruns and rebuilds the moves frame by frame (oddly satisfying to watch: https://www.youtube.com/watch?v=6wndAf4EXNc).
It also built a full level editor (https://claudes-challenge.vercel.app/?level=1#editor) and a replay viewer to watch the solved levels back (https://claudes-challenge.vercel.app/replay.html).
Code: https://github.com/blumk/claudes-challenge
Obligatory IP note: this is someone else's game. I'm assuming it's effectively abandonware but I honestly don't know, so the site might have to come down at some point. The repo is stripped of all the original art and assets, code only.
Comments URL: https://news.ycombinator.com/item?id=48478230
Points: 1
# Comments: 0
The .at domain registry is threatening to send debt collectors (2013)
Article URL: https://old.reddit.com/r/sysadmin/comments/1bnjus/the_austrian_at_domain_registry_is_threatening_to/
Comments URL: https://news.ycombinator.com/item?id=48478219
Points: 1
# Comments: 0
.NET 11 Preview 5 is now available
Article URL: https://devblogs.microsoft.com/dotnet/dotnet-11-preview-5/
Comments URL: https://news.ycombinator.com/item?id=48478208
Points: 1
# Comments: 0
The Design of Display Processors (1968) [pdf]
Article URL: http://cva.stanford.edu/classes/cs99s/papers/myer-sutherland-design-of-display-processors.pdf
Comments URL: https://news.ycombinator.com/item?id=48478196
Points: 1
# Comments: 0
Ask HN: The next evolutionary step in LLM usage?
I'll keep this post short and sweet. We have seen several steps in the evolution of LLM (large language model) usage:
1. Chat
2. Autocomplete
3. Embedding knowledge using RAG
4. Tool calling by LLMs (CLI or MCP)
5. Agentic LLMs executing task(s)
What do you see as the next step or iteration?
My theory is that we will get more quantization and efficient models by the end of 2026 and my hope is that we will have mini models that wrap around tools (I call them domain agents) that just give answers without bloating context.
i.e. the Domain agent gives the calling agent the sausage but doesn't explain how the sausage was made.
Curious what your theories are, but I think we might need a whole rethink of the architecture of LLMs being combined with tools etc.
Comments URL: https://news.ycombinator.com/item?id=48478162
Points: 2
# Comments: 0
Ask HN: Should the term "cognitive surrender" apply to writers who publish slop?
Writing is thinking. I don't think this is an especially disputed claim.
Every day we on HN (along with every other social medium) are treated to largely generated think pieces from presumptive thought leaders. Sometimes there's more evidence of human fingerprints on the ideas at issue, and sometimes it's clear that wading through the slop will yield nothing new, interesting or useful.
Addy Osmani's blog (and substack) is the clearest example of this. Here's someone who has multiple books published under his name - pre-LLM! Now his online properties are a total morass of slop - complete with a painfully self-aggrandizing slop biography celebrating its subject as nothing short of a shaper of the modern Internet.
What it doesn't seem to have is any original thoughts, just the same recycled trash about the best way to build with agents and how we're in a new normal. Dull as dishwater and half as practical. In a bruising bit of irony, "Osmani" even falsely took credit for the term "cognitive surrender" in a recent post.
Bigger picture: I am concerned that the "culture" - such as it is - is not normalizing the things that will serve us well in the future.
Comments URL: https://news.ycombinator.com/item?id=48478151
Points: 1
# Comments: 0
Show HN: HelixDB – A Graph Database built on Object-storage
Hey HN, it’s been just over a year since we launched HelixDB (https://news.ycombinator.com/item?id=43975423), a project a friend and I started in college. It’s an OLTP graph database built on object-storage, with native vector search and full-text search (FTS).
Why graph, vector and FTS? Graph databases provide a natural cognitive model for data, vectors allow for a semantic understanding of the entities and relationships in the graph, and FTS provides more specific filtering. Many AI-driven applications attempt to combine all of these functionalities by stitching together multiple disconnected systems, but even then there’s no native way to perform joins or queries that span all systems. You still need to handle this logic at the application level.
Helix started as a graph DB, but we moved to a hybrid graph/vector approach after attempting to build an AI memory system, which led us down the GraphRAG and HybridRAG rabbit hole, where we would need separate graph and vector databases.
We knew scalability would be a challenge at each stage of our product's development. However, our initial focus this past year was to prove out the product through local deployments, and it was only meant to be run on a single node. Scaling graph DBs remained a difficult and expensive problem we’d have to solve later. Other graph DBs commonly solve scaling either by duplicating entire datasets across distributed machines (extremely expensive per node) or by sharding the data.
Sharding databases is effective and affordable; however, graph data doesn’t have explicit partitions like relational databases do. For example, sharding a relational DB involves splitting up tables. When it comes to graph DBs, the edges can span across any of the partitions, and hopping across multiple machines when traversing nodes is inefficient and computationally expensive.
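To see why edges are the problem, here is a toy sketch that counts how many edges end up crossing shard boundaries. The modulo placement rule and the small graph are assumptions chosen for illustration and say nothing about how any particular graph DB places data.

```python
def shard_of(node_id, shard_count):
    """Toy placement rule: assign a node to a shard by integer modulo."""
    return node_id % shard_count


def cross_shard_edges(edges, shard_count):
    """Edges whose endpoints live on different shards.

    Traversing one of these edges means a hop to another machine.
    """
    return [
        (src, dst)
        for src, dst in edges
        if shard_of(src, shard_count) != shard_of(dst, shard_count)
    ]


# A six-node ring plus one chord, invented for the example.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 2)]

crossing = cross_shard_edges(edges, shard_count=2)
print(f"{len(crossing)} of {len(edges)} edges cross a shard boundary")
# prints "6 of 7 edges cross a shard boundary"
```

A smarter partitioner would cut fewer edges than this naive rule, but in a densely connected graph some traversals will still leave the local shard.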
Replicating graph DBs for high availability and better throughput drastically increases the operational cost of the DB, and there is still a limit to how far you can scale vertically. The workload that we’re used for requires storing a huge amount of data for agents, where only a subset of that data is ever needed at any one time. So rather than having the whole thing in memory, we can store it all in object-storage and get the bits we need when they’re needed.
Agents benefit from better context, which is achieved from more and better data (more relationships etc). By using S3 as the persistence/data layer there is no limit to how big the graph can be or how many relationships you can have, and we can scale to serve throughput and requests by horizontally spinning up nodes and caching relevant subsets of the graph on each node. This way, you get extremely low latency for “hot” data and a p99 of ~100ms for writes and ~50ms for reads from cold storage (S3). Plus you get the benefit of dirt cheap storage.
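The hot/cold split described above can be illustrated with a small read-through cache. This is a sketch under stated assumptions, not HelixDB's implementation: the class name, the LRU eviction policy, and the in-memory dict standing in for S3 are all hypothetical.

```python
from collections import OrderedDict


class HotSubsetCache:
    """Read-through LRU cache in front of a slower cold store."""

    def __init__(self, fetch_from_cold_storage, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._fetch = fetch_from_cold_storage  # stand-in for an object-store read
        self._capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = self._fetch(key)  # slow path: go to cold storage
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return value


# A plain dict plays the role of object storage in this example.
cold_storage = {
    "node:1": ["node:2", "node:3"],
    "node:2": ["node:3"],
    "node:3": [],
}

cache = HotSubsetCache(cold_storage.__getitem__, capacity=2)
for key in ["node:1", "node:2", "node:1", "node:3", "node:2"]:
    cache.get(key)

print(cache.hits, cache.misses)  # 1 4
```

In this run "node:2" is evicted when "node:3" arrives, so the final read goes back to cold storage; a larger capacity or a policy tuned to traversal patterns would change that trade-off.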
Workloads that HelixDB is currently supporting:
- Huge amounts of data (TBs) from which the agents need to search and traverse over
- Offering affordable graph storage for companies where cost of graph data is a bottleneck
- Consolidating multiple databases, enabling AI agents to have autonomy over companies, helping them become more autonomous.
- AI memory
- Company brains
We’re currently working on our own generalised AI memory layer which will use HelixDB under the hood and be completely open-source. Also, we’re finishing up on pre-filtering for vector search which will allow you to pre-filter based on relationships in the graph, metadata, and sub-graphs. And lastly, GA cloud will be available in the coming weeks.
If you want to run Helix locally (either on-disk or in-memory), you can find more info on our github (https://github.com/HelixDB/helix-db) or via our docs (https://docs.helix-db.com/database/local-development). If you’re interested in getting started with our distributed cloud, please email us at founders@helix-db.com.
Many thanks! Comments and feedback welcome!
Comments URL: https://news.ycombinator.com/item?id=48478148
Points: 1
# Comments: 0
