gopher://codevoid.de/1/hn/comments

        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   Achieving 10,000x training data reduction with high-fidelity labels
       
       
        ghm2180 wrote 49 min ago:
        Is it just me, or is the showing of hyperspheres deliberately meant to
        obfuscate some kind of trade secret about how to select examples to
        send to a human?
        
        The thing being obfuscated would be the use of support vector machines,
        which are the go-to for selecting the support vectors and ignoring the
        outliers, with distance being defined between embedding vectors.
        
        I could be wrong; they could be using something different for
        clustering, or something fancier like a variant of DBSCAN.
       
        patresh wrote 1 hour 51 min ago:
        What is the clustering performed on? Is another embedding model used to
        produce the embeddings or do they come from the LLM?
        
        Typically LLMs don't produce usable embeddings for clustering or
        retrieval and embedding models trained with contrastive learning are
        used instead, but there seems to be no mention of any other models than
        LLMs.
        
        I'm also curious about what type of clustering is used here.
       
        scribu wrote 2 hours 37 min ago:
        I'm confused by the clustering step:
        
        > To find the most informative examples, we separately cluster examples
        labeled clickbait and examples labeled benign, which yields some
        overlapping clusters
        
        How can you get overlapping clusters if the two sets of labelled
        examples are disjoint?
       
          patresh wrote 1 hour 48 min ago:
          If the diagram is representative of what is happening, it would seem
          that each cluster is represented as a hypersphere, possibly using the
          cluster centroid and max distance from the centroid to any cluster
          member as radius. Those hyperspheres can then overlap. Not sure if
          that is what is actually happening though.
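
          A minimal sketch of the hypersphere reading described above,
          assuming each cluster is summarized by its centroid and the maximum
          centroid-to-member distance. The function names and the toy 2-D
          "embeddings" are hypothetical and only illustrate how two disjoint
          labelled sets can still yield overlapping spheres; the article does
          not confirm that this is the method used.

```python
import math

def centroid(points):
    """Mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def bounding_sphere(points):
    """Return (center, radius): the centroid and the max distance to any member."""
    c = centroid(points)
    r = max(math.dist(c, p) for p in points)
    return c, r

def spheres_overlap(a, b):
    """Two spheres overlap when the center distance is at most the sum of radii."""
    (ca, ra), (cb, rb) = a, b
    return math.dist(ca, cb) <= ra + rb

# Hypothetical toy data: the two labelled sets share no points.
clickbait = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]]
benign = [[1.5, 0.5], [4.0, 0.0], [3.0, 1.0]]

sphere_a = bounding_sphere(clickbait)
sphere_b = bounding_sphere(benign)
print(spheres_overlap(sphere_a, sphere_b))
```

          Under this reading, the overlap is a property of the spheres, not
          of the point sets, which is why disjoint labels do not rule it out.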
       
          cm228 wrote 2 hours 8 min ago:
          they cluster the examples with their model and then check the
          predictions against the labels.
       
        abhgh wrote 6 hours 51 min ago:
        Active Learning is a very tricky area to get right ... over the years I
        have had mixed luck with text classification, to the point that my
        colleague and I decided to perform a thorough empirical study [1], that
        normalized various experiment settings that individual papers had
        reported. We observed that post normalization, randomly picking
        instances to label is better!
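
        For readers unfamiliar with the comparison above, here is a minimal
        sketch of the two selection strategies being contrasted: uncertainty
        sampling (one common active learning heuristic) and the random
        baseline. The function names and the probabilities are hypothetical;
        this is not the setup from the linked study.

```python
import random

def uncertainty_sample(probs, k):
    """Indices of the k items whose predicted probability is closest to 0.5."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]

def random_sample(n, k, seed=0):
    """Indices of k items drawn uniformly without replacement (the baseline)."""
    return random.Random(seed).sample(range(n), k)

# Hypothetical positive-class probabilities from some binary classifier.
probs = [0.02, 0.48, 0.97, 0.55, 0.10, 0.91]

to_label_active = uncertainty_sample(probs, 2)
to_label_random = random_sample(len(probs), 2)
print(to_label_active, to_label_random)
```

        The active strategy sends the least confident predictions to the
        labeller, while the baseline ignores the model entirely.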
        
   URI  [1]: https://aclanthology.org/2024.emnlp-main.1240/
       
        trhway wrote 9 hours 1 min ago:
        Reminds me of how one of the winners of Andrew Ng's 2021 Data-Centric
        AI competition analyzed embedding separation to choose training data
        
   URI  [1]: https://rensdimmendaal.com/posts/data-centric-ai
       
        ericyd wrote 13 hours 41 min ago:
        > in production traffic only very few (<1%) ads are actually clickbait
        
        That's a fascinating claim, and it does not align with my anecdotal
        experience using the web for many years.
       
          woolion wrote 6 hours 1 min ago:
          In the last 6 months, I've had to buy a few things that 'normal
          people' tend to buy (a coffee machine, fuel, ...), for which we
          didn't already have trusted sellers, and so checked Google.
          
          For fuel, Google results were 90% scams; for coffee machines, closer
          to 75%.
          The scams are fairly elaborate: they clone some legitimate looking
          sites, then offer prices that are very competitive -- between 50% and
          75% of market prices -- that put them on top of SEO. It's only by
          looking in detail at contact information that there are some things
          that look off (one common thing is that they may encourage bank
          transfers since there's no buyer protection there, but it's not
          always the case).
          
          A 75% market rate is not a crazy "too good to be true" thing; it's in
          the realm of what a legitimate business can do, and with the prices
          of the items being in the 1000s, that means any hooked victim is a
          good catch.
          A particular example was a website copying the one for a massive
          discount appliance store chain in the Netherlands. 
          They had a close domain name, even though the website looked
          different, so any Google search linked it towards the legitimate
          business.
          
          You really have to apply a high level of scrutiny, or understand that
          Google is basically a scam registry.
       
            NooneAtAll3 wrote 2 hours 11 min ago:
            didn't the parent comment cite a sentence about clickbait?
            
            why did you change subject to scams?
       
              woolion wrote 59 min ago:
              Parent says it's an outlandish claim that they can reliably tell
              whether ads are clickbait.
              
              I believe that detecting whether an ad is clickbait is a similar
              problem -- not exactly the same, but it suffers from the same
              issues:
              
              - it's not well defined at all.
              
              - any heuristic is constantly gamed by bad actors
              
              - it requires a deeper, contextual analysis of the content that
              is served
              
              - content analysis requires a notion of what is reputable or
              reasonable
              
              If I take an LLM's definition of "clickbait", I get
              "sensationalized, misleading, or exaggerated headlines"; so scams
              would be a subset of it (it is misleading content that you need
              to click through). They do not provide their definition though.
              
              So you have Google products (both the Products search and the
              general search) that recommend scams at an incredible rate,
              where the stakes are much higher. Is it reasonable that they're
              able to solve the general problem? How can anyone verify such a
              claim, or trust it?
       
            jacquesm wrote 3 hours 57 min ago:
            Scammers can outbid real stores on the same products for the
            advertising space simply because they have much better margins. And
            google really doesn't care about whether it is a scammer that pays
            them or a legit business, they do zero due diligence on the targets
            of the advertising.
       
          aaron695 wrote 8 hours 33 min ago:
          > it does not align with my anecdotal experience
          
          Given that I'll often see the same fraudulent ad repeated, I think
          anecdotal experience suggests there are not many of them.
          
          I can even talk to friends about the most boring fraudulent ads and
          they know them. i.e. Elon doubling your bitcoin scams.
          
          For normal ads unless they are viral, there are millions out there
          that are never repeated or not even seen.
          
          Because fraud ads have short lifetimes, pulled out of 'production
          traffic', you can collect many for the training data.
          
          I assume 'clickbait' is the safety word for 'fraud'
       
          vFunct wrote 9 hours 35 min ago:
          That usually means you tend to visit trash sites. Higher quality
          sites have higher quality ads. In fact, for the highest quality
          media, people actually PAY for ads. See things like Vogue September
          issue or technical shopping magazines, which earn value for being 90%
          ads. People used to buy local newspapers because of the ads as well.
       
            andrewflnr wrote 7 hours 50 min ago:
            Specifically the September issue? Is that one special?
       
          andrewmcwatters wrote 10 hours 30 min ago:
          Ad company says ads are good, water is wet, news at 11.
       
          vajrabum wrote 11 hours 41 min ago:
          Not quite the same thing but some non-negligible percentage of ads I
          see on Facebook are outright scams which purport to be selling
          musical instruments at a 'markdown'. First guitars supposedly from
          the Sam Ash bankruptcy sales linking to an obvious fake site and more
          lately 'free' giveaways of high end Gibson acoustic guitars. When
          I've reported them I got the feedback that it didn't violate
          community standards, but my insta account got perma-banned when I
          posted the original of a song on youtube from 1928 on a thread which
          started with a cover from 30 years ago. That was considered spam.
       
            galaxyLogic wrote 11 hours 22 min ago:
            Smart scammers should know that people know that if something is
            too good to be true ("free Gibson", etc.), it is probably fake.
            But people keep clicking, for what it's worth.
       
              adgjlsfhk1 wrote 11 hours 5 min ago:
              it's the opposite. scammers want the people that are gullible
              enough to go for "free"
       
                throwaway1004 wrote 9 hours 37 min ago:
                This is a narrative I've heard many times, with very little
                evidence to back it up.
                An alternative and more accurate view is that, as the world
                came online, people became exposed to the very low-effort
                scams, representative of criminal elements from around the
                world, which befuddled most due to their child-like naivety.
                None of those confused individuals would fall for it but they
                require an explanation. Someone came up with a theory that it's
                actually a stroke of 4D genius and it stuck.
                
                edit: ok, I bothered to look this up: Microsoft had a guy do a
                study on Nigerian scams, the guys who wrote Freakonomics did a
                sequel referencing that study and drew absurd unfounded
                conclusions, which have been repeated over and over. Business
                as usual for the fig-leaf salesmen.
       
          ajb wrote 12 hours 28 min ago:
          I had that reaction as well, but consider: clickbait is such because
          it takes more work (emotional or logical) to reject it than an ad
          which is merely not relevant to you. Thus, your (and my) recall of
          ads is probably biased towards clickbait, and we overestimate its
          prevalence.
       
       
   DIR <- back to front page