        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   So you want to scrape like the big boys (2021)
       
       
        qweqwe14 wrote 7 hours 19 min ago:
        > Those companies employ ill-adjusted individuals that do nothing else
        than look for the most recent techniques to fingerprint browsers [...]
        When normal people are out drinking beers in the pub on Friday night,
        these individuals invent increasingly bizarre ways to fingerprint
        browsers and detect bots ;)
        
        What's the deal with "ill-adjusted" and "normal people"? I'm gonna say
        it right now, the reason why these individuals do this is because it's
        way more interesting and fun than building some bullshit React website
        for some boring business for the 20th time (this is just an example,
        not attacking React here, no need to freak out)
        
        It's fun because you get to solve an actual real-world challenge and
        find new ways to do something. Same with things like developing
        exploits. Those who do this are not "ill-adjusted", they are in fact
        normal people that do what they are passionate about.
        
        The whole mentality of "anyone who does something I don't like is
        ill-adjusted" is just absolutely insane.
       
          snowwrestler wrote 5 hours 59 min ago:
          That entire paragraph is a joke. That’s why there is a little wink
          at the end.
       
            qweqwe14 wrote 1 hour 39 min ago:
            It's not clear if it's a joke specifically because of the addition
            of "ill-adjusted"
       
              snowwrestler wrote 49 min ago:
              That is the joke. It’s hyperbole.
       
        lyu07282 wrote 10 hours 16 min ago:
        > Every website can access rotation and velocity data from Android data
        without asking for permission.
        
        What????!!! That's nuts
       
        dang wrote 23 hours 12 min ago:
        Discussed at the time:
        
        Scrape like the big boys - [1] - Nov 2021 (189 comments)
        
   URI  [1]: https://news.ycombinator.com/item?id=29117022
       
        graemep wrote 1 day ago:
        Anti bot stuff also seems to be a security threat and privacy threat:
        preventing users from accessing your site if using VMs, port scanning,
        various forms of fingerprinting
       
          Terr_ wrote 22 hours 37 min ago:
          I prefer the approach of an algorithmic challenge that forces the
          "new visitor" to spend some CPU cycles.
          
          It's a clear process, doesn't involve privacy risks or strange sneaky
          games, and tends to fail in ways that a human can at least see and
          report, as opposed to mysterious outages.
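
          As a rough illustration of such a challenge, here is a minimal
          hashcash-style sketch. It assumes the server hands out a random
          challenge and accepts any nonce whose SHA-256 hash starts with a
          given number of zero bits. The function names and the 16-bit
          difficulty are assumptions for illustration, not anything a
          particular site is known to use.

```python
import hashlib
import os
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # bit_length() is the position of the highest set bit in this byte.
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: bytes, difficulty: int) -> int:
    """Visitor side: brute-force a nonce (about 2**difficulty hashes)."""
    for nonce in count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the work."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


if __name__ == "__main__":
    challenge = os.urandom(16)    # hypothetical per-visitor challenge
    nonce = solve(challenge, 16)  # hypothetical difficulty
    print(nonce, verify(challenge, nonce, 16))
```

          Each extra bit of difficulty roughly doubles the visitor's
          expected work, while the server's check stays at one hash.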
       
            theamk wrote 19 hours 52 min ago:
            .. and also annoys people with slow hardware while costing very
            little to serious scrapers?
            
            How much CPU time can you burn so people on 3 year old phones can
            see it, and how much will it cost scrapers?
       
              graemep wrote 10 hours 13 min ago:
              Even a very slight challenge is a problem for scrapers: they have
              to do it far more frequently.
              
              It's better than captchas and whatever Cloudflare does in terms of
              overall nuisance.
       
        KieranMac wrote 1 day ago:
        I'm a lawyer that works in the web-scraping space, and I always chuckle
        when I read threads like this. Almost every company that we now
        consider a monopolist (or their affiliates) in the tech space used
        scraping as part of their process to build their business, and almost
        every one of those same monopolists now prohibits startups and
        competitors from scraping their data (which, invariably, is not
        actually "their" data in any sort of legally cognizable sense). And so
        perhaps the ethics of web scraping are not so straightforward. And
        neither are the legal issues associated with it.
        
        I wrote an article about that last fall that got some attention here.
        
   URI  [1]: https://news.ycombinator.com/item?id=37264676
       
          richardw wrote 12 hours 34 min ago:
          Same thing with Facebook and identity. IIRC they leveraged Google’s
          address book to get traction, but will go after you if you try to store
          FB social graph data long term for anything outside their garden.
          
          You try to block the tricks you used to get growth, basically.
       
          jMyles wrote 23 hours 49 min ago:
          > And so perhaps the ethics of web scraping are not so
          straightforward.
          
          It strikes me that the _ethics_ of web scraping are extremely
          straightforward and cognizable with a terse analysis:
          
          * You can respond however you like to my HTTP request, and I can
          parse your response however I like.
          
          Simple, traditional, common.  This is the way that conversations have
          occurred since the dawn of human communication, no?
          
          > the legal issues associated with it.
          
          But aren't these, without exception, fabrics spun out of the cloth
          that shields established players with the threat of state violence? 
          This is not particularly new, and seems to fit in the
          pathetic-and-predictable file.
          
          Moreover, the broader cheap attempt to cast this in "intellectual"
          property terms, and to attach that to protection of artists and
          creators, warrants a very particular eye-roll for its illogic.
       
            theamk wrote 20 hours 7 min ago:
            Do you apply these ethics to web scraping only, or to all other
            network communications too?
            
            Because if those are your general principles, you are making the
            internet much shittier. I still remember the old internet with open
            SMTP servers, easy-to-use comment forms, and forums which did not
            require emails and captchas. But people with "You can respond
            however you like to my HTTP request" attitude ruined it with spam,
            scam and SEO.
            
            If you only apply this to web scraping, then where do you draw the
            line and why? Can you scrape at the maximum rate the server can
            support? Can you scrape if this requires active action (like
            account creation)? As long as you scrape, can you also post some
            links to improve your SEO?
       
              jMyles wrote 17 hours 25 min ago:
              > Do you apply these ethics to web scraping only, or to all other
              network communications too?
              
              I mean... if you're keying in at 20MHz and blasting a gigawatt of
              noise, then yeah you've certainly run afoul of decency and just
              law. You're changing the physical shape of the network
              environment.
              
              But if the concern is just that we don't like the bytes to which
              your signal decodes, or we don't like what you're doing with the
              response we give you, then it seems more like a speech/press
              issue.
              
              The internet needs to grow resilience such that annoyances in the
              logical layers are easy to ignore if you have the will.  But that
              almost certainly means that you don't get to police what people
              do with the content you willingly hand over, pursuant to the
              protocol in use.
       
              cal85 wrote 18 hours 28 min ago:
              > But people with "You can respond however you like to my HTTP
              request" attitude ruined it with spam, scam and SEO.
              
              I don’t see how those things relate. They all have separate
              ethical issues. You can believe it’s ok to scrape whatever info
              you can find online at the same time as believing it’s not ok
              to scam people.
       
            elicksaur wrote 23 hours 24 min ago:
            If I say, “Hey, please don’t text me anymore. I’m going to
            block this number,” and you respond by buying 500 phones in five
            cities and text me nonstop, is that ethical?
       
              jMyles wrote 22 hours 56 min ago:
              It's your job to separate the wheat from the chaff at the
              boundary of your network interface. In fact, personal boundaries
              of all sorts, from informational to emotional to physical to
              economic, are of paramount importance in the information age.
              
              Nobody (and certainly not the state) is going to erect your
              personal boundaries for you by ensuring justice in the face of
              spammy text messages (or, for that matter, hypnotic and
              manipulative social media).  This is your job - maybe your most
              important job.
              
              Just as it's your job to protect your personal health and safety.
              Nobody (and certainly not the state) is going to do that for you.
              
              Is there something about the trajectory of evolution of the
              internet that suggests to you that this is incorrect?
              
              I observe continually (seemingly perpetually) increasing traffic,
              and continually (seemingly perpetually) increasing capacity for
              general purpose computing. I also observe enormous empathy and
              cyberpunk traditions in our communities, protecting each other.
              Do my eyes and ears deceive me?
       
                paulryanrogers wrote 20 hours 48 min ago:
                Restraining orders are a thing for a reason. It's cheaper to
                harass someone out of business (intentionally or otherwise)
                than to compete on a level playing field.
                
                Being a good neighbor requires restraining oneself and making
                requests with consideration for the other party.
                
                Full disclosure: I worked for a price monitoring service that
                prided itself on crawling up to every 3 hours. Steps were
                always taken to mitigate the impact. Sometimes even asking
                hosts to allow-list the crawlers.
       
                  jMyles wrote 17 hours 28 min ago:
                  > Restraining orders are a thing for a reason.
                  
                  Sure, but for the purposes of this conversation, saying "for
                  a reason" regarding a function which is presently delegated
                  to the state is fraught with all sorts of future-proofing
                  concerns.
                  
                  It seems to me that, as a baseline, we have to agree to
                  observe the apparent trend of the internet to supplant the
                  state - to resist its censorship and influence almost
                  entirely - as an indicator that our long-term thinking needs
                  to put those relatively few state functions which are
                  essential to a peaceful society (such as restraining orders)
                  in the purview of the internet... somehow.  Maybe that will
                  prove to be unnecessary, but in the case that the state
                  fades, we'll be happy we had the foresight.
                  
                  Internet traffic is barely (and arguably, already not) under
                  human control as it is.  And in another century, it will
                  almost certainly be impossible to tell the machines 'enhance
                  your calm or else'.  Or else what?
                  
                  I agree wholeheartedly about your qualities of good neighbor
                  roles.  But I don't think they extrapolate the way you think
                  they do.
                  
                  Consider this: at every moment, your house - your literal
                  dwelling - is bombarded with high-level, semantic radio
                  traffic, from way down where the messages bounce off the
                  ionosphere all the way up to 10GHz and beyond.  But this
                  doesn't bother you.  You ignore what you don't need!  You
                  draw boundaries and personally work on strengthening them -
                  with the help of your friends and neighbors.
                  
                  The internet needs help taking this shape at the application
                  layer (and really, at all layers).  And that part is up to
                  us.  We can't just throw our hands up and say " exists for
                  some reason, doesn't it?"
       
                    paulryanrogers wrote 16 hours 15 min ago:
                    The government is our tool for regulating society when self
                    regulation fails. It may be a blunt instrument and a last
                    resort. Yet there is a place for it. We cannot entirely
                    outsource all boundaries to individuals and private
                    institutions.
                    
                    I agree it would be ideal if the Internet could be as
                    opt-in and benign as you suggest. Though I'm not even sure
                    such an architecture is possible. How do you drive down the
                    cost of listening and filtering to near zero whilst still
                    allowing the desired signal?
                    
                    And even if it were possible, consider that we do rely on
                    governments to regulate the limited radio spectrum that we
                    all have to share. Otherwise it wouldn't be an option to
                    opt in to. The signal would be drowned out by whoever has
                    the strongest transmitters.
       
                      jMyles wrote 15 hours 58 min ago:
                      > The government is our tool for regulating society when
                      self regulation fails. It may be a blunt instrument and a
                      last resort. Yet there is a place for it. We cannot
                      entirely outsource all boundaries to individuals and
                      private institutions.
                      
                      I don't know who "our" refers to here, but if humans are
                      evolving into "the internet", or however you want to
                      think of this creature which is emerging over the course
                      of this century (and appears wont to accelerate over the
                      next few centuries), then I don't think the state is
                      "ours".  We can't just cover our eyes when presented with
                      the proclivity of the internet not to tolerate the state.
                      
                      > I agree it would be ideal if the Internet could be as
                      opt-in and benign as you suggest. Though I'm not even
                      sure such an architecture is possible. How do you drive
                      down the cost of listening and filtering to near zero
                      whilst still allowing the desired signal?
                      
                      Cryptography.
                      
                      > And even if it were possible, consider that we do rely
                      on governments to regulate the limited radio spectrum
                      that we all have to share. Otherwise it wouldn't be an
                      option to opt in to. The signal would be drowned out by
                      whoever has the strongest transmitters.
                      
                      ...really?  Do you really believe that the state is a
                      force for coordination and openness in radio?
                      
                      The only bands which reliably continue to have these
                      characteristics are the amateur bands, which have been
                      defended by users for decades against constant
                      encroachment by a state which, if it had its druthers,
                      would've sold these bands to AT&T a long time ago.
                      
                      My sense is that, if the government thought we weren't
                      watching, they'd simply cancel the amateur radio license
                      program.  It is people standing to be counted (by taking
                      the test) that keeps these bands viable _despite_ the
                      FCC, not the other way around.
       
              andai wrote 23 hours 7 min ago:
              Not sure the metaphor works here. For example most sites let
              Google scrape them as much as it likes, but go out of their way
              to block other robots. By doing so they are effectively forcing
              the whole world to use (or support, since smaller search engines
              have to piggyback on the big ones with special status, and pay
              them) proprietary spyware.
              
              In your analogy, most websites block everyone except the biggest
              pervert known to man.
       
                eli wrote 22 hours 41 min ago:
                Isn’t that a choice the website owner should be able to make?
       
                  jMyles wrote 22 hours 3 min ago:
                  Of course it's your choice to make.
                  
                  Is someone forcing you to respond to requests you'd prefer to
                  ignore?
       
                    theamk wrote 20 hours 4 min ago:
                    yes, people like OP who get the farms of scrapers.
                    
                    The website owners make their preferences clear with
                    robots.txt, IP blocks and other antibot technology.
                    Scrapers intentionally ignore owners' desires and force them
                    to respond.
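
                    For a sense of how such a robots.txt preference reads
                    to a crawler, here is a small sketch using Python's
                    standard urllib.robotparser. The robots.txt contents,
                    the "MyCrawler" user agent, the example.com URL and the
                    allowed() helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: welcome Googlebot, turn everyone else away.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "MyCrawler"):
        print(agent, allowed(ROBOTS_TXT, agent, "https://example.com/page"))
```

                    Nothing enforces the answer; a crawler that wants to
                    honor the owner's wishes has to run a check like this
                    itself before each request.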
       
                    paulryanrogers wrote 20 hours 51 min ago:
                    If crawlers are stealth DDoSing my site then I lose the
                    ability to respond entirely.
       
        sublinear wrote 1 day ago:
        > Those companies employ ill-adjusted individuals that do nothing else
        than look for the most recent techniques to fingerprint browsers ...
        When normal people are out drinking beers in the pub on Friday night,
        these individuals invent increasingly bizarre ways to fingerprint
        browsers and detect bots ;)
        
        Why not both on a Friday night?
       
        anyfactor wrote 1 day ago:
        I was a professional web scraper. I still keep up to date with the
        industry.
        
        These days, you do not make money by doing web scraping; you make money
        selling services to web scrapers. There are tons of web scraping SaaS
        and services out there, as well as dozens of residential proxy
        providers.
        
        Most anti-bot mechanisms evolve so quickly that you can make a decent
        income just by working in a traditional software engineering role
        dedicated entirely to engineering anti-anti-bot solutions. As these
        mechanisms evolve rapidly, working for a web scraping company is more
        stable than pursuing web scraping as a profession.
        
        Web scrapers get paid by projects, making it an unstable job in the
        long run. High-level web scraping requires operational investments in
        residential proxies and renting out servers. Additionally, low-end jobs
        pay very little. Brightdata is hosting a conference on web scraping,
        which should indicate the profitability of selling services in
        large-scale web scraping.
       
          shit_game wrote 13 hours 42 min ago:
          I've long thought that the use of residential proxies for things like
          scraping and operating large-scale bot networks is a necessity, but
          I've never really dabbled in using them, so I've never confirmed my
          suspicions about how residential proxies are used at a scale like
          this. Do you know if insecure IoT devices and malware-infected
          consumer hardware are as common as one might think for this? I can't
          imagine it would either be profitable or even possible to work with
          an ISP to acquire residential IPs, which kinda leaves me thinking
          that the only option for a residential proxy service would be pretty
          clandestine.
       
            jon-wood wrote 3 hours 46 min ago:
            If you just search for "residential proxy" you'll find a lot of
            them are basically Raspberry Pis or similar shipped to people who
            are then paid for the amount of traffic that goes through it.
            Others are agents running on users' computers; I suspect at least
            some of these proxy providers aren't overly thorough about due
            diligence on how that agent got installed.
       
          primax wrote 14 hours 32 min ago:
          Is there a conference you would suggest that is the closest to
          scraping, generally speaking? As far as I know there isn't a scraping
          conference or strong community anywhere, and I'd like to learn and
          improve my skills.
       
            anyfactor wrote 4 hours 22 min ago:
            The edge that every web scraper has is the knowledge they possess.
            In my opinion, conference presentations are usually too generalized
            or geared towards pitching services related to web scraping
            solutions.
            
            There are some communities you can find on Discord and Telegram, and
            most professional web scrapers are pretty active on LinkedIn and
            Twitter. The fun communities are in fact small groups of people
            with shared values and interests.
       
            jll29 wrote 8 hours 30 min ago:
            The scientific aspects (algorithms, incl. implementations,
            performance evaluation) of Web crawling (including focused
            crawling) are covered by conferences like WWW, ACM SIGIR, BCS ECIR,
            ACM WSDM and ACM CIKM.
            
            But you may be referring to informal meetups or trade fairs; if so,
            google
            "Web Data Extraction Summit", "OxyCon Web Scraping Conference",
            "ScrapeCon 2024" (all past) or the forthcoming:
            
   URI      [1]: https://www.ipxo.com/events/web-data-extraction-summit-202...
       
          wanderlust123 wrote 20 hours 14 min ago:
          How do you keep up with the industry?
       
            anyfactor wrote 4 hours 2 min ago:
            It is kind of like Fight Club. There are 2-3 good communities that
            I lurk in. The people won't walk you through your scraping
            problems, but if you ask the questions to the right person
            politely, they often help.
            
            Many residential proxy and scraping experts are pretty active on
            LinkedIn. But they do not talk about scraping data, just news
            around web scraping.
       
          jimz wrote 22 hours 15 min ago:
          The irony is that before I realized it was so easy I would just open
          source the code - not on Github, mind you, since the likes of Akamai
          would DMCA pretty quickly, but playing a little bit of jurisdictional
          arbitrage I put it on Gitee - the Chinese copycat of Github. I don't
          have a background in any of this, but companies like to brag and
          it's not hard to put two and two together. It also was a practical
          way to enable me to place wagers on sports automatically - which was
          more or less my actual day job - and was pretty good for learning
          programming quickly in your late 20s.
          
          Instead almost immediately I got inundated by sneaker botters in
          China and in English from somewhere that doesn't use it as a native
          language, judging from the idiosyncratic use. I kept the code up for
          a bit but took it down not because of any legal threats (good luck
          with DMCA-ing a platform endorsed by the CCP; even though I have no
          love for the party, I also find the American attitude that places
          intellectual property over real property in practice - from my
          experience as a defense attorney - to be just as screwed up in terms
          of priorities, just a matter of degrees). What made me take it down
          was the fact that I did not want to work in a customer service job or
          really for anyone, and judging by the requests, it mostly
          consisted of "you do the work but we'll split the profits", which I
          can't believe anyone would fall for.
          
          But since the internet is forever, some parts of code that
          specifically worked to emulate Cyberfed-Akamai from 0.8 to 2.3 are
          probably still floating around. My bad. I don't wear shoes normally -
          flip flops or nothing after having to wear a suit to work for a
          decade - and have no idea beyond what happens in NBA2K. Although
          cybersecurity firms making products that can be beaten by someone who
          learned how to program in their mid 20s and had working code online
          within 3 years should be pretty ashamed of how much they charge,
          considering
          that I haven't even taken a math course since 11th grade and had too
          much of an ADHD problem to watch videos or even read more than blog
          posts or documentation. Everything I learned, I learned by copying
          from Github and similar services until it worked. There must be a lot
          of snake oil being sold out there, maybe most of it, since the
          insidiousness of the whole thing is that selling bunk solutions
          seldom gets you in trouble anyway, while actual crime - rape, murder,
          robbery and the like - are largely lagging because the police simply
          prefer to complain about culture war bs instead of actually, you
          know, do their jobs. Who knew Judith Butler was THIS spot on.
       
            anyfactor wrote 3 hours 51 min ago:
            Thank you very much for sharing your story. From what I know these
            days, sneaker bots as an industry have pretty much gone downhill.
            Not because of anti-bot measures, but because the entire industry
            has essentially shifted from retail stores to eBay resellers.
            Everyone is competing to buy the first batch to the point that it
            is not worth building a sneaker bot anymore.
       
          RockRobotRock wrote 22 hours 51 min ago:
          I've been writing scrapers on Upwork for many years. I'm sick of
          doing project based work and want to work at/start a scraping SaaS.
          Any advice?
       
            anyfactor wrote 4 hours 7 min ago:
            I would recommend checking Google to see if you can find any job
            openings. Please remember that it is a niche industry, so there may
            not be many companies currently hiring. But honestly, if you are
            looking to make a full-time living, consider choosing another niche
            as web scraping jobs require you to consistently stay on top of
            your game. Most full-time jobs involve scraping data from big tech
            companies, and you are on your own to find solutions in bypassing
            anti-bot measures.
       
        gwittel wrote 1 day ago:
        I’m really mixed on this. Anti bot stuff is increasingly a pain point
        for security research.  Working in this space, I have to work against
        these systems.
        
        Threat actors use Cloudflare and other services to gate their payloads.
        That’s a problem for our customers who are trying to find/detect
        things like brand impersonation and credential phish. Cloudflare has
        been completely unhelpful.  They just don’t care.
       
          heipei wrote 1 day ago:
          Seconding this. Evading detection has become a real cake-walk since
          threat actors are able to sign up for a free Cloudflare account and
          then put their phishing site on their 2-hour-old domain behind a
          level of protection backed by a $20B company. Funny that you almost
          never see phishing on Akamai ;)
          
          Disclaimer: We operate in this space so we obviously have an interest
          in being able to detect these threats going forward.
       
            spacebanana7 wrote 12 hours 11 min ago:
            Other than being the cheapest & easiest to use, is Cloudflare doing
            a particular evil here?
            
            As a webmaster I don’t want non-user traffic except search
            engines. It’s a waste of money and often entails security,
            privacy and commercial risk.
            
            Without Cloudflare I’d achieve only slightly less effective
            results using an AWS WAF, another CDN, or hand rolling solutions
            out of ipinfo etc.
       
              gwittel wrote 4 hours 20 min ago:
              They could police their content. Or if they don’t want to, they
              could meaningfully partner with the security industry - create a
              “security bots” program, respond to takedown requests in days
              not months, etc.
       
                spacebanana7 wrote 1 hour 56 min ago:
                I suppose that Cloudflare scanning payloads for known malware
                could potentially be effective if they could make the
                performance work.
                
                Closed partnerships programs are a bit concerning though. Once
                they’re up and running there’s an enormous economic
                incentive for CF to squeeze members with fees that capture the
                economic upside.
       
            KTibow wrote 17 hours 57 min ago:
            I think you can get a bot allowed by all of Cloudflare at [1]. The
            blog post I read didn't make it clear if it would apply to all of
            Cloudflare or just customer sites though.
            
   URI      [1]: https://docs.google.com/forms/d/e/1FAIpQLSdqYNuULEypMnp4i5...
       
              gwittel wrote 17 hours 19 min ago:
              You can. Sort of.  The good bots list is basically driven by a
              fixed user agent.  And customers can set their preference to not
              allow “good bots”.
              
              Not so good for security work.
              
              It’s similar to their abuse reporting.  They give your info to
              the site owner.  Gee thanks, that’s just what I want to do.
       
            throwaway48476 wrote 21 hours 43 min ago:
            Cloudflare is the ultimate example of creating the problem and
            selling the solution.
       
              zinglersen wrote 20 hours 45 min ago:
              I was under the (naive?) impression that Cloudflare was a SaaS
              startup poster child. Do you mind expanding on your comment?
       
                throwaway48476 wrote 20 hours 41 min ago:
                Among other things, Cloudflare hosts DoS services while selling
                DoS protection.
       
            rashkov wrote 1 day ago:
            Why not Akamai?
       
              nkozyra wrote 1 day ago:
              Cost.
       
        mellosouls wrote 1 day ago:
        A curious title.
        
        "So you want to scrape like the unethical boys?" I guess doesn't scan
        so well. Bad boys maybe?
        
        I'm pretty sure Internet Archive, etc. don't in fact misrepresent what
        they are to crawl websites...
       
          CoastalCoder wrote 1 day ago:
          > "So you want to scrape like the unethical boys?"
          
          What's considered ethical is a very debated topic.
          
          An assertion that something is simply "unethical" should be seen as
          the starting point of a discussion, not as a self-evident fact.
       
            marginalia_nu wrote 1 day ago:
            If someone tells you to go away via the robots exclusion standard,
            and puts up bot mitigation to prevent you, blocks your IPs, etc.
            then clearly you do not have their consent to help yourself to the
            data.
            
            I find it really hard to see how you could twist ignoring this
            clear lack of consent, and going to great lengths to circumvent
            what was clearly put into place to prevent you from doing the very
            thing you are doing, into an ethical action.
            
            It may or may not be technically illegal to do, but that is not a
            statement about what is ethical.
       
              struant wrote 17 hours 2 min ago:
              Why do you think the consent of someone relaying information
              matters in the slightest when it comes to what people do with
              that information?
       
              duggan wrote 1 day ago:
              Ok, you’re building a service that scrapes e.g., property
              rental websites to find entries that are trying to scam naive
              renters.
              
              The property websites are incompetent to solve the problem, or
              don’t care, but either way they sure don’t want you scraping
              their valuable data.
              
              Is it still unethical?
       
                marginalia_nu wrote 1 day ago:
                That just makes both of you wrong.
       
                  blantonl wrote 1 day ago:
                  Agreed.  It's kind of like when a non-profit organization
                  argues that they are entitled to someone's data because
                  "we're not making a profit off of it."  That's ridiculous.
                  
                  Try asking a startup for free software licenses or seats or
                  whatever as a non-profit. "We're entitled to 40 seats of your
                  SaaS solution because we're a non-profit working to solve
                  world peace."  It's definitely within the startup's purview
                  to respond with a no.
       
              lambdaba wrote 1 day ago:
              Surely the ethics are more complicated than just following
              robots.txt or not. The intended usage counts, and that isn't
              captured in robots.txt.
       
                marginalia_nu wrote 1 day ago:
                If you have a noble intent, you ask the webmaster for
                permission to use the data.  Surely if they agree with your
                assessment that your intent is indeed noble, then you'll be
                given consent.
                
                I run a search engine and an internet crawler.  I do this all
                the time.  To this date I've never had a webmaster that didn't
                permit my crawler access when I've asked nicely.
       
                  CaptainFever wrote 12 hours 5 min ago:
                  > If you have a noble intent, you ask the webmaster for
                  permission to use the data.
                  
                  Is Marginalia opt in, then? Surely "not having a robots.txt"
                  ("you didn't say no!") does not equal consent. And surely you
                  could just ask all the webmasters you are scraping from for
                  permission, since you have noble intent.
                  
                  My point is that this is just hypocritical; you are placing
                  the moral boundary right below what you are doing, while
                  claiming moral superiority. If you ask others (e.g.
                  anti-search Fediverse), they would think you are immoral too.
       
                  greenbandit wrote 23 hours 44 min ago:
                  This might work in cases where those with the data are
                  engaged in noble acts, but not every actor is.
                  
                  I scrape and process websites of actors engaged in fraud. I
                  do this to make the data more presentable to the proper
                  authorities and to help uncover further evidence of their
                  activities.
                  
                  I suspect that asking for consent would be quickly denied and
                  the data/evidence would quickly become inaccessible.
       
                  bryanrasmussen wrote 1 day ago:
                  If you have a noble intent - identify members of fascist
                  organizations - then obviously when you ask the top online
                  fascist sites if you may scrape them to build up your list of
                  online fascists - they will say no.
                  
                  OK, less provocative: you have a new algorithm to identify
                  inaccessible websites, your automation is scary good,
                  crawling a site you can identify many issues that most sites
                  would have to pay for a full audit to get, but now these
                  sites have problems - if you can identify their sites as
                  being inaccessible then they have to fix these problems due
                  to various accessibility standards that apply in the regions
                  they operate in. But if they don't allow you access then they
                  can maybe make an argument they are accessible due to an audit
                  they did last year, at any rate they don't want to be forced
                  to spend money on accessibility issues right now which it
                  sounds like they might have to if they let you crawl their
                  site.
                  
                  Version 2 of above, some years ago I spoke about a job with a
                  big time magazine publisher in Denmark and said one of the
                  things that would make me a good employee is my knowledge of
                  accessibility and their chief of development said they didn't
                  have anyone with disabilities that used their site - so if I
                  ask that guy to crawl their site why say yes? They have no
                  users that would benefit!! Stop abusing our bandwidth
                  bleeding heart guy.
       
                    marginalia_nu wrote 1 day ago:
                    All of these seem like variations of
                    the-ends-justify-the-means, which generally tends to cut
                    both ways in unanticipated ways.
                    
                    Bullying websites into accessibility compliance will most
                    likely lead to them following the letter of the standard
                    without giving a second of thought as to whether the
                    content is in fact actually accessible.  It's very
                    difficult to get someone on board with your cause if your
                    initial contact is an antagonistic one.
       
          vasco wrote 1 day ago:
          Tell that to the most used website in the world, which is basically a
          scraping-and-sorting machine.
       
            blantonl wrote 1 day ago:
            I can commit a code change in 2 seconds that would directly tell
            the most used website in the world to stop scraping and sorting my
            data, and they would honor it and that would be the end of that.
            
            I'm under no illusions about whether they would honor that in
            the future, but that's the state today.
       
          echelon wrote 1 day ago:
          > unethical
          
          Using and transforming information in useful ways is unethical if it
          results in a profit?
          
          That's what our brains do, too.
       
            llamaimperative wrote 1 day ago:
            No, destroying incentives to produce and share information is
            unethical (and more importantly, self-defeating).
            
            Brains that consume information don’t destroy that incentive,
            they produce it.
            
            Intermediating that and capturing all of the value for yourself is
            the unethical part, just like all forms of rent-seeking.
       
              echelon wrote 1 day ago:
              > destroying incentives
              
              Internet usage and content creation are increasing, not
              decreasing.
              
              I continue to publish comments, code, and images that presumably
              get used to train models. My incentive hasn't been destroyed.
              
              > rent-seeking
              
              Supply and demand set the prices.
              
              Subscription services provide value and continue to invest in
              their product, catalog, and/or service. Property owners handle
              asset ownership and upkeep problems at scale.
              
              Inefficiencies will be met with competition, and businesses not
              providing value will be out-competed.
              
              Data under-availability is an inefficiency holding us back from
              bigger and better things.
       
        blantonl wrote 1 day ago:
        This tends to be a very unpopular opinion around here, but in almost
        all cases I find Internet scraping to be unethical and downright
        malicious.  I'm not saying all cases, but I'm saying almost.
        
        A lot of the actors involved tend to be hustle culture types who think
        they are OWED your data, regardless of the ethics, laws, being a good
        citizen, whatever. They will blatantly disregard terms of service and
        hide behind massive setups such as these to circumvent protection etc.
        
        And the problem is, if you run any sort of business or service that is
        data oriented, there will be thousands of people that will do this,
        which will cause you to devote enormous amounts of time, effort, money,
        and infrastructure just to mitigate the issues involved with data
        scraping. That's before you are even addressing whether or not these
        people are "stealing" your data.  People who feel they are entitled to
        the crux of your business aren't bothered by being nice in the way they
        take it - they'll launch services that will cripple infrastructure.
        
        Whenever I deal with a scraping process that decides it wants my entire
        business, and it wants all of it RIGHT NOW, or in 5 minutes, I want to
        find the person and sit them down in a room and tell them "hey, develop
        your own ideas and business.  Ok?  Thanks"
        
        And if you think this was a problem before, it has gotten exponentially
        worse over the past few months with every Tom, Susan, and Harry deciding
        they must have all your data to train their new LLM AI model.  By the
        thousands.
       
          juunpp wrote 19 hours 3 min ago:
          > but in almost all cases I find Internet scraping to be unethical
          and downright malicious.
          
          The Web (you said the "Internet", but you meant the Web) was not
          envisioned to be a commercial space. Your statement is antithetical
          to the original idea of the open Web. It's when the MBAs joined the
          party circa 2k and decided to profit out of it that all of these
          confused and wrong opinions about what the Web should be arose and
          that led to the situation today. Your statement is a vast display of
          zero historical context. MBAs are obviously not very concerned with
          history. They just want to protect their own little turd for their
          own little profit and vanity, which is why they now put it behind a
          paywall, JS, and anti-bot proxies.
       
          dale_glass wrote 1 day ago:
          > Whenever I deal with a scraping process that decides it wants my
          entire business, and it wants all of it RIGHT NOW, or in 5 minutes, I
          want to find the person and sit them down in a room and tell them
          "hey, develop your own ideas and business. Ok? Thanks"
          
          That's a lot of righteous anger for somebody building a business on
          top of other people's data.
          
          "Broadcastify is the worlds largest source of public safety,
          aircraft, rail, and marine radio live audio streams."
          
          I have no sympathy whatsoever. You're just complaining about the very
          thing you're doing. If it's fair for you to do that, it's fair for
          others to do it to you.
       
            blantonl wrote 1 day ago:
            They volunteer to provide the data to us.  Every single last one of
            them.  Nowhere in our business model did we make the conscious
            decision to say "hey, look at that business, they have something,
            and I'm going to take it."
       
              schlipity wrote 19 hours 52 min ago:
              Aren't you also volunteering your data?  Don't browsers just talk
              to your webserver and say "Hey, what do you have?" and your site
              responds in kind.
       
              bsuvc wrote 1 day ago:
              Reading public website data is not "taking it". It is still
              there.
              
              Observing publicly available information is not theft, nor is it
              illegal.
              
              Of course copyright rules apply, but that is for if you reproduce
              something.
       
                blantonl wrote 19 hours 54 min ago:
                > reproduce something
                
                No one is developing a 5 server cluster with 200+ 4g modems to
                observe publicly available information.  They are using said
                cluster to deliberately work around blocks, rate limits, and
                restrictions on scrapers who are scraping content solely to
                reproduce the data and use it for commercial purposes (make
                money)
       
          hipadev23 wrote 1 day ago:
          I find it aptly hilarious that your own business model at
          broadcastify.com is recording publicly accessible radio broadcasts
          and then selling access to those recordings for commercial gain.
       
            blantonl wrote 1 day ago:
            Why is that hilarious?  We developed an entire community,
            infrastructure, system, architecture, everything, from scratch, and
            provide access to something that never existed in the first place
            on the Internet.  That's a significant key difference here.
            
            This would be analogous to you thinking ancestry.com is "aptly
            hilarious" for arguing against someone just scraping their site for
            content.
            
            What makes you think you should be entitled to drive by the very
            unique house that we built, point right at that house, and say
            "I think I'll take all of that for myself!"?
       
              rmbyrro wrote 21 hours 14 min ago:
              Why is it ethical if you build upon other people's data, but
              unethical if others do it?
              
              Nobody cares how valuable you think your service is. Who's the
              judge of what's entitled to scrape or not? If you think you're
              the judge, I find it somewhat arrogant.
              
              It is even more hilarious that you defend a position that, to me,
              looks authoritarian and individualistic. Might not be your
              intention, but it's what I read.
       
                blantonl wrote 20 hours 29 min ago:
                > Why is it ethical if you build upon other people's data, but
                unethical if others do it?
                
                Because they GAVE IT TO ME, that's why.
                
                > Who's the judge of what's entitled to scrape or not? If you
                think you're the judge, I find it somewhat arrogant.
                
                You find it arrogant that I want to protect my business
                interests from people who solely want to just "take" from the
                hard work my team has put together.  Would you be arrogant if
                you built a platform over 20+ years, and then scrapers just
                took the data for themselves?
                
                > ...looks authoritarian and individualistic.
                
                These assertions are ridiculous. LOL. Hyperbole at its finest.
       
                  rmbyrro wrote 7 hours 50 min ago:
                  Look, when you publicize information that is not a human
                  creation or art, you are GIVING IT TO THE PUBLIC.
                  
                  The Berne Convention intentionally left sui generis database
                  rights outside the scope of copyright. Only in the European
                  Union do you have the kind of protection you're looking for.
                  And even in the EU, I've never come across a case where the
                  law was enforced in courts. Maybe because it's a ridiculous
                  right, in my opinion, that would make information flow
                  dysfunctional in society.
       
                  anigbrowl wrote 19 hours 53 min ago:
                  They gave you a right to resell their broadcast content?
       
                    theamk wrote 19 hours 42 min ago:
                    yes, US government did
                    
                    Please read other thread replies.
       
                  philipwhiuk wrote 20 hours 5 min ago:
                  > Because they GAVE IT TO ME, that's why.
                  
                  You gave it them when they visited you.
       
              hipadev23 wrote 1 day ago:
              Because you fail to see the very obvious parallels to scraping.
              I’m not criticizing your business (I think you provide a
              valuable service) but your hypocritical stance on what forms of
              publicly available information are allowed to be gathered and
              repackaged.
              
              Google’s original (and OpenAI’s) business model was also
              building a scraping infrastructure, system, and architecture,
              from scratch — and providing access to something that never
              existed in the first place.
       
                blantonl wrote 1 day ago:
                It's completely perpendicular, not parallel.
                
                Public safety communications are radio waves that are
                broadcasted and the ability to passively monitor them is
                enshrined in United States law.  That is a massively key
                difference.
                
                If I was sending data into your home from my infrastructure
                without any action from you whatsoever, and you were reaching
                up into the air and gathering it and repackaging it, AND the
                law said that I have no intellectual property rights to said
                data, then that's a whole different story.
       
                  CaptainFever wrote 12 hours 10 min ago:
                  You realise web scraping is a legal right too?
       
                  edgyquant wrote 23 hours 27 min ago:
                  You are scraping radio signals and selling it.  It’s an
                  exact parallel and if you fail to see this it is indeed
                  hilarious.
       
                    blantonl wrote 20 hours 43 min ago:
                    If you don't understand the difference between intercepting
                    radio signals and Web scraping, I'd say your understanding
                    of physics and technology is pretty hilarious.
                    
                    Look around in your house dude, there are radio signals
                    present in your house right now as we speak - you just
                    can't see them - the data literally exists right in your
                    home without you even having to do anything.  And the law
                    grants you the unequivocal right in the United States to
                    intercept those radio signals.
       
                      LouisSayers wrote 11 hours 59 min ago:
                      The analogy here is that a website that is connected to
                      the internet is considered "free to browse" just as a
                      radio signal is "free to listen to".
                      
                      The issue isn't listening or browsing (so long as it's
                      not DoS-ing), it's what you do with that information and
                      whether you have permission to use the information
                      (copyright of the host / broadcaster) in the way that you
                      are and in the way that was intended.
       
                      papichulo2023 wrote 20 hours 9 min ago:
                      So your only point is that scraping data is bad because of
                      the cost? How do you know that the site someone is scraping
                      doesn't have a fixed cost?
       
                        theamk wrote 19 hours 45 min ago:
                        no, scraping data is bad because this is against owners'
                        wishes.
                        
                        In the US, if you broadcast, then by law you consent to be
                        received and recorded.
                        
                        If you scrape data, there is no such law. And if you
                        get consent (say by finding the permissive robots.txt),
                        then go ahead and scrape.
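                        As a rough illustration of the robots.txt check
                        mentioned above, here is a minimal sketch using
                        Python's standard urllib.robotparser. The rules, the
                        bot names ("ExampleBot", "OtherBot") and the
                        example.com URLs are made up for the example; a real
                        crawler would fetch the site's actual /robots.txt
                        first, and a permissive robots.txt says nothing about
                        a site's terms of service.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: one named bot is told to stay out entirely,
# everyone else is only kept out of /private/.
EXAMPLE_ROBOTS = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/listings"),
        ("OtherBot", "https://example.com/listings"),
        ("OtherBot", "https://example.com/private/report"),
    ]:
        verdict = "allowed" if may_fetch(EXAMPLE_ROBOTS, agent, url) else "disallowed"
        print(f"{agent} -> {url}: {verdict}")
```

                        The check only works if the crawler identifies itself
                        with a stable, honest user agent, which is exactly
                        what the setup in the article avoids doing.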
       
                          sangnoir wrote 17 hours 29 min ago:
                          > no, scraping data is bad because this is against
                          owners' wishes
                          
                          The broadcasters weren't happy about home cassette
                          recording either, and the case went all the way up to
                          the Supreme Court. If I can legally record cable,
                          then it's not a stretch to say I can also "record"
                          what's on the public Internet for my own use.
                          
                          Morally speaking, we have to consider the other side
                          of the equation - the operator may not be happy about
                          being scraped, but as a user, is it okay for me to
                          build or use a scraper-based price-comparison or
                          price-tracking platform? I'd say yes, even though
                          most sellers wouldn't want to have this data scraped.
       
                            theamk wrote 14 hours 27 min ago:
                            I see a difference between "scrape for personal
                            use", "scrape for public good" and "scrape to earn
                            money from".
                            
                            Everything is fine for personal use - you are
                            choosing how to consume the websites, and if you
                            choose to do it by extracting all the data into
                            tables, that's fine.
                            
                            Public good scraping is slightly murkier morally
                            but I guess it's also fine? Similar to "fair use"
                            copyright exceptions. (Unless it's commercial
                            companies pretending to do "public good" solely for
                            their own benefits, like AI "open dataset". Those
                            should be banned.)
                            
                            "Scrape to earn money from" is not OK. And sadly,
                            this seems to be the majority of all scraping
                            projects, such as: copy the sites wholesale and
                            display your own ads on them, collect data to train
                            AI on, for SEO (=make everyone's search results
                            worse).
                            
                            A good analogy is what you would do in a public
                            place like a cafe: can you do your personal work?
                            No problem at all. Can you put a non-commercial
                            poster or sign? This may be OK. Can you earn money
                            off it (say sell your own stuff inside)? No way.
       
                    throwaway48476 wrote 21 hours 39 min ago:
                    It is difficult to get a man to understand something, when
                    his salary depends on his not understanding it.
       
                  zarzavat wrote 1 day ago:
                  Every time you use Google you benefit from scraping. Scraping
                  has been how the world works for the last 25+ years.
                  
                  You are trying to draw a distinction between data that is
                  pushed and data that is pulled, and maybe there is some
                  economic argument there in terms of resource usage, but that
                  is very context-dependent.
                  
                  In the UK listening to public radio broadcasts is illegal. I
                  think this law is idiotic and ignore it. It seems you do too
                  since there appear to be streams from the UK on your site :)
       
                    theamk wrote 19 hours 43 min ago:
                    Google benefits from legal scraping - ban them from
                    robots.txt and they'll stop.
                    
                    Please don't mix consensual and non-consensual scraping,
                    the difference is huge.
       
          tengbretson wrote 1 day ago:
          Is it unethical for a mouse to eat the cheese without triggering the
          trap?
       
          greenbandit wrote 1 day ago:
          I use web scraping to identify and monitor fraud.
          
          Exhibit A: [1] This website is used to recruit people to set up "lead
          generation" Google Business Profiles and leave paid reviews.
          
          Exhibit B: [2] This is an example of the Craigslist ad used to
          initially attract people to the website above.
          
          Exhibit C: [3] This is one of the Google Maps contributors which left
          paid reviews.
          
          If you start with the reviews on that profile, you'll find a network
          of Google Business Profiles for fake service-area businesses
          connected through paid reviews.
          
          Web scraping allows me to collect this type of data at scale.
          
          I also use scraping to monitor the status of fake listings. If they
          are removed, the actor behind them will often get them reinstated.
          This allows me to report them again.
          
   URI    [1]: https://archive.ph/0ZUA8
   URI    [2]: https://archive.ph/WWZuw
   URI    [3]: https://archive.ph/wip/7Xig4
       
            blantonl wrote 1 day ago:
            I don't care if you use Web scraping to solve the Israeli /
            Palestinian conflict.  You're not entitled to anyone's data,
            computers, services, etc because you've decided for altruistic
            reasons that it is appropriate.
            
            Cool use case. Love it. Fascinating stuff. But if Google told you
            to stop, would you?  Or would you instead decide to build a 5
            server cluster of 200 4G modems spread across continents to
            continue your work?  Because if you did I would assume that you've
            decided to move on from a cute little altruistic process into a
            commercial use of someone else's data to make a profit.
       
              jumby wrote 1 day ago:
              Wait - so you are saying that information on the public internet
              isn’t public? Man, I wish people would remember the origin of
              the web and the entire reason it exists. If you don’t want
              information public, protect it - otherwise, I say it’s fair
              game.
       
                blantonl wrote 1 day ago:
                Remember the OP article is about a system that is designed to
                completely and directly circumvent protections.
                
                If an organization puts a series of processes in place to
                prevent scrapers from wholesale taking data in violation of
                terms of service, and you develop a 5 server cluster of 200x 4G
                modems it's no longer "fair game" and you're directly being
                unethical in your use of someone else's services.
       
                  Spivak wrote 23 hours 34 min ago:
                  Yeah, I think it's fair to say that in the presence of
                  anti-bot measures (whether they work or not) that the content
                  on the website isn't public anymore.
                  
                  Available to someone meeting certain criteria (student
                  discount, senior discount) doesn't mean available to anyone.
                  I see no reason that "not available to be consumed by
                  autonomous agents" is somehow invalid in a way that unlimited
                  refills is only available to humans and not robots.
       
              dmkii wrote 1 day ago:
              I agree that there is a line at using someone else’s data to
              make a profit, but it is kind of ironic that you mention Google,
              because their exact business model is scraping websites to feed
              their search results and litter it with ads to make a profit. For
              me there is a big line between aggregating publicly available
              data (search results, reviews, news, job postings, etc. ) and
              intentionally violating terms of service like signing up for fake
              accounts and harvesting user data. So entitled maybe not (sites
              can try to prevent you from scraping), but if you make something
              publicly available you shouldn’t be surprised when people use
              it in ways you may not originally have intended (within legal
              boundaries of course).
       
              ansc wrote 1 day ago:
              >I don't care if you use Web scraping to solve the Israeli /
              Palestinian conflict.
              
              Maybe you should though. It's always worth it to think about
              which giant's shoulder you're standing on. It's giants all the
              way down.
       
              greenbandit wrote 1 day ago:
              > cute little altruistic process
              
              Maybe it is not the opinion which is unpopular, but the way it is
              being presented.
       
          malwrar wrote 1 day ago:
          If your business is just that you have a bundle of information and
          expose it over an open website, I’m not really sure how you’re
          able to maintain a mentality that you are somehow entitled to
          ownership of that information. You already put it out there, it’s
          now public, any illusion of exclusivity is now gone because anyone
          could come along at any time and make a copy without your knowledge.
          A moral position on this issue is even more confusing to me. Do you
          think that you e.g. own the knowledge on which radio frequencies are
          used where? Do you think you have a moral claim on ownership of
          (presumably unpaid) user-submitted information? I think the only
          legitimate moral grievance you have is high traffic volumes from
          inconsiderate scrapers.
       
            blantonl wrote 1 day ago:
            > Do you think you have a moral claim on ownership of (presumably
            > unpaid) user-submitted information?
            
            You damn right I do.  I own, develop, and maintain the entire
            system that enabled the body of works to exist in the first place.
            
            Do you think that you have a claim on ownership of the data because
            you drove by, saw what you liked, and decided that now you'll just
            rip the baton out of my hand?
       
              camgunz wrote 20 hours 24 min ago:
              I think your basic arguments are either:
              
              - scraping is immoral
              
              - we should bake DRM into the internet
              
              There's no technical or legal difference between a scraping
              request and a web request, and I can't really believe that you
              think that
              non-scraping web requests are immoral, so I think that probably
              isn't your argument.
              
              Moving onto DRM, I think most people don't want it baked into the
              internet. I think individual entities can choose to use it if
              they want--that's basically how you protect against scraping, so
              I think people irritated by having their content copied and thus
              devalued (or their ads replaced) should probably just do that.
       
              jMyles wrote 23 hours 46 min ago:
              > Do you think that you have a claim on ownership of the data
              because you drove by, saw what you liked, and decided that now
              you'll just rip the baton out of my hand?
              
              Are you just trolling at this point?
              
              _You are handing the baton over_ in an HTTP response.  If you
              don't want to do that, then change the logic of your server.
              
              Good grief man.
       
                what wrote 18 hours 35 min ago:
                Then any store is handing over the baton because you can walk
                in, take merchandise off the shelf, and walk out.
       
                  jMyles wrote 17 hours 41 min ago:
                  That's not at all what's happening here.  This is me walking
                  in, with a polite and well-formed request, regarding a piece
                  of merchandise: "May I have ?"
                  
                  And the store, clearly and with a signed receipt, saying,
                  "Here is the item you requested. Have a nice day."
       
              malwrar wrote 1 day ago:
              > You damn right I do. I own, develop, and maintain the entire
              system that enabled the body of works to exist in the first
              place.
              
              I don’t think that meets the bar. Running a website is
              absolutely not equivalent to the collective effort people put in
              to populate that website with the information that actually gives
              the overall artifact its value. There is a large history of
              outrage when similar information repository websites with
              user-generated content violate expectations of openness.
              Nevermind the fact that the actual information itself isn’t
              even private or proprietary, just obscure and distributed.
              
              > Do you think that you have a claim on ownership of the data
              because you drove by, saw what you liked, and decided that now
              you'll just rip the baton out of my hand?
              
              I wouldn’t claim ownership nor want to, when I scrape stuff I
              usually just want information in a different format. I’m
              confused as to how you think you can even “own” data to begin
              with. Suppose that your users uploaded songs instead of RF info,
              do you believe you own their music solely because they chose to
              share it on your site? Do you think your users would believe
              that?
       
                blantonl wrote 1 day ago:
                > I’m confused as to how you think you can even “own” data
                > to begin with.
                
                It's actually very simple.  If I'm in a position to restrict
                access to the data, then I own it, unless there is some legal
                authority that has jurisdiction over me that says I must make
                it available to the public.
       
                  alexey-salmin wrote 17 hours 26 min ago:
                  > It's actually very simple. If I'm in a position to restrict
                  access to the data, then I own it, unless there is some legal
                  authority that has jurisdiction over me that says I must make
                  it available to the public.
                  
                  So, in which jurisdiction are you? Because in the US, courts have
                  confirmed multiple times that scraping public websites is
                  legal.
                  
   URI            [1]: https://techcrunch.com/2022/04/18/web-scraping-legal...
       
                  icehawk wrote 1 day ago:
                  Given that you haven't fixed your problem with scrapers
                  (given the complaints you're making right in this thread),
                  it's obvious you're not in a position to restrict the data;
                  otherwise you'd not be complaining about scrapers, and thus
                  you don't own it.
       
                    what wrote 18 hours 46 min ago:
                    Considering Walgreens is still fighting shoplifters, it’s
                    obvious they’re not in a position to restrict their
                    merchandise. They must not own it.
       
                      icehawk wrote 16 hours 40 min ago:
                      I'm glad you agree with my point that Walgreens owns
                      their merchandise not because they stop shoplifters and
                      restrict access, it's because they purchased it and have
                      title over it, and since GP has done no such thing they
                      don't actually own it.
       
                      alexey-salmin wrote 17 hours 19 min ago:
                      Well, exactly. blantonl claims that his ownership rights
                      are based on his ability to restrict access to things
                      which is not a mainstream view.
                      
                      Your example illustrates this nicely. Walgreens owns the
                      goods on their shelves regardless of shoplifters.
       
                  malwrar wrote 1 day ago:
                  Operating a website doesn't automatically put you in that
                  position, as evidenced by the fact that scraping does not
                  require your consent to be possible. Ultimately there's
                  little practical difference between someone's eyes viewing
                  information and a program viewing that same information, a
                  copy has been made in some form. Scraping a new site takes
                  maybe a few hours of python to accomplish, the barrier is
                  low.
       
                    blantonl wrote 1 day ago:
                    I don't think you understand.  If I decide as the owner of
                    a site, that I don't want you scraping my business and I
                    block you, then I am in that position. I'm automatically
                    in that position because I can implement the blocks
                    necessary to uphold the terms of use of my business, or
                    I can just do it for arbitrary reasons.  Maybe you are
                    hammering my server.  Maybe I'm in a bad mood this morning
                    and don't like that you're using Python.
                    
                    I can unilaterally decide whether or not you use my
                    business, in any way shape or form, even if I just don't
                    like you, as long as I don't violate any laws
                    (discrimination etc).
       
                      malwrar wrote 1 day ago:
                      I absolutely understand, it's just not hard to make
                      scraper traffic appear as (or be) legitimate browser
                      traffic and/or simply distributed across numerous IPs.
                      Other technical controls all have trivial circumvention
                      methods. There is legal precedent (at least in the US)
                      suggesting that scraping public information may be
                      permissible under law (see HiQ Labs v. LinkedIn).
                      Scrapers only ever need to succeed once.
                      
                      Under these circumstances, how can a website operator
                      feel any sense of practical control over scrapers?
       
                        what wrote 18 hours 51 min ago:
                        This is kind of a silly argument. If a physical
                        business trespasses me for shoplifting, I can just put
                        on a disguise and go back and shoplift more. Why do
                        businesses think they have control over shoplifters?
       
                          wolfendin wrote 13 hours 21 min ago:
                          This is kind of a silly argument, for every item you
                          shoplift: do you ask if you can take it without
                          paying and then get granted permission?
       
          brigadier132 wrote 1 day ago:
          > hustle culture types
          
          It seems like you have this imaginary strawman that you hate and it
          seems like that's the foundation of why you dislike this.
       
            blantonl wrote 1 day ago:
            No. The foundation of why I dislike it is simple. If I own some
            data, then I get to dictate the terms of how that data is used.
            Period.
            
            “Hustle culture types” is simply a little anecdote about the
            types that would look you in the eye and tell you they are entitled
            to disregard what I said above. They’ll usually wrap it in some
            altruistic bs to justify as well.
       
              some1else wrote 1 day ago:
              Serving HTML will get you scraped. Your terms don't overrule fair
              use.
       
              Dah00n wrote 1 day ago:
              >If I own some data, then I get to dictate the terms of how that
              data is used. Period.
              
              What if you got that data from me/users and I/we claim the same
              rights (like GDPR for example)? Will you still honour ownership
              as above?
       
              throwaway11460 wrote 1 day ago:
              Why do you put it on the open internet if you don't want machines
              to find and read it?
              
              ToS is nice but you can't expect that it applies - the user (of
              the machine doing the scraping) might be a child which makes the
              potential contract automatically void, for example. Also, there
              are people under jurisdictions where such things have no power,
              or that don't recognize your rights to the data.
              
              And the whole thing of putting data out publicly and then just
              expecting machines to see the pile of data and go "oh so where do
              I sign the ToS?" is weird...
              
              Just put it behind a rate limited API key...
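
The "rate limited API key" idea above can be sketched roughly as follows. This is a minimal illustration, not anything from the thread: the names `TokenBucket` and `RateLimitedGate`, the token-bucket technique, and the choice of status codes (403 for an unknown key, 429 when the limit is hit) are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Hypothetical per-key bucket: holds up to `capacity` tokens, refilled over time."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = float(capacity)
        self.refill_per_sec = float(refill_per_sec)
        self.tokens = float(capacity)  # start full
        self.updated = None            # time of the last check, if any

    def allow(self, now):
        # Refill according to the time elapsed since the previous check.
        if self.updated is not None:
            elapsed = max(0.0, now - self.updated)
            self.tokens = min(self.capacity,
                              self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        # Spend one token per request, or refuse if none is left.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimitedGate:
    """Hypothetical gate mapping an API key to an HTTP-style status code."""

    def __init__(self, valid_keys, capacity=3, refill_per_sec=1.0):
        self.valid_keys = set(valid_keys)
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets = {}

    def check(self, api_key, now=None):
        now = time.monotonic() if now is None else now
        if api_key not in self.valid_keys:
            return 403  # unknown key: refuse outright
        bucket = self.buckets.setdefault(
            api_key, TokenBucket(self.capacity, self.refill_per_sec))
        return 200 if bucket.allow(now) else 429  # 429: too many requests


gate = RateLimitedGate({"demo-key"}, capacity=2, refill_per_sec=1.0)
statuses = [gate.check("demo-key", now=0.0) for _ in range(3)]
```

With a gate like this the server itself says "no" in a way a client cannot miss, instead of relying on a ToS page that nobody was asked to read.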
       
                Starman_Jones wrote 1 day ago:
                As an analogy, imagine that a gardener builds a beautiful
                flower garden, bisected by a cute stone path, which she invites
                the public to view freely, save for a single restriction; a
                sign reading "keep off the flower beds."
                
                There is a well-understood social contract here. I should not
                drive my car along the path, even if I don't crush the flowers. I
                shouldn't walk on the flower beds, even if that sign isn't
                legally enforceable. And if a runaway lawnmower, RC car, or
                some other machine of mine does end up in the garden, I am
                responsible, because it was my machine.
                
                With websites, there is even a TOS specifically for scrapers -
                robots.txt. The fact that it is easy to bypass or ignore is no
                excuse for actually bypassing or ignoring it.
                
                The anonymity of the Internet functions as a ring of Gyges,
                where since people don't face consequences (even social ones),
                they feel entitled to do as they will. However, just because
                you can do something does not mean you have a right to do
                something.
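
For readers who have not met robots.txt, here is a rough sketch of how a scraper could honour it using Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the example.com URLs are made up for illustration; a real scraper would fetch the file from the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this example).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before fetching: is this user agent allowed to request this URL?
allowed = parser.can_fetch("example-bot", "https://example.com/listings/1")
blocked = parser.can_fetch("example-bot", "https://example.com/private/data")

# The site's requested pause between requests, if it declared one.
delay = parser.crawl_delay("example-bot")
```

Nothing forces a scraper to run this check, which is exactly the point under debate: the file is a request, not an enforcement mechanism.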
       
                  photochemsyn wrote 1 day ago:
                  I think this analogy would be improved if the sign said
                  "Please don't take any pictures."  This is far more
                  restrictive than a sign saying "Please don't take any seeds
                  or cuttings."  The latter is more understandable because such
                  activity damages the flower garden (particularly if everyone
                  starts taking seeds and cuttings).
                  
                  Now let's say a photographer visits the flower garden, takes
                  images, and sells them online as post cards?  As long as the
                  photographer is not hindering other people (flooding the site
                  with repeat requests, in the analogy), it doesn't seem to be
                  a problem.
                  
                  On the other hand, let's say we don't have a flower garden,
                  we have an art gallery or a street artist's display - or the
                  pages of a recently published book.  Now the issue is
                  distributing copyrighted material without paying the
                  creator... but what if there's a broad social consensus that
                  copyright is out of control and should have been radically
                  shortened decades ago?
                  
                  The vast majority of data being scraped is not copyrightable
                  creative work, however, so as long as you're not obnoxiously
                  hammering a site, scraping seems perfectly ethical.
       
                  throwaway11460 wrote 1 day ago:
                  Robots.txt is definitely not any kind of ToS - some people
                  (Google) said they will respect it. No reason to expect
                  people even knowing about the concept - practically nobody
                  knows about it, not even most developers.
                  
                  And again - there are countries where any ToS without
                  explicit signature or other kind of legal agreement don't
                  apply at all.
                  
                  Just like writing "by using the toilet you agree to transfer
                  your soul for infinity" on a piece of toilet paper taped
                  somewhere in the vicinity of a toilet gives you nothing -
                  even if it was a more reasonable contract, nobody agreed to
                  anything.
                  
                  As for your other point, I think this is more like standing
                  next to a highway with a sign that reads "don't drive cars
                  here" and expecting people to stop and turn around. They
                  didn't even see your sign at their speed and it's kinda
                  unreasonable to expect they would be checking for that kind
                  of a sign on a highway. At least make it properly - big, red,
                  reflective (e.g. a Connection Reset, or at least 403
                  Forbidden).
       
                    anigbrowl wrote 19 hours 49 min ago:
                    > Robots.txt is definitely not any kind of ToS - some people
                    > (Google) said they will respect it. No reason to expect
                    > people even knowing about the concept - practically nobody
                    > knows about it, not even most developers.
                    
                    Oh that's bullshit, how do you expect to be taken seriously
                    with such nonsense?
       
                      throwaway11460 wrote 13 hours 58 min ago:
                      Is it? Just ask around. I have web app devs around me,
                      they don't know it. Only those who actually specialize in
                      web sites (for presentation) do.
       
                    Starman_Jones wrote 1 day ago:
                    Yes, there is no legal enforcement mechanism behind
                    robots.txt. Nor do I particularly want there to be.
                    However, most people agree that reasonable requests made
                    regarding the use of someone's property should be followed.
                    The capability to do something without consequences is not
                    the same as the right to do something.
                    
                    Our gardener should not need to build a brick wall around
                    their public garden to keep your lawnmower out.
       
                blantonl wrote 1 day ago:
                What makes you think putting data on the Internet all of a
                sudden means I unilaterally surrender the rights to my
                intellectual property?
                
                If I choose to make my data available to some businesses to
                make discovery of it easier, and I choose to decline to allow
                others to unilaterally copy my data to develop a different
                business, that's my right.  And it is unethical and
                unreasonable for any other person to assume otherwise that they
                are entitled to the same rights I granted someone else.
                
                If I own some data, I get to be the arbiter of the
                who/what/when/where on the use of the data.  Period.
       
                  layer8 wrote 1 day ago:
                  Scraping doesn’t imply IP violation.
       
                  xyzzyman wrote 1 day ago:
                  > What makes you think putting data on the Internet all of a
                  sudden means I unilaterally surrender the rights to my
                  intellectual property?
                  
                  Because intellectual property doesn't exist.
       
                  throwaway11460 wrote 1 day ago:
                  Sure, you can do whatever you like. Cut the connection if you
                  don't like it. But I can do whatever I like too - read the
                  data that your machine sent me, for example. If your machine
                  sends my machine data it's IMHO reasonable to expect that you
                  don't care about me having it unless we agreed otherwise. But
                  in many countries ToS is not considered a legal contract at
                  all - just having it on your site somewhere is not enough.
                  Sometimes not even having users check the ToS checkmark would
                  form a valid contract.
                  
                  There are many kinds of data that can't be owned at all.
                  Actually it's the other way around - there is a very small
                  subset of data that can be owned. You can try to cover it
                  under some kind of a non-disclosure clause in a contract, but
                  again - a contract would have to exist.
       
                    blantonl wrote 1 day ago:
                    Look, you are trying to argue that you might want to take
                    some data from me and use it in a personal, non-commercial
                    sense. Cool.
                    
                    The entire purpose of the OP article is to develop a system
                    to directly circumvent data access and protection
                    mechanisms for profit. Pure and simple.
                    
                    Spare me the altruistic BS.  No one is developing and
                    utilizing a cluster of freaking distributed servers with
                    forty 4G modems to do anything other than steal data from
                    services that don't want their data stolen, so they can use
                    it for profit.
                    
                    You have to call a spade a spade here.
       
                      throwaway11460 wrote 1 day ago:
                      What I'm saying is - your machine is fully capable of
                      providing just the right amount of data to fulfill your
                      purposes. If you don't like people taking it all, don't
                      build a machine that gives it to them at 1 Gb/s. Stuff
                      about some ToS or rights or IP ownership is just noise.
       
          flir wrote 1 day ago:
          There's a lot of local history locked up in facebook's nostalgia
          groups. I want to archive it in an open format.
          
          I want to grab new rental listings and put them in an RSS feed, so I
          only look at each one once.
          
          That's my uses for data scraping right now. If that destroys
          someone's business, I don't actually care. Maybe it's selfish, but my
          right to re-format data for my own convenience outweighs their right
          to make a profit.
       
            blantonl wrote 1 day ago:
            > If that destroys someone's business, I don't actually care. Maybe
            > it's selfish, but my right to re-format data for my own convenience
            > outweighs their right to make a profit.
            
            Exhibit A
       
              flir wrote 1 day ago:
              Yeah, it's as unsympathetic a framing of my position as I can
              offer.
              
              But it's basically the same question as adblockers: Can I do what
              I want with the 1's and 0's on my own machine?
              
              I'm not going to accept that I owe anyone a business model.
       
                blantonl wrote 1 day ago:
                I'm not going to disagree with your use case here.
                
                But I'm going to assume that you have some level of a conscience
                and you don't really mean you could give 3 shits about someone
                else's hard work so you can have some satisfaction at home. 
                Because at face value that's exactly what you said.
       
                  flir wrote 5 hours 14 min ago:
                  BTW, kudos for presenting your point of view in a hostile
                  forum and holding your own. I should have said that up front.
       
                  flir wrote 1 day ago:
                  No, I think that's fair. Unsympathetic framing, but not
                  inaccurate. It's that whole "information wants to be free"
                  thing.
       
            throwaway11460 wrote 1 day ago:
            Not that I think you shouldn't do it or you're doing something
            wrong, but describing it as a right rubs me the wrong way. You
            don't have any right to expect someone else's computers to work for
            you.
       
              flir wrote 1 day ago:
              I'm not sure how to phrase it except in terms of competing
              rights, but I take your point.
              
              At the point where I'm scraping, the data's on my computer
              though.
       
                solarkraft wrote 1 day ago:
                You could call them interests.
                
                It's often in a business's interest to format data in a
                specific way to make money, for example interlacing it with
                ads.
       
                  flir wrote 1 day ago:
                  Nice.
       
          vouaobrasil wrote 1 day ago:
          I absolutely agree. In fact, I think the problem is that like
          everything, there is an optimal point for efficiency, and crossing
          that line by making things "too easy" when it comes to data means too
          much power for one person to handle ethically. Absolute power may
          corrupt absolutely, but near-absolute power also corrupts quite
          nicely, too.
          
          In short, we should have limits to the amount of scraping possible,
          simply because humans can never be trusted past a certain point to
          remain ethical. After all, ethics at its first approximation is only
          a mechanism to improve societal cohesiveness, and it only works as
          long as the person doesn't have enough power to "do away" with
          society.
       
            jumby wrote 1 day ago:
            Would you make the same argument of the inverse: data gathering?
       
       