Size may not be the point here. For a study which intends to 'say something about OA', meaningful results would seem to require a balanced spread of OA titles. The DOAJ, for instance, is not balanced. It omits a swathe of titles because it quickly excludes any journal which ceases or temporarily suspends publication, and it also wipes that journal's tables of contents as hosted on the DOAJ.
However, if the data mining is merely intended to 'learn how to do some data mining', then yes, I guess the DOAJ could be useful. But a better source might be Common Crawl, which as of September 2017 includes nearly all university domains. One might pick random PDFs from university repository domains, keeping those which contain keywords indicating they are from a journal. One would then remove the articles from predatory journals, to provide a clean and balanced set of OA articles. See: https://commoncrawl.org/2017/09/september-2017-crawl-archive-now-available/
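
To make the selection steps a little more concrete, here is a rough Python sketch of the filtering and sampling only. It does not query Common Crawl itself. It assumes you have already pulled out candidate records as dictionaries with `url`, `text` and `publisher` fields, which is my own hypothetical layout. The keyword list, the academic domain suffixes and the predatory-publisher list are likewise placeholders for illustration, not real or complete lists. Note that it filters first and samples afterwards, which is a simplification of the order described above.

```python
import random
from urllib.parse import urlparse

# Placeholder values for illustration only; a real study would need
# carefully curated lists here.
JOURNAL_KEYWORDS = ("journal", "issn", "vol.", "doi:")
ACADEMIC_SUFFIXES = (".edu", ".ac.uk")
PREDATORY_PUBLISHERS = {"example predatory press"}


def is_university_pdf(url):
    """True if the URL is a PDF hosted on an academic-looking domain."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return host.endswith(ACADEMIC_SUFFIXES) and parsed.path.lower().endswith(".pdf")


def looks_like_journal_article(text):
    """True if the extracted text contains any of the journal keywords."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in JOURNAL_KEYWORDS)


def is_predatory(publisher):
    """True if the publisher name is on the (placeholder) blocklist."""
    return publisher.strip().lower() in PREDATORY_PUBLISHERS


def sample_articles(records, k, seed=0):
    """Filter the records, then draw up to k of them at random."""
    candidates = [
        r for r in records
        if is_university_pdf(r["url"])
        and looks_like_journal_article(r["text"])
        and not is_predatory(r["publisher"])
    ]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(candidates, min(k, len(candidates)))
```

Keyword matching of this kind is crude and would let through theses and reports which merely mention a journal, so the sample would still need checking by hand.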