The data of the EMA database was collected using a web crawler. Text mining was applied to the EMA dataset to extract the most important keywords. One of the biggest challenges was the data quality contained in the FDA database.
One of the biggest challenges was the quality of the FDA database. The data needed a lot of cleaning and wrangling to be usable in the dashboard application.
Additionally, the databases of the FDA and EMA had a very different level of granularity. The two datasets were consolidated into one, using filters, sorting and mapping algorithms to facilitate their integration into an application in Power BI.
We also determined the market attractiveness and assumed that active ingredients get more and more attractive the less generics exist in a specific therapeutic area – the lower the ratio, the higher the attractiveness of the particular generic.